Unicode problem in ucs4

Mon Mar 23 06:50:08 EDT 2009

On Mar 23, 3:04 pm, "M.-A. Lemburg" <m... at egenix.com> wrote:
> On 2009-03-23 08:18, abhi wrote:
>
>
>
> > On Mar 20, 5:47 pm, "M.-A. Lemburg" <m... at egenix.com> wrote:
> >>> unicodeTest.c
> >>> #include<Python.h>
> >>> static PyObject *unicode_helper(PyObject *self,PyObject *args){
> >>>    PyObject *sampleObj = NULL;
> >>>            Py_UNICODE *sample = NULL;
> >>>       if (!PyArg_ParseTuple(args, "O", &sampleObj)){
> >>>                 return NULL;
> >>>       }
> >>>     // Explicitly convert it to unicode and get Py_UNICODE value
> >>>       sampleObj = PyUnicode_FromObject(sampleObj);
> >>>       sample = PyUnicode_AS_UNICODE(sampleObj);
> >>>       wprintf(L"database value after unicode conversion is : %s\n",
> >>> sample);
> >> You have to use PyUnicode_AsWideChar() to convert a Python
> >> Unicode object to a wchar_t representation.
>
> >> Please don't make any assumptions on what Py_UNICODE maps
> >> to and always use the Unicode API for this. It is designed
> >> to provide a portable interface and will not do more conversion
> >> work than necessary.
>
> > Hi Mark,
> >      Thanks for the help. I tried PyUnicode_AsWideChar() but I am
> > getting the same result i.e. only the first letter.
>
> > sample code:
>
> > #include<Python.h>
>
> > static PyObject *unicode_helper(PyObject *self,PyObject *args){
> >         PyObject *sampleObj = NULL;
> >         wchar_t *sample = NULL;
> >         int size = 0;
>
> >       if (!PyArg_ParseTuple(args, "O", &sampleObj)){
> >                 return NULL;
> >       }
>
> >          // use wide char function
> >       size = PyUnicode_AsWideChar(databaseObj, sample,
> > PyUnicode_GetSize(databaseObj));
>
> The third argument is the buffer size in wchar_t characters, not bytes
> (see the header comment below). The buffer itself needs
> sizeof(wchar_t) * PyUnicode_GetSize(databaseObj) bytes without a
> trailing NUL, otherwise sizeof(wchar_t) *
> (PyUnicode_GetSize(databaseObj) + 1).
>
> You also have to allocate the buffer to store the wchar_t data in.
> Passing in a NULL pointer will result in a seg fault. The function
> does not allocate a buffer for you:
>
> /* Copies the Unicode Object contents into the wchar_t buffer w.  At
>    most size wchar_t characters are copied.
>
>    Note that the resulting wchar_t string may or may not be
>    0-terminated.  It is the responsibility of the caller to make sure
>    that the wchar_t string is 0-terminated in case this is required by
>    the application.
>
>    Returns the number of wchar_t characters copied (excluding a
>    possibly trailing 0-termination character) or -1 in case of an
>    error. */
>
> PyAPI_FUNC(Py_ssize_t) PyUnicode_AsWideChar(
>     PyUnicodeObject *unicode,   /* Unicode object */
>     register wchar_t *w,        /* wchar_t buffer */
>     Py_ssize_t size             /* size of buffer */
>     );
>
>
>
> >       printf("%d chars are copied to sample\n", size);
> >       wprintf(L"database value after unicode conversion is : %s\n",
> > sample);
> >       return Py_BuildValue("");
>
> > }
>
> > static PyMethodDef funcs[]={{"unicodeTest",(PyCFunction)
> > unicode_helper,METH_VARARGS,"test ucs2, ucs4"},{NULL}};
>
> > void initunicodeTest(void){
> >         Py_InitModule3("unicodeTest",funcs,"");
>
> > }
>
> > This prints the following when input value is given as "test":
> > 4 chars are copied to sample
> > database value after unicode conversion is : t
>
> > Any ideas?
>
> > -
> > Abhigyan
> --
> Marc-Andre Lemburg
> eGenix.com

Thanks Marc, John,
         With your help, I am at least getting somewhere. I rewrote the
code to compare the Py_UNICODE and wchar_t outputs, and they both look
exactly the same.

#include <Python.h>

static PyObject *unicode_helper(PyObject *self, PyObject *args) {
    PyObject *sampleObj = NULL;
    Py_UNICODE *sample = NULL;
    wchar_t *w = NULL;
    Py_ssize_t len, size, i;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)) {
        return NULL;
    }

    // Explicitly convert it to unicode (new reference) and get Py_UNICODE value
    sampleObj = PyUnicode_FromObject(sampleObj);
    if (sampleObj == NULL) {
        return NULL;
    }
    sample = PyUnicode_AS_UNICODE(sampleObj);
    len = PyUnicode_GET_SIZE(sampleObj);
    printf("size of sampleObj is : %ld\n", (long)len);

    // Room for len characters plus a trailing NUL
    w = (wchar_t *) malloc((len + 1) * sizeof(wchar_t));
    if (w == NULL) {
        Py_DECREF(sampleObj);
        return PyErr_NoMemory();
    }

    // The size argument counts wchar_t characters, not bytes
    size = PyUnicode_AsWideChar((PyUnicodeObject *)sampleObj, w, len + 1);
    if (size < 0) {
        free(w);
        Py_DECREF(sampleObj);
        return NULL;
    }
    printf("%ld chars are copied to w\n", (long)size);
    printf("size of wchar_t is : %lu\n", (unsigned long)sizeof(wchar_t));
    printf("size of Py_UNICODE is: %lu\n", (unsigned long)sizeof(Py_UNICODE));
    for (i = 0; i < len; i++) {
        printf("sample is : %c\n", (int)sample[i]);
        printf("w is : %c\n", (int)w[i]);
    }
    free(w);
    return sampleObj;
}

static PyMethodDef funcs[] = {
    {"unicodeTest", (PyCFunction)unicode_helper, METH_VARARGS, "test ucs2, ucs4"},
    {NULL}
};

PyMODINIT_FUNC initunicodeTest(void) {
    Py_InitModule3("unicodeTest", funcs, "");
}

This gives the following output when I pass "abc" as input:

size of sampleObj is : 3
3 chars are copied to w
size of wchar_t is : 4
size of Py_UNICODE is: 4
sample is : a
w is : a
sample is : b
w is : b
sample is : c
w is : c
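
With a 4-byte wchar_t as in the output above, a reader that walks the buffer byte by byte (which is what a narrow %s conversion does) should stop almost immediately. Here is a small stand-alone sketch of that effect. It is not part of the extension, and the helper name bytes_before_first_nul is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>   /* wcslen, used for the comparison in the note below */

/* Hypothetical helper: number of bytes a narrow (char-based) reader sees
   before it hits the first zero byte in a wide string. The loop always
   terminates because the wide string ends in an all-zero wchar_t. */
static size_t bytes_before_first_nul(const wchar_t *ws)
{
    const unsigned char *p = (const unsigned char *)ws;
    size_t n = 0;

    assert(ws != NULL);
    while (p[n] != 0) {
        n++;
    }
    return n;
}
```

On a little-endian machine I would expect bytes_before_first_nul(L"abc") to give 1, while wcslen(L"abc") still gives 3. As far as I understand the C standard, the wide-string conversion for both printf and wprintf is %ls, so that is probably the format to use when printing a wchar_t buffer.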

So both Py_UNICODE and wchar_t are 4 bytes wide, and since each ASCII
character is followed by three \0 bytes, printf or wprintf with %s
prints only the first letter.
I need to process the data further, and the libraries that do this need
the data in UCS2 format (2 bytes per character); otherwise they fail. Is
there any way to force wchar_t to be 2 bytes, or can I convert this UCS4
data to UCS2 explicitly?
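
For the explicit conversion, one possible approach is to narrow the 4-byte code points by hand. The sketch below is plain C with no Python dependency. The helper name ucs4_to_utf16 and the use of uint32_t/uint16_t buffers are my own assumptions, not anything from the Python API. It produces UTF-16, which is identical to UCS2 for characters up to U+FFFF and uses surrogate pairs above that, so a library that is strictly UCS2 may still reject the pairs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Python API: converts n UCS-4 code points in
   src to UTF-16 code units in dst. dst must have room for 2 * n units
   (worst case: every code point needs a surrogate pair).
   Returns the number of 16-bit units written, or (size_t)-1 if a code
   point is a lone surrogate or lies above U+10FFFF. */
static size_t ucs4_to_utf16(const uint32_t *src, size_t n, uint16_t *dst)
{
    size_t i, out = 0;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++) {
        uint32_t cp = src[i];

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return (size_t)-1;          /* not a valid code point */
        }
        if (cp < 0x10000) {
            dst[out++] = (uint16_t)cp;  /* BMP: one unit, same as UCS2 */
        } else {
            cp -= 0x10000;              /* 20 bits left to split */
            dst[out++] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
            dst[out++] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
        }
    }
    return out;
}
```

In the extension this would presumably be fed from the Py_UNICODE buffer when sizeof(Py_UNICODE) is 4. The Unicode API also has UTF-16 codec functions such as PyUnicode_AsUTF16String(), which might avoid the hand-written loop, but I have not tried them here.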

-
Abhigyan