Unicode problem in ucs4

Mon Mar 23 09:05:44 EDT 2009

On Mar 23, 4:57 pm, abhi <abhigyan_agra... at in.ibm.com> wrote:
> On Mar 23, 4:37 pm, "M.-A. Lemburg" <m... at egenix.com> wrote:
>
> > On 2009-03-23 11:50, abhi wrote:
>
> > > On Mar 23, 3:04 pm, "M.-A. Lemburg" <m... at egenix.com> wrote:
> > > Thanks Marc, John,
> > >          With your help, I am at least getting somewhere. I rewrote
> > > the code to compare the Py_UNICODE and wchar_t outputs, and they look
> > > exactly the same.
>
> > > #include<Python.h>
>
> > > static PyObject *unicode_helper(PyObject *self,PyObject *args){
> > >    const char *name;
> > >    PyObject *sampleObj = NULL;
> > >    Py_UNICODE *sample = NULL;
> > >    wchar_t * w=NULL;
> > >    int size = 0;
> > >    int i;
>
> > >       if (!PyArg_ParseTuple(args, "O", &sampleObj)){
> > >                 return NULL;
> > >       }
>
> > >         // Explicitly convert it to unicode and get Py_UNICODE value
> > >         sampleObj = PyUnicode_FromObject(sampleObj);
> > >         sample = PyUnicode_AS_UNICODE(sampleObj);
> > >         printf("size of sampleObj is : %d\n",
> > >                (int)PyUnicode_GET_SIZE(sampleObj));
> > >         w = (wchar_t *) malloc((PyUnicode_GET_SIZE(sampleObj)+1)
> > >                                *sizeof(wchar_t));
> > >         // The third argument is a count of wchar_t elements, not bytes
> > >         size = PyUnicode_AsWideChar((PyUnicodeObject *)sampleObj, w,
> > >                                     PyUnicode_GET_SIZE(sampleObj)+1);
> > >         printf("%d chars are copied to w\n",size);
> > >         printf("size of wchar_t is : %d\n", (int)sizeof(wchar_t));
> > >         printf("size of Py_UNICODE is: %d\n",(int)sizeof(Py_UNICODE));
> > >    for(i=0;i<PyUnicode_GET_SIZE(sampleObj);i++){
> > >            printf("sample is : %c\n",sample[i]);
> > >            printf("w is : %c\n",w[i]);
> > >    }
> > >    free(w);   // release the temporary wchar_t buffer
> > >    return sampleObj;
> > > }
>
> > > static PyMethodDef funcs[]={{"unicodeTest",(PyCFunction)
> > > unicode_helper,METH_VARARGS,"test ucs2, ucs4"},{NULL}};
>
> > > void initunicodeTest(void){
> > >    Py_InitModule3("unicodeTest",funcs,"");
> > > }
>
> > > This gives the following output when I pass "abc" as input:
>
> > > size of sampleObj is : 3
> > > 3 chars are copied to w
> > > size of wchar_t is : 4
> > > size of Py_UNICODE is: 4
> > > sample is : a
> > > w is : a
> > > sample is : b
> > > w is : b
> > > sample is : c
> > > w is : c
>
> > > So, both Py_UNICODE and wchar_t are 4 bytes, and since each character
> > > is followed by three \0 bytes, printf or wprintf prints only one
> > > letter.
> > > I need to process the data further, and the libraries I use need the
> > > data in UCS2 format (2 bytes per character); otherwise they fail. Is
> > > there any way to force wchar_t to be 2 bytes, or can I convert this
> > > UCS4 data to UCS2 explicitly?
>
> > Sure: just use the appropriate UTF-16 codec for this.
>
> > /* Generic codec based encoding API.
>
> >    object is passed through the encoder function found for the given
> >    encoding using the error handling method defined by errors. errors
> >    may be NULL to use the default method defined for the codec.
>
> >    Raises a LookupError in case no encoder can be found.
>
> >  */
>
> > PyAPI_FUNC(PyObject *) PyCodec_Encode(
> >        PyObject *object,
> >        const char *encoding,
> >        const char *errors
> >        );
>
> > encoding needs to be set to 'utf-16-le' for little endian, 'utf-16-be'
> > for big endian.
>
> > --
> > Marc-Andre Lemburg
> > eGenix.com
>
> Thanks, but this returns a PyObject *, whereas I need the value in a
> variable that can be printed using wprintf(), like a wchar_t array (with
> 2-byte elements). If I convert this PyObject back to wchar_t or
> PyUnicode, I am back where I started. :)
>
> -
> Abhigyan

Hi Marc,
       Is there any way to ensure that the size of wchar_t is always 2
instead of 4 in a UCS4-configured Python? Googling gave me the
impression that PyUnicode_AsWideChar() contains some logic that handles
the UCS4-to-UCS2 conversion when the sizes of Py_UNICODE and wchar_t
differ.
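
If nothing built in does this, I suppose I could do the conversion by
hand. Here is a rough sketch of what I have in mind. ucs4_to_utf16 is a
name I made up (it is not part of the Python C API), and I am assuming
the input is an array of 32-bit code points, which is what the
Py_UNICODE buffer holds on this build. For characters below U+10000 the
UTF-16 output is identical to UCS2; anything above that becomes a
surrogate pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API.
   Converts n UCS4 code points from `in` into UTF-16 code units in `out`,
   which has room for out_cap units. Returns the number of units written,
   or (size_t)-1 if `out` is too small or a code point is invalid. */
size_t ucs4_to_utf16(const uint32_t *in, size_t n,
                     uint16_t *out, size_t out_cap)
{
    size_t i, j = 0;

    assert(in != NULL && out != NULL);

    for (i = 0; i < n; i++) {
        uint32_t c = in[i];

        if (c < 0x10000) {
            /* A lone surrogate is not a valid character */
            if (c >= 0xD800 && c <= 0xDFFF)
                return (size_t)-1;
            if (j + 1 > out_cap)
                return (size_t)-1;
            out[j++] = (uint16_t)c;          /* BMP: copy unchanged */
        } else if (c <= 0x10FFFF) {
            if (j + 2 > out_cap)
                return (size_t)-1;
            c -= 0x10000;                     /* 20 bits remain */
            out[j++] = (uint16_t)(0xD800 | (c >> 10));    /* high surrogate */
            out[j++] = (uint16_t)(0xDC00 | (c & 0x3FF));  /* low surrogate */
        } else {
            return (size_t)-1;                /* beyond Unicode range */
        }
    }
    return j;
}
```

In the worst case the output buffer needs 2*n elements. The result is not
NUL-terminated, so I would have to append a 0 myself before passing it to
anything that expects a terminated string.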

-
Abhigyan