[Python-Dev] PEP 393: Flexible String Representation

Fri Jan 28 22:49:08 CET 2011

> The nice thing about Py_UNICODE is that it basically gives you native
> Unicode code points directly, without needing to decode UTF-8 byte runs
> and the like. In Cython, it allows you to do things like this:
> 
>     def test_for_those_characters(unicode s):
>         for c in s:
>             # warning: randomly chosen Unicode escapes ahead
>             if c in u"\u0356\u1012\u3359\u4567":
>                 return True
>         else:
>             return False
> 
> The loop runs in plain C, using the somewhat obvious implementation with
> a loop over Py_UNICODE characters and a switch statement for the
> comparison. This would look a *lot* more ugly with UTF-8 encoded byte
> strings.

And indeed, when Cython is updated for Python 3.3, it shouldn't access the
UTF-8 representation for such a loop. Instead, it should access the str
representation, and might compile this to code like

#define Cython_CharAt(data, kind, pos) ((kind)==LATIN1 ? \
             ((unsigned char*)(data))[pos] : (kind)==UCS2 ? \
             ((unsigned short*)(data))[pos] : \
             ((Py_UCS4*)(data))[pos])

     void *data = PyUnicode_Data(s);
     int kind = PyUnicode_Kind(s);
     for(int pos=0; pos < PyUnicode_Size(s); pos++){
       Py_UCS4 c = Cython_CharAt(data, kind, pos);
       Py_UCS4 tmp[] = {0x356, 0x1012, 0x3359, 0x4567};
       for (int k=0; k<4; k++)
         if (c == tmp[k])
              return 1;
     }
     return 0;
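
To make the idea concrete without depending on any Python headers, here is
a rough, self-contained sketch of the same kind-dispatched loop over a plain
buffer. It is only an illustration under stated assumptions: the names
sketch_kind, sketch_char_at and sketch_contains_any are made up for this
example and are not part of CPython, Cython, or the PEP, and the fixed-width
integer types merely stand in for the three storage widths.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the three storage widths (not CPython names). */
enum sketch_kind { SKETCH_1BYTE = 1, SKETCH_2BYTE = 2, SKETCH_4BYTE = 4 };

/* Read the code point at index pos from a buffer stored with the given width. */
static uint32_t sketch_char_at(const void *data, enum sketch_kind kind, size_t pos)
{
    switch (kind) {
    case SKETCH_1BYTE: return ((const uint8_t *)data)[pos];
    case SKETCH_2BYTE: return ((const uint16_t *)data)[pos];
    default:           return ((const uint32_t *)data)[pos];
    }
}

/* Return 1 if any code point in the buffer is one of the four targets, else 0. */
static int sketch_contains_any(const void *data, enum sketch_kind kind, size_t len)
{
    static const uint32_t targets[] = {0x356, 0x1012, 0x3359, 0x4567};
    for (size_t pos = 0; pos < len; pos++) {
        uint32_t c = sketch_char_at(data, kind, pos);
        for (size_t k = 0; k < sizeof targets / sizeof targets[0]; k++)
            if (c == targets[k])
                return 1;
    }
    return 0;
}
```

The caller only ever sees a 32-bit code point, regardless of how the buffer
is stored; the width is looked up once per character, and the comparison
logic is the same for all three layouts.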

> Regarding Cython specifically, the above will still be *possible* under
> the proposal, given that the memory layout of the strings will still
> represent the Unicode code points. It will just be trickier to implement
> in Cython's type system as there is no longer a (user visible) C type
> representation for those code units.

There is: Py_UCS4 remains available.

> It can be any of uchar, ushort16 or
> uint32, none of which is necessarily a 'native' representation of a
> Unicode character in CPython.

There won't be a "native" representation anymore - that's the whole
point of the PEP.

> While I'm somewhat confident that I'll
> find a way to fix this in Cython, my point is just that this adds a
> certain level of complexity to C code using the new memory layout that
> simply wasn't there before.

Understood. However, I think it is easier than you think it is.

Regards,
Martin