[Python-Dev] PEP 393: Flexible String Representation
"Martin v. Löwis"
martin at v.loewis.de
Fri Jan 28 22:49:08 CET 2011
> The nice thing about Py_UNICODE is that it basically gives you native
> Unicode code points directly, without needing to decode UTF-8 byte runs
> and the like. In Cython, it allows you to do things like this:
>
>     def test_for_those_characters(unicode s):
>         for c in s:
>             # warning: randomly chosen Unicode escapes ahead
>             if c in u"\u0356\u1012\u3359\u4567":
>                 return True
>         else:
>             return False
>
> The loop runs in plain C, using the somewhat obvious implementation with
> a loop over Py_UNICODE characters and a switch statement for the
> comparison. This would look a *lot* more ugly with UTF-8 encoded byte
> strings.
And indeed, when Cython is updated for Python 3.3, it shouldn't access the UTF-8
representation for such a loop. Instead, it should access the str
representation, and might compile this to code like
    /* Read one code point, whatever width the string stores it in. */
    #define Cython_CharAt(data, kind, pos) \
        ((kind) == LATIN1 ? ((unsigned char *)(data))[pos] : \
         (kind) == UCS2 ? ((unsigned short *)(data))[pos] : \
         ((Py_UCS4 *)(data))[pos])

    void *data = PyUnicode_Data(s);
    int kind = PyUnicode_Kind(s);
    for (int pos = 0; pos < PyUnicode_Size(s); pos++) {
        Py_UCS4 c = Cython_CharAt(data, kind, pos);
        /* The four code points from the u"..." literal. */
        Py_UCS4 tmp[] = {0x356, 0x1012, 0x3359, 0x4567};
        for (int k = 0; k < 4; k++)
            if (c == tmp[k])
                return 1;
    }
    return 0;
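
For anyone who wants to try the idea without a patched interpreter, here is a minimal standalone sketch of the same kind-dispatched scan. It assumes nothing from CPython: the sketch_* names and the enum values are hypothetical stand-ins for the kind constants and accessors used above, and the buffers are plain C arrays rather than str objects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the storage kinds; not CPython names. */
enum sketch_kind { SKETCH_LATIN1 = 1, SKETCH_UCS2 = 2, SKETCH_UCS4 = 4 };

/* Read the code point at index pos and widen it to 32 bits. */
static uint32_t sketch_char_at(const void *data, enum sketch_kind kind,
                               size_t pos)
{
    switch (kind) {
    case SKETCH_LATIN1: return ((const uint8_t *)data)[pos];
    case SKETCH_UCS2:   return ((const uint16_t *)data)[pos];
    case SKETCH_UCS4:   return ((const uint32_t *)data)[pos];
    }
    assert(0 && "unknown kind");
    return 0;
}

/* Return 1 if any of the first len code points is one of the targets. */
static int sketch_has_target(const void *data, enum sketch_kind kind,
                             size_t len)
{
    static const uint32_t targets[] = {0x356, 0x1012, 0x3359, 0x4567};
    for (size_t pos = 0; pos < len; pos++) {
        uint32_t c = sketch_char_at(data, kind, pos);
        for (size_t k = 0; k < sizeof targets / sizeof targets[0]; k++)
            if (c == targets[k])
                return 1;
    }
    return 0;
}
```

The caller passes the buffer, its kind and its length; only sketch_char_at ever looks at the storage width, so the search loop is the same for all three layouts.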
> Regarding Cython specifically, the above will still be *possible* under
> the proposal, given that the memory layout of the strings will still
> represent the Unicode code points. It will just be trickier to implement
> in Cython's type system as there is no longer a (user visible) C type
> representation for those code units.
There is: Py_UCS4 remains available.
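
To illustrate what that buys you, here is a small sketch in which the comparison is written once against a single 32-bit type and works for values read from any of the three storage widths. The typedef sketch_ucs4 is a hypothetical stand-in for Py_UCS4 (assumed here to be an unsigned 32-bit integer), and sketch_is_target is an illustrative name, not something Cython generates.

```c
#include <stdint.h>

/* Hypothetical stand-in for Py_UCS4: one unsigned 32-bit code point. */
typedef uint32_t sketch_ucs4;

/* The switch-based comparison, written once against the 32-bit type. */
static int sketch_is_target(sketch_ucs4 c)
{
    switch (c) {
    case 0x356:
    case 0x1012:
    case 0x3359:
    case 0x4567:
        return 1;
    default:
        return 0;
    }
}
```

A uint8_t or uint16_t value passed to sketch_is_target is converted to the 32-bit type without loss, so the function itself never needs to know which layout the value came from.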
> It can be any of uchar, ushort16 or
> uint32, none of which is necessarily a 'native' representation of a
> Unicode character in CPython.
There won't be a "native" representation anymore - that's the whole
point of the PEP.
> While I'm somewhat confident that I'll
> find a way to fix this in Cython, my point is just that this adds a
> certain level of complexity to C code using the new memory layout that
> simply wasn't there before.
Understood. However, I think it is easier than you think it is.
Regards,
Martin