Python Unicode handling wins again -- mostly

MRAB python at mrabarnett.plus.com
Mon Dec 2 16:27:23 EST 2013


On 02/12/2013 21:14, Ned Batchelder wrote:
> On 12/2/13 3:38 PM, Ethan Furman wrote:
>> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>>>
>>> Out of the nine tests, Python 3.3 passes six, with three tests being
>>> failures or dubious. If you believe that the native string type should
>>> operate on code-points, then you'll think that Python does the right
>>> thing.
>>
>> I think Python is doing it correctly.  If I want to operate on
>> "clusters" I'll normalize the string first.
>>
>> Thanks for this excellent post.
>>
>> --
>> ~Ethan~
>
> This is where my knowledge about Unicode gets fuzzy.  Isn't it the case
> that some grapheme clusters (or whatever the right word is) can't be
> normalized down to a single code point?  Characters can accept many
> accents, for example.  In that case, you can't always normalize and use
> the existing string methods, but would need more specialized code.
>
A better way of saying it is that there are codepoints for some grapheme
clusters. Those 'precomposed' codepoints exist because some legacy
character sets contained them, and having a one-to-one mapping
encouraged Unicode's adoption.



More information about the Python-list mailing list