Re: encoding problems (é and è)

Jean-Paul Calderone exarkun at divmod.com
Thu Mar 23 22:19:15 EST 2006


On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <sjmachin at lexicon.net> wrote:
>On 24/03/2006 8:36 AM, Peter Otten wrote:
>> John Machin wrote:
>>
>>>You can replace ALL of this upshifting and accent removal in one blow by
>>>using the string translate() method with a suitable table.
>>
>> Only if you convert to unicode first or if your data maintains 1 byte == 1
>> character, in particular it is not UTF-8.
>>
>
>I'm sorry, I forgot that there were people who are unaware that
>variable-length gizmos like UTF-8 and various legacy CJK encodings are
>for storage & transmission, and are better changed to a
>one-character-per-storage-unit representation before *ANY* data
>processing is attempted.

Unfortunately, Unicode only appears to solve this problem in a sane manner.  Most people conveniently forget (or never learn in the first place) about combining sequences and denormalized forms.  Consider u'e\u0301', u'U\u0301', or u'C\u0327': each is two code points that render as a single character.  These difficulties can be mitigated to some degree via normalization (see unicodedata.normalize), but this step is often forgotten and, for things like u'\u0565\u0582' (the two letters that U+0587 ARMENIAN SMALL LIGATURE ECH YIWN decomposes to), it does not even work: no normalization form composes them back into a single code point.
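
To make that concrete, here is a rough sketch of what I mean, assuming a Python where str is Unicode (3.3 or later, so the u'' prefix is still accepted). The variable names are just mine for illustration. It composes the three combining sequences with NFC and then checks that the Armenian pair is left alone by every normalization form:

```python
import unicodedata

# Combining sequences: a base letter followed by a combining mark.
samples = [u'e\u0301', u'U\u0301', u'C\u0327']

# NFC composes each two-code-point sequence into one precomposed character.
composed = [unicodedata.normalize('NFC', s) for s in samples]
for before, after in zip(samples, composed):
    print(len(before), len(after), unicodedata.name(after))

# The Armenian letters ECH + YIWN, and the ligature U+0587 that
# decomposes to them (a compatibility decomposition only).
pair = u'\u0565\u0582'
ligature = u'\u0587'

# No normalization form turns the pair into the ligature.
unchanged = all(
    unicodedata.normalize(form, pair) == pair
    for form in ('NFC', 'NFD', 'NFKC', 'NFKD')
)
print(unchanged)

# Going the other way works, but only with the compatibility forms.
print(unicodedata.normalize('NFKD', ligature) == pair)
print(unicodedata.normalize('NFC', ligature) == ligature)
```

So a translate() table keyed on single characters only sees what you expect if the input was normalized first, and even then not for cases like the last one.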

>
>:-)
>Unicode? I'm just a benighted Anglo from the a**-end of the globe; who
>am I to be preaching Unicode to a European?
>(-:

Heh ;P  Same here.  And I don't really claim to understand all this stuff, I just know enough to know it's really hard to do anything correctly. ;)

Jean-Paul


