Unicode error handler

Walter Dörwald walter at livinglogic.de
Wed Jan 31 04:15:25 EST 2007


Martin v. Löwis wrote:

> Walter Dörwald schrieb:
>> You might try the following:
>>
>> # -*- coding: iso-8859-1 -*-
>>
>> import unicodedata, codecs
>>
>> def transliterate(exc):
>> 	if not isinstance(exc, UnicodeEncodeError):
>> 		raise TypeError("don't know how to handle %r" % exc)
>> 	return (unicodedata.normalize("NFD", exc.object[exc.start])[:1], exc.start+1)
> 
> I think a number of special cases need to be studied here.
> I would expect that this is "semantically correct" if the characters
> being dropped are combining characters (at least in the languages I'm
> familiar with, it is common to drop them for transliteration).

True, it might make sense to limit the error handler to handling Latin
characters.
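
Something along these lines might work (an untested sketch; the handler
name "latintranslit" and the check against the character's Unicode name
are my own additions, not anything from the codecs machinery itself):

import unicodedata, codecs

def latin_transliterate(exc):
	# Sketch: strip combining marks only when the base character is a
	# Latin letter; re-raise for everything else, i.e. behave like
	# "strict" there.
	if not isinstance(exc, UnicodeEncodeError):
		raise TypeError("don't know how to handle %r" % exc)
	base = unicodedata.normalize("NFD", exc.object[exc.start])[:1]
	if "LATIN" in unicodedata.name(base, ""):
		return (base, exc.start+1)
	raise exc

codecs.register_error("latintranslit", latin_transliterate)

print u"D\xf6rwald".encode("ascii", "latintranslit") # -> Dorwald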

> However, if you do
> 
> py> for i in range(65536):
> ...   c = unicodedata.normalize("NFD", unichr(i))
> ...   for c2 in c[1:]:
> ...     if not unicodedata.combining(c2): print hex(i),;break
> 
> you'll see that there are many characters which don't decompose
> into a base character + sequence of combining characters. In
> particular, this involves all hangul syllables (U+AC00..U+D7A3),
> for which it is just incorrect to drop the "jungseongs"
> (is that proper wording?).

Of course the above error handler only makes sense when the decomposed
codepoints are encodable in the target encoding. For your Hangul example
neither u"\uac00" nor the decomposed version u"\u1100\u1161" is encodable.
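
One could guard against that inside the handler itself, e.g. (again just
a sketch; the "safetranslit" name and the "?" fallback are arbitrary
choices of mine) by checking whether the base character is encodable in
exc.encoding and falling back otherwise:

import unicodedata, codecs

def safe_transliterate(exc):
	# Sketch: return the decomposed base character only if it can
	# actually be encoded in the target encoding; otherwise fall back
	# to "?", roughly like the "replace" handler does.
	if not isinstance(exc, UnicodeEncodeError):
		raise TypeError("don't know how to handle %r" % exc)
	base = unicodedata.normalize("NFD", exc.object[exc.start])[:1]
	try:
		base.encode(exc.encoding)
	except UnicodeEncodeError:
		base = u"?"
	return (base, exc.start+1)

codecs.register_error("safetranslit", safe_transliterate)

print u"caf\xe9 \uac00".encode("ascii", "safetranslit") # -> "cafe ?"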

> There are also some cases which I'm completely uncertain about,
> e.g. ORIYA VOWEL SIGN AI decomposes to ORIYA VOWEL SIGN E +
> ORIYA AI LENGTH MARK. Is it correct to drop the length mark?
> It's not listed as a combining character. Likewise,
> MYANMAR LETTER UU decomposes to MYANMAR LETTER U +
> MYANMAR VOWEL SIGN II; same question here.

Servus,
    Walter
