Unicode error handler

"Martin v. Löwis" martin at v.loewis.de
Wed Jan 31 02:48:58 EST 2007


Walter Dörwald wrote:
> You might try the following:
> 
> # -*- coding: iso-8859-1 -*-
> 
> import unicodedata, codecs
> 
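> def transliterate(exc):
> 	# only encoding errors can be handled here
> 	if not isinstance(exc, UnicodeEncodeError):
> 		raise TypeError("don't know how to handle %r" % exc)
> 	# keep the first character of the NFD form, resume after the bad one
> 	return (unicodedata.normalize("NFD", exc.object[exc.start])[:1],
> 		exc.start+1)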

I think a number of special cases need to be studied here.
I would expect that this is "semantically correct" if the characters
being dropped are combining characters (at least in the languages I'm
familiar with, it is common to drop them for transliteration).
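
A minimal sketch of a handler restricted to that safe case, assuming
Python 3 syntax (the code in this thread is Python 2). The handler name
"strip_marks" is hypothetical and chosen only for illustration. It
substitutes the base character only when every dropped character has a
non-zero canonical combining class, and re-raises the error otherwise:

```python
import codecs
import unicodedata

def strip_marks(exc):
    """Replace an unencodable character by its NFD base character,
    but only if everything that gets dropped is a combining mark."""
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("don't know how to handle %r" % exc)
    decomposed = unicodedata.normalize("NFD", exc.object[exc.start])
    base, rest = decomposed[:1], decomposed[1:]
    # No decomposition, or something other than a combining mark
    # would be lost: give up rather than guess.
    if not rest or not all(unicodedata.combining(c) for c in rest):
        raise exc
    return (base, exc.start + 1)

# "strip_marks" is an illustrative name, not an existing handler.
codecs.register_error("strip_marks", strip_marks)

print("café".encode("ascii", "strip_marks"))
```

With this variant a Hangul syllable such as u"\uac00" still raises
UnicodeEncodeError instead of being silently truncated.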

However, if you do

py> for i in range(65536):
...   c = unicodedata.normalize("NFD", unichr(i))
...   for c2 in c[1:]:
...     if not unicodedata.combining(c2): print hex(i),;break

you'll see that there are many characters which don't decompose
into a base character + sequence of combining characters. In
particular, this involves all hangul syllables (U+AC00..U+D7A3),
for which it is just incorrect to drop the "jungseongs"
(is that proper wording?).
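
To make the Hangul case concrete, here is a small sketch (again
assuming Python 3 syntax) that decomposes a single syllable and prints
each resulting jamo with its canonical combining class. The syllable
U+D55C is just an arbitrary example:

```python
import unicodedata

syllable = "\ud55c"  # HANGUL SYLLABLE HAN, picked arbitrarily
jamo = unicodedata.normalize("NFD", syllable)

# Each jamo has combining class 0, so none of them is a
# "combining character" in the sense tested by the loop above.
for ch in jamo:
    print(hex(ord(ch)), unicodedata.name(ch), unicodedata.combining(ch))
```

Taking only the first character of the decomposition keeps the leading
consonant and discards the vowel and the trailing consonant.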

There are also some cases which I'm completely uncertain about,
e.g. ORIYA VOWEL SIGN AI decomposes to ORIYA VOWEL SIGN E +
ORIYA AI LENGTH MARK. Is it correct to drop the length mark?
It's not listed as a combining character. Likewise,
MYANMAR LETTER UU decomposes to MYANMAR LETTER U +
MYANMAR VOWEL SIGN II; same question here.
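
The two uncertain cases can be inspected the same way. This is only a
sketch in Python 3 syntax; it shows what the Unicode database reports
and does not settle whether dropping the second character is
linguistically acceptable:

```python
import unicodedata

# ORIYA VOWEL SIGN AI and MYANMAR LETTER UU
for ch in ("\u0b48", "\u1026"):
    parts = unicodedata.normalize("NFD", ch)
    print(unicodedata.name(ch), "->")
    for p in parts:
        # combining() is the canonical combining class,
        # category() is the general category (e.g. Lo, Mn, Mc)
        print("   ", unicodedata.name(p),
              unicodedata.combining(p), unicodedata.category(p))
```

Note that a character can have a mark category while its canonical
combining class is 0, which is why unicodedata.combining() alone does
not flag it.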

Regards,
Martin


