algorithm to autodetect (japanese) encodings

gabor gabor at z10n.net
Wed Mar 12 17:28:00 EST 2003


hi,

i'm playing with mp3 tags,
and it's hell to display them correctly, because most of the files have
id3v1 tags, which carry no encoding info.

so you have to guess...

most of the files have standard english names, so no byte is above 127.
some of them use latin1, some utf-8, and some use one of the japanese
encodings (anime soundtracks :-)

i'm trying to write a toUnicode function:
it should do the following:

if all the characters are below 128, simply convert to unicode as latin1
or utf-8 (the result should be the same)

if some chars are above 127, apply some heuristics to separate utf-8
and the 3 japanese encodings (shift-jis, iso-2022-jp, euc-jp).
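
a minimal sketch of that flow, assuming a current python 3 and its built-in
codecs. the names to_unicode and CANDIDATES are made up for illustration, and
the order of the candidates is a guess, not a tested heuristic. one thing to
watch: iso-2022-jp is a 7-bit encoding marked by escape sequences, so it has
to be checked before the "everything is below 128" shortcut.

```python
# sketch only: to_unicode and CANDIDATES are illustrative names,
# not part of any library.

# strict decoders tried in order; the order is an assumption
CANDIDATES = ("utf-8", "euc_jp", "shift_jis")


def to_unicode(raw: bytes) -> str:
    # iso-2022-jp uses only 7-bit bytes plus ESC sequences, so look for
    # the escape prefixes before treating the data as plain ascii
    if b"\x1b$" in raw or b"\x1b(" in raw:
        try:
            return raw.decode("iso2022_jp")
        except UnicodeDecodeError:
            pass  # not valid iso-2022-jp after all, keep going

    # all bytes below 128: ascii, latin1 and utf-8 give the same result
    if all(b < 128 for b in raw):
        return raw.decode("ascii")

    # the first encoding that decodes without error wins
    for enc in CANDIDATES:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue

    # latin1 maps every byte to a character, so it never fails
    return raw.decode("latin-1")
```

this only checks whether the bytes are valid in each encoding. short tags can
be valid in more than one of them (euc-jp and shift-jis overlap, and a lot of
latin1 text also decodes as one of the two), so the first match can be wrong.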

my question:
does anyone have a working algo to pick the correct encoding among the
3 japanese encodings?
i know java does it, but does anyone have python source code?
or something simply-translatable-to-python?

thanks,
gabor


More information about the Python-list mailing list