Swedish characters in Python strings

Martin v. Löwis loewis at informatik.hu-berlin.de
Tue Oct 15 11:20:56 EDT 2002


Magnus Heino <magnus.heino at pleon.sigma.se> writes:

> > check the locale settings; to minimize the pain, make sure you use
> > an 8-bit encoding (e.g. ISO-8859-1) and not a designed-for-internal-
> > use-only variable-width encoding like UTF-8.
> 
> Still, all new RH8 installs do use utf-8, and there must be a good reason 
> for that, and I guess it's something they will do for a while now...

I disagree with Fredrik that UTF-8 is for internal use only. Using
UTF-8 locales is the only way to solve several aspects of Unix
internationalization, in particular supporting non-ASCII file names,
and supporting non-ASCII configuration files (specifically /etc/passwd).

> >>> title = getattr(MP3Info.MP3Info(open('file.mp3', 'rb')), 'title')
> >>> title
> 'K\xf6ttbullar i n\xe4san'
> >>> print title
> K?ttbullar i n?san

In this case, it appears that the title in the MP3 file is encoded in
Latin-1, not in UTF-8. Your terminal expects UTF-8. The data you print
are invalid UTF-8, so the terminal refuses to display them. To print
the data properly in your terminal, do

print unicode(title, "iso-8859-1").encode("utf-8")
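
As a minimal sketch of why the terminal shows question marks and what
the conversion above produces, assuming the title bytes are exactly
those in the session quoted earlier. It is spelled with a b'...'
literal and .decode() (an assumption on my part, so that it also runs
on later Python versions); the names valid_utf8 and as_utf8 are only
illustrative.

```python
# Title bytes as shown in the quoted session (Latin-1 encoded data).
title = b'K\xf6ttbullar i n\xe4san'

# Interpreted as UTF-8, the bytes are malformed: 0xF6 would have to be
# followed by continuation bytes, but the next byte is an ASCII 't'.
try:
    title.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Decode with the encoding the data was really written in, then
# re-encode in the encoding the terminal expects.
as_utf8 = title.decode("iso-8859-1").encode("utf-8")
```

After the conversion, each of the two non-ASCII letters occupies two
bytes instead of one.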

Again, there is nothing that Python can do about that: It is not
possible to know what encoding title has - it could just as well be,
say, KOI8-R (in which case \xf6 would be CYRILLIC CAPITAL LETTER ZHE,
not LATIN SMALL LETTER O WITH DIAERESIS).
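
To illustrate the ambiguity, here is a small sketch that decodes the
same single byte under both encodings and looks up the resulting
character names. The variable names are illustrative only; the codecs
"iso-8859-1" and "koi8-r" and unicodedata.name() are part of the
standard library.

```python
import unicodedata

# The byte in question, taken from the title above.
raw = b'\xf6'

# The same byte, interpreted under two different 8-bit encodings.
as_latin1 = raw.decode("iso-8859-1")
as_koi8r = raw.decode("koi8-r")

# Look up the Unicode character names for each interpretation.
latin1_name = unicodedata.name(as_latin1)
koi8r_name = unicodedata.name(as_koi8r)
```

Both decodings succeed without error, which is exactly why the
encoding cannot be guessed from the bytes alone.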

> Besides this stuff, I think it's really nice..

I find it a reasonable decision to suggest that users use UTF-8 as
their default encoding, irrespective of the language they
speak. Unfortunately, many applications are not really prepared for
multi-byte encodings, but those applications must be corrected.

Regards,
Martin


