[Python-3000] locale-aware strings ?

Wed Sep 6 15:18:19 CEST 2006

On 9/6/06, Oleg Broytmann <phd at oper.phd.pp.ru> wrote:
>    These situations are caused by the lack of metadata or clear
> encoding-friendly standards. Ogg, for example, is encoding friendly - it
> clearly states that tags (comments) must be in UTF-8, and all Ogg Vorbis
> files I have seen were really in UTF-8, and all tag editors and players
> write/use UTF-8.

And yet I've run across vorbiscomments encoded in latin-1. It screws
everyone else up, but there are always going to be applications that
do not play along.

> XML is encoding-friendly - every file specifies its encoding.

And plenty of people read and write it with methods that cannot
cope with non-ASCII files.

> The HTTP protocol is mostly encoding friendly with its Content-Type
> header. HTML is partially encoding friendly, but only partially - if one
> saves an HTML page to a file it may lack encoding information.

Right; HTTP has the means to indicate the encoding, but the server
rarely has the means to find out what it actually is.

>    But text files and the FTP protocol don't have any metadata, and ID3v2
> doesn't specify a universal encoding or encoding metadata. In these cases
> programs can either guess the encoding based on the file content or use
> the system global encoding.

Actually, ID3v2 offers exactly four encodings: Latin-1, UTF-16,
UTF-16BE, and UTF-8. However, UTF-16 doesn't have a fixed byte order,
and Latin-1 has been abused and holds text in the Windows ANSI code
page (ACP) more often than not, so it's a poor indicator. Another case
of applications ignoring the spec and doing what's easy. (I don't
recall exactly when the Unicode encoding options were added, so they
may have had little choice; more likely they were too lazy to use
UTF-16 or it wouldn't work on their portable device.)
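
To make the four options concrete, here is a minimal sketch of decoding
an ID3v2.4 text frame body, where the first byte selects the encoding.
The function name and the latin1_fallback parameter are hypothetical,
added only to illustrate the "Latin-1 is often really the Windows code
page" problem; this is not any tagging library's API.

```python
# Encoding byte -> Python codec for ID3v2.4 text frames.
ID3V24_CODECS = {
    0: "latin-1",    # ISO-8859-1
    1: "utf-16",     # UTF-16 with BOM; the codec reads the BOM
    2: "utf-16-be",  # UTF-16BE, no BOM
    3: "utf-8",
}


def decode_text_frame(body, latin1_fallback="latin-1"):
    """Decode a text frame body: one encoding byte followed by the text.

    latin1_fallback is a hypothetical knob: pass e.g. "cp1252" when
    frames marked Latin-1 are suspected to hold Windows-encoded text.
    """
    if not body:
        raise ValueError("empty frame body")
    marker = body[0]
    if marker not in ID3V24_CODECS:
        raise ValueError("unknown encoding byte: %d" % marker)
    codec = latin1_fallback if marker == 0 else ID3V24_CODECS[marker]
    # Strip any trailing terminator after decoding.
    return body[1:].decode(codec).rstrip("\x00")
```

Note that nothing in the frame can tell you whether a byte marked 0
really is Latin-1; the fallback has to come from the caller.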

>    I fail to see how Python can help here.

Absolutely agreed. I suspect the best option is some sort of TextFile
constructor that defaults to ASCII (or has no default) but accepts an
easy way to use the "recommended" or system encoding, or any explicit
one. And for more complicated formats, the code will just have to use
a bytestream layer, and decode as necessary. This may be a pain for
mbox files, but unless there's a way to switch encodings on the fly,
what looks like a text file will have to be treated as binary
(newlines excepted, I hope).
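
A rough sketch of the kind of constructor I mean, written as a plain
function. The name open_text and the "system" keyword are made up for
illustration; only open() and locale.getpreferredencoding() are real.

```python
import locale


def open_text(path, encoding="ascii", newline=None):
    """Hypothetical helper: open `path` for reading as text.

    encoding="ascii"   strict default; non-ASCII bytes raise on read
    encoding="system"  use the locale's preferred encoding
    anything else      treated as an explicit codec name
    """
    if encoding == "system":
        encoding = locale.getpreferredencoding(False)
    return open(path, "r", encoding=encoding, newline=newline)


def read_raw_lines(path):
    """Bytestream layer for mixed-encoding files such as mbox.

    Returns lines as bytes; the caller decodes each piece with
    whatever encoding that piece declares.
    """
    with open(path, "rb") as f:
        return f.read().splitlines()
```

With the ASCII default, a file containing anything else fails loudly
instead of being silently misread, and the caller has to state an
encoding or explicitly ask for the system one.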

I also hope that, if the "recommended" encoding uses a heuristic on
the file's contents, the file has enough data in the encoding to make
a good guess. Music metadata rarely has that much. :)
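
As an illustration of why short strings guess badly, this sketch just
lists which candidate codecs decode a byte string without error. The
function name and the candidate list are arbitrary choices for the
example, not a real detection algorithm.

```python
def plausible_codecs(data, candidates=("ascii", "utf-8", "cp1252", "latin-1")):
    """Return the candidate codecs that decode `data` without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok
```

An artist name a few bytes long will often pass for more than one
codec, and decoding without error says nothing about whether the
result is the text that was intended.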

Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog