[Python-3000] Pre-PEP: Easy Text File Decoding

Thu Sep 14 15:01:23 CEST 2006

David Hopwood <david.nospam.hopwood at blueyonder.co.uk> writes:

> You're correct about the use of a BOM as a signature. All
> Unicode-conformant applications should accept this use of a BOM in
> UTF-8 (although they need not generate it); the standard is quite
> clear on that.

When a program generates a list of filenames in a file, and I do
   xargs -i cp {} some-dir/ <filenames-file
and one file is not found because a UTF-8 BOM has been inserted before
its name, I won't blame xargs. I will blame the program which generated
the filenames. Or the language it is written in, if it didn't create
the BOM explicitly.
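
To make the failure concrete, here is a minimal Python sketch. The
filenames and the byte string standing in for the generated file are
made up for illustration. It assumes a byte-oriented consumer (like
xargs) that splits on newlines and does not know about signatures.

```python
import codecs

# Hypothetical file contents: a generator wrote a UTF-8 signature (BOM)
# before a list of filenames, one per line.
data = codecs.BOM_UTF8 + "report.txt\nnotes.txt\n".encode("utf-8")

# A byte-oriented consumer splits on newlines and treats the three BOM
# bytes as part of the first filename.
raw_names = data.split(b"\n")[:-1]

# Decoding as plain "utf-8" keeps the BOM as U+FEFF in the first name;
# the "utf-8-sig" codec strips a leading signature if one is present.
plain_names = data.decode("utf-8").splitlines()
sig_names = data.decode("utf-8-sig").splitlines()

print(raw_names[0])    # first name as the byte-oriented consumer sees it
print(plain_names[0] == "report.txt")
print(sig_names[0] == "report.txt")
```

Only the reader that explicitly strips the signature gets the intended
first name; every other tool downstream looks for a file that does not
exist.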

                          *       *       *

A tricky issue is handling filenames which can't be decoded.

I'm willing to blame myself when the list of filenames contains names
which can't be decoded using the locale encoding, because I know no
good solution to the problem of representing arbitrary Linux filenames
as Unicode strings.

Some people would blame the program or the language.
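
A small Python sketch of why this is tricky. The filename is
hypothetical: it assumes a name created under an ISO-8859-1 locale and
later read under a UTF-8 locale. Strict decoding fails, and a lossy
decoding no longer maps back to the bytes the filesystem stores.

```python
# Hypothetical filename bytes: "caf\xe9.txt" is "café.txt" in ISO-8859-1,
# which is not a valid UTF-8 byte sequence.
raw = b"caf\xe9.txt"

# Strict decoding with the (assumed) locale encoding raises an error.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Replacing the bad byte yields a string, but the original bytes are lost:
# encoding the result again does not give back the name on disk.
lossy = raw.decode("utf-8", errors="replace")
round_trip = lossy.encode("utf-8")

print(strict_ok)
print(round_trip == raw)
```

Either the program refuses the name, or it ends up with a string that
no longer identifies the file.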

OTOH there exist libraries which believe that all filenames should be
UTF-8, irrespective of the locale. In particular GNOME used to require
setting the environment variable G_BROKEN_FILENAMES when filenames were
not UTF-8 (now G_FILENAME_ENCODING can be set). I disagree with them.

This applies to Linux. I think MacOS uses UTF-8 filenames, so the
story is different there.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/