[Python-3000] BOM handling

Thu Sep 14 20:58:47 CEST 2006

Blake Winton <bwinton at latte.ca> wrote:
> Josiah Carlson wrote:
> > Blake Winton <bwinton at latte.ca> wrote:
> >> I'm not going to 
> >> suggest an API, other than it would be nice if I didn't have to manually 
> >> figure out/hard code all the encodings.  (It's my belief that I will 
> >> currently have to do that, or at least special-case XML, to read the 
> >> encoding attribute.)
> > Use the XML tag/attribute "<?xml ... encoding="..." ?> to discover the
> > encoding and assume utf-8 otherwise as per spec:
> > http://www.w3.org/TR/2000/REC-xml-20001006#NT-EncodingDecl
> 
> Yeah, but now you're requiring me to read and understand the file's 
> contents, which is something I (as someone who doesn't particularly care 
> about all this "encoding" stuff) am trying very hard not to do.  Does 
> no-one write generic text processing programs anymore?

Not too long ago, "generic text processing programs" only had to deal
with one of ASCII, EBCDIC, etc., or were written specifically for text
encoded for a particular locale.  Times have changed, but the tools
really haven't.  If you want to easily deal with such things, write the
module.

> If I were to write a program which rotated an image using PIL, I 
> wouldn't have to care whether it was a png or a jpeg.  (At least, I'm 
> pretty sure I wouldn't.  I haven't tried recently.)

Right, but gif, png, jpeg, bmp, and scores of other multimedia formats
contain the equivalent of a Python coding: directive.  Examine the first
dozen or so bytes of basically any kind of image, sound (not mp3s
though), or movie, and you will notice an ASCII specifier for the type
of file.
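
As a rough illustration of that point, here is a minimal sketch (modern
Python, standard library only) that recognizes a few image formats from
their leading bytes.  The names MAGIC_SIGNATURES and sniff_image_type
are hypothetical, and the table is deliberately tiny; note that the
JPEG marker itself is not ASCII, although the "JFIF" or "Exif" tag
usually follows a few bytes later.

```python
# Hypothetical table of leading "magic" bytes; a real one would hold many more.
MAGIC_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),   # PNG signature, contains ASCII "PNG"
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\xff\xd8\xff", "jpeg"),       # JPEG start-of-image marker
    (b"BM", "bmp"),
]


def sniff_image_type(header):
    """Return a short format name for the leading bytes of a file, or None."""
    for signature, name in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return name
    return None
```

Calling sniff_image_type(open(path, "rb").read(12)) is enough to tell
these formats apart without decoding any pixel data.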

By writing the registry module I described, one would be, in essence,
writing a library that understands what kind of media it has been handed,
at least as much as the equivalent of "this is a bmp" or "this is a gif".

> > Is there a bash equivalent to Python coding: directives?  You may be
> > attempting to fix a problem that doesn't exist.
> 
> I don't know if the magic number stuff to determine whether a file is 
> executable or not is bash-specific.  Either way, when I save the file in 
> UTF-8, it's fine, but when I save it in UTF-8 with a BOM, it fails.

So don't save it with a BOM and add a Python coding: directive to the
second line.  Python and bash comments just happen to have the same #
delimiter, and if your editor doesn't suck, then it should understand
such a directive.  With luck, your editor should also allow for the
non-writing of the BOM on utf-8 save (given certain conditions).  If not,
contact the author(s) and request that feature.
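
If the editor cannot be persuaded, the BOM can also be removed after the
fact.  The following is a minimal sketch, assuming the likely cause of
the failure is that the three BOM bytes (EF BB BF) precede the "#!" that
is expected at the very start of a script.  strip_utf8_bom is a
hypothetical helper name; codecs.BOM_UTF8 and the "utf-8-sig" codec are
part of the standard library.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A script as an editor might save it: BOM first, then the "#!" line,
# with the coding: directive on the second line.
script = codecs.BOM_UTF8 + b"#!/bin/bash\n# -*- coding: utf-8 -*-\necho hi\n"
cleaned = strip_utf8_bom(script)
```

When only the decoded text is needed, script.decode("utf-8-sig") drops
the BOM during decoding and gives the same result as decoding cleaned
as plain utf-8.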

> > So you, or anyone else, can write a module for discovering the encoding
> > used for a particular file based on XML tags, Python coding: directives,
> > etc. It could include an extensible registry, and if it is used enough,
> > could be included in the Python standard library.
> 
> Okay, so what will happen for file types which aren't in the registry, 
> like Windows .rc files?

I'm not writing the encoding registry, but if I were, and if no known
encoding was found, I'd claim latin-1, if only because it 'succeeds'
when decoding any byte value, including 128-255.
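
To make the idea concrete, here is one possible shape for such a
registry: a sketch in modern Python under my own assumptions, not a
proposed API.  Every name in it (register, guess_encoding, the two
detectors) is hypothetical.  Each detector looks at the first bytes of a
file and returns an encoding name or None; latin-1 is the fallback.

```python
import re

# Hypothetical registry of detector functions, tried in registration order.
_DETECTORS = []

_XML_ENCODING = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")
_PY_CODING = re.compile(rb"coding[:=]\s*([-\w.]+)")


def register(detector):
    """Add a detector to the registry; usable as a decorator."""
    _DETECTORS.append(detector)
    return detector


@register
def xml_declaration(head):
    # Use the encoding attribute if present; otherwise assume utf-8.
    if not head.startswith(b"<?xml"):
        return None
    match = _XML_ENCODING.match(head)
    return match.group(1).decode("ascii") if match else "utf-8"


@register
def python_coding_directive(head):
    # The directive must sit in a comment on the first or second line.
    for line in head.splitlines()[:2]:
        match = _PY_CODING.search(line)
        if match and line.lstrip().startswith(b"#"):
            return match.group(1).decode("ascii")
    return None


def guess_encoding(head, default="latin-1"):
    """Return the first encoding any detector reports, else the default."""
    for detector in _DETECTORS:
        encoding = detector(head)
        if encoding is not None:
            return encoding
    return default
```

The latin-1 default never raises, because all 256 byte values map to a
character, but that says nothing about whether the decoded text is what
the author of the file intended.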

> I was lying up above when I said that I don't care about this sort of 
> thing.  I do care, but I also believe that I am, and should be, in the 
> minority, and that if we can't ship something that will work for people 
> who don't care about this stuff, then we've failed both them and Python.

Indeed, which is why people who do care should write a registry so that
their users don't need to care.

 - Josiah