[Python-3000] Pre-PEP: Easy Text File Decoding

Josiah Carlson jcarlson at uci.edu
Sun Sep 10 23:47:13 CEST 2006


David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Josiah Carlson wrote:
[snip]
> > Using the xml guessing mechanism is fine, as long as you get it right. 
> > A first pass with BOM detection and a second pass to "guess" based on
> > content in the case that a BOM isn't detected seems to make sense.
> 
> ... if you think that guessing based on content is a good idea -- I don't.
> In any case, such guessing necessarily depends on the expected file format,
> so it should be done by the application itself, or by a library that knows
> more about the format.

I'm keeping my hat out of the ring on whether guessing is a good idea.
However, if one is going to have a guessing mechanism, checking for UTF
BOMs first is a good start, which is what I was trying to say.
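
As a rough sketch of that BOM-first pass (the names detect_bom and _BOMS are
made up for illustration; only the BOM constants come from the standard
codecs module), something like this would do. Note that FF FE 00 00 is
treated as UTF-32-LE here, although it could also be UTF-16-LE text whose
first character is NUL, so even this step involves a small guess:

```python
import codecs

# Check the longer BOMs first: the UTF-32-LE BOM (FF FE 00 00) begins with
# the UTF-16-LE BOM (FF FE), so testing UTF-16 first would misdetect it.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def detect_bom(prefix):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if prefix.startswith(bom):
            return name, len(bom)
    return None, 0
```

A file beginning with 4 null bytes has no BOM, so this pass reports nothing
for it and leaves the decision to whatever comes next.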


> If the encoding of a text stream were settable after it had been opened,
> then it would be easy for anyone to implement whatever guessing algorithm
> they needed, without having to write an encoding implementation or include
> any other support for guessing in the I/O library itself.

That is true.  But the fact that you, presumably an experienced
programmer with regard to Unicode, provided an algorithm with an
obvious hole that I was able to discover in a few moments suggests that
guessing algorithms are not easy to write.
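
For concreteness, here is a minimal sketch of the "peek, rewind, then choose
a decoder" idea. It assumes a seekable binary stream and uses
io.TextIOWrapper from the current standard library as a stand-in for a
settable encoding; wrap_with_guess and the guess callback are hypothetical
names, not a proposed API:

```python
import io

def wrap_with_guess(raw, guess, default="utf-8"):
    """Wrap a seekable binary stream in a text decoder chosen by `guess`.

    `guess` is a caller-supplied function (hypothetical) that receives the
    first few raw bytes and returns an encoding name, or None to fall back
    to `default`.
    """
    prefix = raw.read(4)   # enough bytes to hold any UTF BOM
    raw.seek(0)            # rewind so the decoder sees the stream from the start
    encoding = guess(prefix) or default
    return io.TextIOWrapper(raw, encoding=encoding)
```

The I/O layer stays ignorant of any guessing policy; all of the risk sits in
the guess function the application supplies. A BOM, if present, is not
stripped unless the chosen codec does that itself.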

> (This also requires the ability to seek back to the beginning of the stream
> after reading the data needed for the guess.)
> 
> > Note that the above algorithm returns UTF32BE for a file beginning with
> > 4 null bytes.
> 
> Yes. But such a thing probably isn't a text file at all -- in which case
> there will be subsequent decoding errors when most of the code units are
> not in the range 0 to 0x10FFFF.

A file starting with 4 nulls is very likely a non-text file of some
kind, but presuming that "most" code points would not be in the
0...0x10ffff range is a bit of an assumption about the content of a file.
I thought you didn't want to guess.
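
To illustrate with made-up data (decodes_as_utf32be, garbage and record are
names invented for this example), here are two non-text inputs that both
start with 4 nulls. One trips the UTF-32-BE decoder as expected; the other,
a record of small big-endian 32-bit integers, decodes without any error:

```python
import struct

def decodes_as_utf32be(data):
    """Return True if `data` decodes as UTF-32-BE without an error."""
    try:
        data.decode("utf-32-be")
        return True
    except UnicodeDecodeError:
        return False

# Four nulls followed by a unit far above 0x10FFFF: the decoder rejects it.
garbage = b"\x00\x00\x00\x00" + b"\xff\xff\xff\xff"

# Four nulls followed by small big-endian 32-bit integers: every unit is a
# valid code point, so this binary record is accepted as "text".
record = struct.pack(">4I", 0, 65, 66, 67)

print(decodes_as_utf32be(garbage))  # False
print(decodes_as_utf32be(record))   # True
```

So whether the later decoding errors ever show up depends entirely on what
the file happens to contain.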


 - Josiah


