writing \feff at the beginning of a file

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Fri Aug 13 21:54:27 EDT 2010


On Fri, 13 Aug 2010 18:25:46 -0400, Terry Reedy wrote:

> A short background to MRAB's answer which I will try to get right.
> 
> The byte-order-mark was invented for UTF-16 encodings so the reader
> could determine whether the pairs of bytes are in little- or big-endian
> order, depending on whether the first two bytes are fe and ff or ff and
> fe (or maybe vice versa, does not matter here). The concept is
> meaningless for utf-8 which consists only of bytes in a defined order.
> This is part of the Unicode standard.
> 
> However, Microsoft (or whoever) re-purposed (hijacked) that pair of
> bytes to serve as a non-standard indicator of utf-8 versus any
> non-unicode encoding. The result is a corrupted utf-8 stream that python
> accommodates with the utf-8-sig(nature) codec (versus the standard utf-8
> codec).
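
As a quick illustration of the difference (using Python 3 bytes 
literals; the codec names are the real stdlib ones): utf-8-sig strips 
the signature on decoding, while plain utf-8 leaves a U+FEFF character 
at the start of the text.

    >>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
    'hello'
    >>> b'\xef\xbb\xbfhello'.decode('utf-8')
    '\ufeffhello'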


Is there a standard way to autodetect the encoding of a text file? I do 
this:

Open the file in binary mode; if the first three bytes are 
codecs.BOM_UTF8, then it's a Microsoft-style UTF-8 text file; otherwise, 
if the first two bytes are codecs.BOM_BE or codecs.BOM_LE, the encoding 
is utf-16-be or utf-16-le respectively.

(I don't bother to check for other BOMs, such as the UTF-32 ones. There 
are *lots* of them, but in my experience those encodings are rarely 
used, and most of the other BOMs aren't defined in the codecs module, so 
I don't bother to support them.)
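
In code, the BOM check is roughly this (a sketch only; detect_bom is my 
name for it, not a standard function):

    import codecs

    def detect_bom(filename):
        # Return (encoding, skip), where `skip` is how many bytes the
        # caller should seek past before decoding, or (None, 0) if no
        # recognised BOM is present.
        with open(filename, 'rb') as f:
            start = f.read(3)
        if start == codecs.BOM_UTF8:        # b'\xef\xbb\xbf'
            return 'utf-8-sig', 0           # utf-8-sig strips it itself
        if start[:2] == codecs.BOM_BE:      # b'\xfe\xff'
            return 'utf-16-be', 2
        if start[:2] == codecs.BOM_LE:      # b'\xff\xfe'
            return 'utf-16-le', 2
        return None, 0

The skip count matters because the utf-16-be and utf-16-le codecs, 
unlike utf-8-sig, leave the BOM in the decoded text.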

If there's no BOM, then re-open the file and read the first two lines. If 
either of them matches the regex 'coding[=:]\s*([-\w.]+)', then I take 
the encoding name from the first group. This matches Python's own 
behaviour (PEP 263), and supports Emacs and Vim encoding declarations.

Otherwise, there is no declared encoding, and I use whatever encoding I 
like (whatever was specified by the user or the application default).
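
Putting the three steps together, something like this (a sketch; the 
function name and default parameter are mine, and it reuses detect_bom 
from above):

    import re

    def guess_encoding(filename, default='utf-8'):
        # BOM first, then a PEP 263 style declaration, then the default.
        encoding, _ = detect_bom(filename)
        if encoding is not None:
            return encoding
        pattern = re.compile(r'coding[=:]\s*([-\w.]+)')
        with open(filename, 'rb') as f:
            for line in (f.readline(), f.readline()):
                match = pattern.search(line.decode('ascii', 'replace'))
                if match:
                    return match.group(1)
        return default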


-- 
Steven


