Guessing the encoding from a BOM

Rustom Mody rustompmody at gmail.com
Fri Jan 17 00:08:23 EST 2014


On Friday, January 17, 2014 7:10:05 AM UTC+5:30, Tim Chase wrote:
> On 2014-01-17 11:14, Chris Angelico wrote:
> > UTF-8 specifies the byte order
> > as part of the protocol, so you don't need to mark it.

> You don't need to mark it when writing, but some idiots use it
> anyway.  If you're sniffing a file for purposes of reading, you need
> to look for it and remove it from the actual data that gets returned
> from the file--otherwise, your data can see it as corruption.  I end
> up with lots of CSV files from customers who have polluted it with
> Notepad or had Excel insert some UTF-8 BOM when exporting.  This
> means my first column-name gets the BOM prefixed onto it when the
> file is passed to csv.DictReader, grr.

And it's part of the standard:
see Table 2.4 here
http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
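
For the csv.DictReader case Tim describes, a minimal sketch of one way to cope, assuming Python 3 and its built-in "utf-8-sig" codec. The sample bytes and the variable names below are made up for illustration; they are not from Tim's files.

```python
import codecs
import csv
import io

# Hypothetical sample: a CSV export that starts with the UTF-8 BOM (EF BB BF).
raw = codecs.BOM_UTF8 + b"name,qty\r\nwidget,3\r\n"

# Decoding as plain "utf-8" keeps the BOM as U+FEFF, so it ends up
# glued onto the first column name.
plain = csv.DictReader(io.StringIO(raw.decode("utf-8"), newline=""))
plain_fields = plain.fieldnames

# "utf-8-sig" drops a leading BOM if there is one and otherwise
# behaves like "utf-8", so files without a BOM still decode the same way.
clean = csv.DictReader(io.StringIO(raw.decode("utf-8-sig"), newline=""))
clean_fields = clean.fieldnames
clean_rows = list(clean)
```

When reading from disk, the same codec name can be passed to open(), e.g. open(path, encoding="utf-8-sig", newline=""), where path is whatever file the customer sent.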


