[Tutor] name shortening in a csv module output

Dave Angel d at davea.name
Thu Apr 23 16:29:42 CEST 2015


On 04/23/2015 06:37 AM, Jim Mooney wrote:
> ..
>
>> Ï»¿
>>
>> is the UTF-8 BOM (byte order mark) interpreted as Latin 1.
>>
>> If the input is UTF-8 you can get rid of the BOM with
>>
>> with open("data.txt", encoding="utf-8-sig") as csvfile:
>>
>
> Peter Otten
>
> I caught the bad arithmetic on name length, but where is the byte order
> mark coming from? My first line is plain English so far as I can see - no
> umlauts or foreign characters.
> first_name|last_name|email|city|state or region|address|zip
>
> Is this an artifact of csv module output, or is it the data from
> generatedata.com, which looks global? More likely it means I have to figure
> out unicode ;'(

A file is always stored as bytes, so if it's a text file, it is always 
an encoded file (although if it's ASCII, you tend not to think of that 
much).

So whatever program writes that file has picked an encoding, and when 
you read it you have to use the same encoding to safely read it into text.

By relying on the default when you read it, you're making an unspoken 
assumption about the encoding of the file.

There are dozens of common encodings out there, and anytime you get the 
wrong one, you're likely to mess up somewhere, unless it happens to be 
pure ASCII.
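
Here's a minimal sketch of that point. The two sample strings and the
misread() helper are made up for illustration; they are not from the
file in this thread. The same UTF-8 bytes are decoded with the wrong
encoding (Latin-1), and only the pure ASCII name survives intact.

```python
# Hypothetical sample strings, not taken from the file in the thread.
ascii_name = "Jim"
accented_name = "M\u00fcller"  # u-umlaut written as an escape to keep this source ASCII


def misread(text):
    """Encode text as UTF-8, then decode those bytes as Latin-1 (a wrong guess)."""
    return text.encode("utf-8").decode("latin-1")


ascii_result = misread(ascii_name)
accented_result = misread(accented_name)

# ascii() keeps the output printable even on a console that can't show the characters.
print(ascii(ascii_result))      # the pure ASCII name comes back unchanged
print(ascii(accented_result))   # the umlaut has turned into two junk characters
```

The u-umlaut is two bytes in UTF-8, and Latin-1 treats each byte as its
own character, which is why one letter turns into two.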

The BOM is not supposed to be used in a byte-encoded file, but Notepad, 
among other programs, writes one anyway.  So it happens to be a good 
clue that the rest of the file is encoded in utf-8.  If that's the case, 
and if you want to strip the BOM, use utf-8-sig.
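
A rough sketch of the difference, assuming Python 3 and a pipe-delimited
file like the one in this thread. The sample row, the temporary file and
the first_field() helper are invented for the example. It writes a file
with a UTF-8 BOM, then reads the first field of the header three ways.

```python
import codecs
import csv
import os
import tempfile

# Hypothetical sample data, pipe-delimited like the header in the thread.
sample = "first_name|last_name|email\nAnn|Example|ann@example.com\n"


def first_field(path, encoding):
    """Return the first field of the first row, reading the file with `encoding`."""
    with open(path, encoding=encoding, newline="") as csvfile:
        reader = csv.reader(csvfile, delimiter="|")
        return next(reader)[0]


# Write the file the way Notepad would: a UTF-8 BOM followed by the text.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + sample.encode("utf-8"))

try:
    as_latin1 = first_field(path, "latin-1")      # BOM bytes become three junk characters
    as_utf8 = first_field(path, "utf-8")          # BOM survives as one U+FEFF character
    as_utf8_sig = first_field(path, "utf-8-sig")  # BOM is stripped
finally:
    os.remove(path)

# ascii() keeps the output printable on any console.
print(ascii(as_latin1))
print(ascii(as_utf8))
print(ascii(as_utf8_sig))
```

Note that plain "utf-8" decodes the file without error but leaves the
BOM glued to the first column name, so a lookup on "first_name" would
still fail. Only utf-8-sig gives you the clean name.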

Note:  the BOM may be legal in utf-8 now, but it was originally intended 
to distinguish the UTF-32-BE from UTF-32-LE, as well as UTF-16-BE from 
UTF-16-LE.

https://docs.python.org/2/library/codecs.html#encodings-and-unicode
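
To illustrate the byte-order role of the BOM described in the note, here
is a small sketch that guesses a codec from the first bytes of a file.
The function name guess_codec_from_bom() is my own; the BOM constants
come from the standard codecs module. It only recognizes a BOM, so it
returns None for files without one, and it is a hint rather than proof
of the encoding.

```python
import codecs

# Longer BOMs first: the UTF-32-LE BOM starts with the same two bytes
# as the UTF-16-LE BOM, so the order of these checks matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_codec_from_bom(raw):
    """Return a codec name that will consume the BOM in `raw`, or None if there is none."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return codec
    return None


# Python's "utf-16" and "utf-32" codecs write a BOM when encoding and use
# it to pick the byte order when decoding.
utf16_bytes = "hi".encode("utf-16")
utf32_bytes = "hi".encode("utf-32")
utf8_bytes = codecs.BOM_UTF8 + "hi".encode("utf-8")

for raw in (utf16_bytes, utf32_bytes, utf8_bytes, b"hi"):
    codec = guess_codec_from_bom(raw)
    print(codec, ascii(raw.decode(codec)) if codec else "no BOM found")
```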

-- 
DaveA

