[Tutor] name shortening in a csv module output

Dave Angel davea at davea.name
Thu Apr 23 23:30:43 CEST 2015


On 04/23/2015 02:14 PM, Jim Mooney wrote:
>>
>> By relying on the default when you read it, you're making an unspoken
>> assumption about the encoding of the file.
>>
>> --
>> DaveA
>
>
> So is there any way to sniff the encoding, including the BOM (which appears
> to be used or not used randomly for utf-8), so you can then use the proper
> encoding, or do you wander in the wilderness? I was going to use encoding =
> utf-8 as a suggested default. I noticed it got rid of the bom symbols but
> left an extra blank space at the beginning of the stream. Most books leave
> unicode to the very end, if they mention the BOM at all (mine is at page
> 977, which is still a bit off ;')


That's not a regular blank.  See the link I mentioned before, and the 
following sentence:

""" Unfortunately the character U+FEFF had a second purpose as a ZERO 
WIDTH NO-BREAK SPACE: a character that has no width and doesn’t allow a 
word to be split. It can e.g. be used to give hints to a ligature 
algorithm. """

To automatically get rid of that BOM character when reading a file, you 
use utf-8-sig, rather than utf-8.  And on writing, since you probably 
don't want it, use utf-8.
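
A minimal sketch of the difference, working on bytes in memory rather 
than a real file.  The sample text and the variable names are made up 
for illustration; the same codec names can be passed as the encoding 
argument to open().

```python
import codecs

# Bytes as some editor might have saved them: a UTF-8 BOM, then the text.
raw = codecs.BOM_UTF8 + "name,age\n".encode("utf-8")

# Plain utf-8 keeps the BOM as a leading U+FEFF character.
plain = raw.decode("utf-8")

# utf-8-sig strips a leading BOM if one is present.
clean = raw.decode("utf-8-sig")

# utf-8-sig also accepts data that has no BOM at all.
no_bom = "name,age\n".encode("utf-8").decode("utf-8-sig")

# Encoding with plain utf-8 never writes a BOM.
written = clean.encode("utf-8")
```

So utf-8-sig is a safe choice for reading whether or not the BOM is 
there, which matters if it shows up as randomly as you describe.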

As for guessing what encoding was used, the best approach is to ask the 
person/program that wrote the file.  Or read the specs.  And once you 
figure it out, fix the specs.

With a short sample, you're unlikely to guess right.  That's because 
ASCII text encodes to the same bytes in all the byte-oriented encodings 
(utf-8, latin-1, cp1252, and so on).  The various *16* and *32* formats 
are different, since they use 2 or 4 bytes per character.  If you 
encounter one of those, you'll probably see lots of null bytes mixed in 
a consistent pattern.
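
A rough sketch of what such a check could look like.  The helper names 
sniff_bom and null_fraction are hypothetical, not from any library; 
only the BOM constants come from the standard codecs module.  A BOM, 
when present, is a strong hint, but its absence tells you nothing.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM
# starts with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def null_fraction(data):
    """Return the fraction of bytes in data that are null (0.0 if empty)."""
    return data.count(0) / len(data) if data else 0.0


sample = "name,age\n"

# ASCII-only text gives identical bytes in the byte-oriented encodings.
same = {sample.encode(enc) for enc in ("ascii", "utf-8", "latin-1", "cp1252")}

# In the wide encodings, the same text is half or three-quarters nulls.
frac16 = null_fraction(sample.encode("utf-16-le"))
frac32 = null_fraction(sample.encode("utf-32-le"))
```

The codec names returned are ones that consume the BOM when decoding.  
A UTF-16-LE file whose first real character is a null would be 
misread as UTF-32 by this ordering, so treat the result as a guess.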


Consider the 'file' command in Linux.  I don't know of any Windows 
equivalent.

If you want to write your own utility, perhaps to scan hundreds of 
files, consider:

   http://pypi.python.org/pypi/chardet
   http://linux.die.net/man/3/libmagic
   https://github.com/ahupp/python-magic
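
For the first of those, a minimal sketch, assuming chardet has been 
installed with pip.  chardet.detect() takes bytes and returns a dict 
that includes 'encoding' and 'confidence' keys; the wrapper name 
guess_encoding is made up for this example.

```python
try:
    import chardet  # third-party package, not in the standard library
except ImportError:
    chardet = None


def guess_encoding(data):
    """Return (encoding, confidence) as guessed by chardet.

    Returns None if chardet is not installed.  The encoding itself may
    also be None when chardet cannot make a guess.
    """
    if chardet is None:
        return None
    result = chardet.detect(data)
    return result["encoding"], result["confidence"]


# A longer sample gives the detector more to work with.
guess = guess_encoding("Grüße aus Köln, café crème\n".encode("utf-8") * 20)
```

The confidence value is only the detector's own estimate, so for short 
or mostly-ASCII input it's still a guess.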

-- 
DaveA

