[Tutor] name shortening in a csv module output
Dave Angel
davea at davea.name
Thu Apr 23 23:30:43 CEST 2015
On 04/23/2015 02:14 PM, Jim Mooney wrote:
>>
>> By relying on the default when you read it, you're making an unspoken
>> assumption about the encoding of the file.
>>
>> --
>> DaveA
>
>
> So is there any way to sniff the encoding, including the BOM (which appears
> to be used or not used randomly for utf-8), so you can then use the proper
> encoding, or do you wander in the wilderness? I was going to use encoding =
> utf-8 as a suggested default. I noticed it got rid of the bom symbols but
> left an extra blank space at the beginning of the stream. Most books leave
> unicode to the very end, if they mention the BOM at all (mine is at page
> 977, which is still a bit off ;')
That's not a regular blank. See the link I mentioned before, and the
following sentence:
""" Unfortunately the character U+FEFF had a second purpose as a ZERO
WIDTH NO-BREAK SPACE: a character that has no width and doesn’t allow a
word to be split. It can e.g. be used to give hints to a ligature
algorithm. """
To automatically get rid of that BOM character when reading a file, you
use utf-8-sig, rather than utf-8. And on writing, since you probably
don't want it, use utf-8.
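
A minimal sketch of the difference, assuming Python 3. The file name and
its contents are made up for illustration:

```python
import os
import tempfile

# Hypothetical sample: a small CSV that starts with a UTF-8 BOM.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfname,city\r\nJim,Tucson\r\n")

# Plain utf-8 decodes the BOM as U+FEFF and leaves it in the text.
with open(path, encoding="utf-8", newline="") as f:
    with_bom = f.read()

# utf-8-sig drops a leading BOM if there is one, and is harmless if not.
with open(path, encoding="utf-8-sig", newline="") as f:
    without_bom = f.read()

# Writing with plain utf-8 does not add a BOM.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(without_bom)
with open(path, "rb") as f:
    raw = f.read()

print(repr(with_bom[:5]))
print(repr(without_bom[:4]))
print(raw[:4])
```

The newline="" argument is what the csv module docs ask for when you
hand the file object to csv.reader or csv.writer.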
As for guessing what encoding was used, the best approach is to ask the
person/program that wrote the file. Or read the specs. And once you
figure it out, fix the specs.
With a short sample, you're unlikely to guess right. That's because
ASCII text looks the same in all the common byte-oriented encodings
(UTF-8, Latin-1, cp1252 and so on). (Not in the various *16* and *32*
formats, as they use 2 or 4 bytes for each ASCII character.) If
you encounter one of those, you'll probably see lots of null bytes mixed
in a consistent pattern.
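
Here is a rough sketch of checking for a BOM and for that null-byte
pattern, assuming Python 3. The function names and the 0.9 threshold are
my own choices, not anything standard, and the null-byte test only works
when the sample is mostly ASCII characters:

```python
import codecs

# Longest BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(data):
    """Return a codec name that will consume the BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

def null_byte_hint(data):
    """Guess a BOM-less UTF-16/32 flavour from where the nulls fall."""
    if len(data) < 4:
        return None
    # Is each byte offset (modulo 4) almost entirely null bytes?
    mostly_null = []
    for offset in range(4):
        column = data[offset::4]
        mostly_null.append(column.count(b"\x00") / len(column) > 0.9)
    patterns = {
        (False, True, True, True): "utf-32-le",
        (True, True, True, False): "utf-32-be",
        (False, True, False, True): "utf-16-le",
        (True, False, True, False): "utf-16-be",
    }
    return patterns.get(tuple(mostly_null))

print(sniff_bom(codecs.BOM_UTF8 + b"name,city"))
print(null_byte_hint("name,city".encode("utf-16-le")))
print(null_byte_hint(b"name,city"))
```

A None from both functions tells you only that the data is in some
byte-oriented encoding, not which one.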
Consider the 'file' command in Linux. I don't know of any Windows
equivalent.
If you want to write your own utility, perhaps to scan hundreds of
files, consider:
http://pypi.python.org/pypi/chardet
http://linux.die.net/man/3/libmagic
https://github.com/ahupp/python-magic
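
As a starting point for such a utility, something like the following
should work with chardet, assuming it is installed (pip install chardet).
The guess_encoding wrapper is my own name, not part of the library. The
result is a statistical guess with a confidence figure, so treat it as a
hint rather than an answer:

```python
try:
    import chardet  # third-party package, not in the standard library
except ImportError:
    chardet = None

def guess_encoding(data):
    """Return (encoding, confidence) for a bytes sample.

    Returns (None, 0.0) if chardet is missing or cannot decide.
    """
    if chardet is None:
        return None, 0.0
    result = chardet.detect(data)
    return result["encoding"], result["confidence"]

# A longer sample with some non-ASCII characters gives it more to go on.
sample = "naïve café, señor\n".encode("utf-8") * 20
print(guess_encoding(sample))
```

To scan many files, read each one in binary mode ("rb") and pass the
bytes, or just the first few kilobytes, to guess_encoding.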
--
DaveA