Unicode string handling problem

Richard Schulman raschulmanxx at verizon.net
Tue Sep 5 23:55:18 EDT 2006


On 5 Sep 2006 19:50:27 -0700, "John Roth" <JohnRoth1 at jhrothjr.com>
wrote:

>> [T]he file I actually want to process is Unicode (utf-16 encoding).
>>...
>> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
>>...

John Roth:
>You're not detecting the file encoding and then
>using it in the open statement. If you know this is
>utf-16le or utf-16be, you need to say so in the
>open. If you don't, then you should read it into
>a string, go through some autodetect logic, and
>then decode it with the <string>.decode(encoding)
>method.
>
>A clue: a properly formatted utf-16 or utf-32
>file MUST have a BOM as the first character.
>That's mandated in the unicode standard. If
>it doesn't have a BOM, then try ascii and
>utf-8 in that order.  The first
>one that succeeds is correct. If neither succeeds,
>you're on your own in guessing the file encoding.
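
The detection order John describes above can be sketched roughly as
follows. This is only a minimal sketch, assuming Python 3, where
bytes.decode() plays the role of the <string>.decode(encoding) call he
mentions. The helper name decode_with_autodetect and the BOM table are
hypothetical, and the UTF-8 BOM entry is an addition that his list does
not mention.

```python
import codecs

# BOMs are checked longest-first, because the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
# The UTF-8 entry is an assumption added here; the quoted advice only
# covers utf-16 and utf-32.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_autodetect(raw):
    """Hypothetical helper: return (text, encoding_name) for a bytes object."""
    # Step 1: look for a BOM. The "utf-16", "utf-32" and "utf-8-sig"
    # codecs consume the BOM themselves while decoding.
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw.decode(encoding), encoding
    # Step 2: no BOM, so try ascii and then utf-8, in that order.
    for encoding in ("ascii", "utf-8"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Step 3: neither worked; the caller has to guess.
    raise ValueError("could not determine the file encoding")
```

The raw bytes would come from opening the file in binary mode, for
example open(path, "rb").read(), so that nothing is decoded or
translated before the check runs.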

Thanks for this further information. I'm now using the codec, with
improved results, but I'm still puzzled about how to handle the row
terminator \n\n, which is being interpreted as two rows instead of
one.
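
For illustration, here is one possible way to treat \n\n as a single
row terminator. It is a minimal sketch, assuming Python 3 and a file
small enough to read in one go. The names split_rows and read_rows are
hypothetical. Instead of iterating over the file line by line, which
ends a row at every single \n, it reads the decoded text whole and
splits on the two-character terminator.

```python
def split_rows(text, terminator="\n\n"):
    """Hypothetical helper: split decoded text into rows on the terminator."""
    rows = text.split(terminator)
    # A terminator at the very end of the text leaves an empty last piece.
    if rows and rows[-1] == "":
        rows.pop()
    return rows


def read_rows(path, encoding="utf-16", terminator="\n\n"):
    """Hypothetical helper: decode the whole file, then split it into rows."""
    # The default text mode translates \r\n and \r to \n while reading,
    # so a "\r\n\r\n" terminator on disk also arrives here as "\n\n".
    with open(path, "r", encoding=encoding) as f:
        return split_rows(f.read(), terminator)
```

With the file from the earlier message this would be called as
read_rows("c:\\pythonapps\\in-graf1.my"). A single \n inside a row is
kept as part of that row rather than starting a new one.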


