Unicode string handling problem

John Roth JohnRoth1 at jhrothjr.com
Tue Sep 5 22:50:27 EDT 2006


Richard Schulman wrote:
> The following program fragment works correctly with an ascii input
> file.
>
> But the file I actually want to process is Unicode (utf-16 encoding).
> The file must be Unicode rather than ASCII or Latin-1 because it
> contains mixed Chinese and English characters.
>
> When I run the program below I get an attribute_count of zero, which
> is incorrect for the input file, which should give a value of fifteen
> or sixteen. In other words, the count function isn't recognizing the
> ", characters in the line being read. Here's the program:
>
> in_file = open("c:\\pythonapps\\in-graf1.my","rU")
> try:
>     # Skip the first line; make the second available for processing
>     in_file.readline()
>     in_line = in_file.readline()
>     attribute_count = in_line.count('",')
>     print attribute_count
> finally:
>     in_file.close()
>
> Any suggestions?
>
> Richard Schulman
> (For email reply, delete the 'xx' characters)

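You're opening the file without telling Python what
encoding it is in, so readline() hands you the raw
utf-16 bytes. In that byte string every ASCII character
is paired with a NUL byte, so the two-character
sequence ", never occurs and count() returns zero.
If you know the file is utf-16le or utf-16be, say so
when you open it (codecs.open takes an encoding
argument). If you don't, read it into a string, go
through some autodetect logic, and then decode it
with the <string>.decode(encoding) method.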
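
For the known-encoding case, something like the
following should work. It's only a sketch:
count_attributes is a made-up helper name, and it
assumes the file starts with a BOM so the plain
"utf-16" codec can work out the byte order. It uses
io.open, which needs Python 2.6 or later; on older
versions codecs.open accepts the same encoding argument.

```python
import io


def count_attributes(path, encoding="utf-16"):
    """Skip the first line, then count '",' in the second one.

    count_attributes is a hypothetical helper, not part of any library.
    """
    # Passing the encoding makes readline() return decoded text
    # instead of raw utf-16 bytes.
    with io.open(path, "r", encoding=encoding) as in_file:
        in_file.readline()            # skip the first line
        in_line = in_file.readline()  # second line, already decoded
    return in_line.count(u'",')
```

If you know the byte order and the file has no BOM,
pass "utf-16-le" or "utf-16-be" as the encoding instead.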

A clue: a utf-16 or utf-32 file normally has a BOM
(byte order mark) as the first character. The Unicode
standard doesn't strictly require one: utf-16 without
a BOM is supposed to be read as big-endian. If there's
no BOM, try ascii and utf-8 in that order. The first
one that succeeds is most likely correct. If neither
succeeds, you're on your own in guessing the file
encoding.
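
The autodetect logic could look roughly like this.
Treat it as a sketch under assumptions: guess_encoding
and decode_guessing are made-up names, the raw bytes
are assumed to be in memory already, and the check for
a utf-8 BOM is an extra that isn't strictly part of the
recipe above. The BOM constants come from the standard
codecs module.

```python
import codecs

# Longer BOMs go first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so checking UTF-16 first would misfire.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),   # extra: utf-8 with a BOM
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw):
    """Return a codec name for the byte string raw, or None.

    Hypothetical helper: a BOM wins; otherwise try ascii, then utf-8.
    """
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    for name in ("ascii", "utf-8"):
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            pass
    return None  # neither worked: the caller has to guess


def decode_guessing(raw):
    """Decode raw with the guessed encoding, or raise ValueError."""
    encoding = guess_encoding(raw)
    if encoding is None:
        raise ValueError("could not guess the file encoding")
    return raw.decode(encoding)
```

The "utf-16", "utf-32" and "utf-8-sig" codecs strip the
BOM while decoding, so the text you get back starts with
the first real character. Read the file in binary mode
("rb") before handing its contents to decode_guessing.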

John Roth



