Unicode string handling problem

Tue Sep 5 23:47:09 EDT 2006

Thanks for your excellent debugging suggestions, John. See below for
my follow-up:

Richard Schulman:
>> The following program fragment works correctly with an ascii input
>> file.
>>
>> But the file I actually want to process is Unicode (utf-16 encoding).
>> The file must be Unicode rather than ASCII or Latin-1 because it
>> contains mixed Chinese and English characters.
>>
>> When I run the program below I get an attribute_count of zero, which
>> is incorrect for the input file, which should give a value of fifteen
>> or sixteen. In other words, the count function isn't recognizing the
>> ", characters in the line being read. Here's the program:
>>...

John Machin:
>Insert
>    print type(in_line)
>    print repr(in_line)
>here [also make the appropriate changes to get the same info from the
>first line], run it again, copy/paste what you get, show us what you
>see.

Here's the revised program, per your suggestion:

=====================================================

# This program processes a UTF-16 input file that is
# to be loaded later into a MySQL table. The input file
# is not yet ready for prime time. The purpose of this
# program is to ready it.

in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # The first line read is a SQL INSERT statement; no
    # processing will be required.
    in_line = in_file.readline()
    print type(in_line)  #For debugging
    print repr(in_line)  #For debugging

    # The second line read is the first data row.	
    in_line = in_file.readline()
    print type(in_line)  #For debugging
    print repr(in_line)  #For debugging

    # For this and subsequent rows, we must count all
    # the < ", > character-pairs in a given line/row.
    # This will provide an n-1 measure of the attributes
    # for a SQL insert of this row. All rows must have
    # sixteen attributes, but some don't yet.
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()

=====================================================

The output of this program, which I ran at the command line,
had to be copied by hand and abridged, but I think I have
included the relevant information:

C:\pythonapps>python graf_correction.py
<type 'str'>
'\xff\xfeI\x00N\x00S...   [the beginning of a SQL INSERT statement]
...\x00U\x00E\x00S\x00\n' [the VALUES keyword at the end of the row,
                           followed by an end-of-line]
<type 'str'>
'\x00\n'                  [uh-oh! For the second row, all we're seeing
                           is an end-of-line character. Is that from
                           the first row? Wasn't the "rU" mode
                           supposed to handle that?]
0                         [the counter value. It's hardly surprising
                           it's only zero, given that most of the row
                           never got loaded, just an eol mark]

J.M.:
>If you're coy about that, then you'll have to find out yourself if it
>has a BOM at the front, and if not whether it's little/big/endian.

The BOM is little-endian, I believe: the file starts with \xff\xfe.

R.S.:
>> Any suggestions?

J.M.:
>1. Read the Unicode HOWTO.
>2. Read the docs on the codecs module ...
>
>You'll need to use
>
>in_file = codecs.open(filepath, mode, encoding="utf16???????")

Right you are. Here is the output produced by so doing:

<type 'unicode'>
u'\ufeffINSERT INTO [...] VALUES\n'
<type 'unicode'>
u'\n' 
0   [The counter value]
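
The leading \ufeff in that output suggests the encoding name given was an
endian-specific one such as "utf-16-le" (an assumption; the exact call is
not shown above). A small sketch of the difference, using hypothetical
sample bytes rather than the real file:

```python
import codecs

# Hypothetical sample bytes: a BOM followed by a UTF-16-LE header row.
raw = codecs.BOM_UTF16_LE + u'INSERT INTO t VALUES\n'.encode('utf-16-le')

# 'utf-16-le' fixes the byte order up front and treats the BOM as data,
# so it survives as a leading U+FEFF character.
kept_bom = raw.decode('utf-16-le')

# 'utf-16' reads the BOM to determine the byte order and then discards it.
stripped_bom = raw.decode('utf-16')

print(repr(kept_bom[0]))
print(repr(stripped_bom[0]))
```

With the generic "utf-16" name the first line should start directly with
the INSERT keyword, so no BOM has to be stripped by hand.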

>It would also be a good idea to get into the habit of using unicode
>constants like u'",'

Right.

>HTH,
>John

Yes, it did. Many thanks! Now I've got to figure out the best way to
handle that \n\n at the end of each row, which the program is
interpreting as two rows. That represents two surprises: first, I
thought that Microsoft files ended lines with \r\n; second, I thought
that Python mode "rU" was supposed to be the universal eol handler and
would handle the \r\n as one mark.
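
One plausible explanation, not confirmed here, is that the newline
translation is being applied to the raw bytes before they are decoded.
The sketch below imitates that with a byte-level replace and then shows
the opposite order, decode first and translate second, using io.open
(available from Python 2.6). The sample text and temporary file are
hypothetical:

```python
import io
import os
import tempfile

# Hypothetical two-row sample with Windows CRLF (\r\n) line endings.
sample_text = u'INSERT INTO t VALUES\r\n("a","b","c"),\r\n'
raw = sample_text.encode('utf-16')  # the 'utf-16' codec writes a BOM

# Translating newlines on the raw bytes turns each \r into \n before
# decoding, so every CRLF becomes two line feeds.
translated_first = raw.replace(b'\r', b'\n').decode('utf-16')

# Decoding first and translating afterwards collapses \r\n into one \n.
# io.open with newline=None does this when given an encoding.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as out_file:
        out_file.write(raw)
    with io.open(path, 'r', encoding='utf-16', newline=None) as in_file:
        rows = in_file.readlines()
finally:
    os.remove(path)

print(repr(translated_first))
print(rows)
```

If that is what is going on, opening the file without the "U" flag and
stripping the trailing u'\r\n' from each decoded line would be another
way to get one row per line.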

Richard Schulman