Unicode string handling problem

Thu Sep 7 12:56:57 EDT 2006

Many thanks for your help, John, in giving me the tools to work
successfully in Python with Unicode from here on out.

It turns out that the Unicode input files I was working with (from MS
Word and MS Notepad) were indeed creating eol sequences of \r\n, not
\n\n as I had originally thought. The file reading statement that I
was using, with unpredictable results, was

#in_file = codecs.open("c:\\pythonapps\\in-graf2.my","rU",encoding="utf-16LE")

This was reading to the \n on first read (outputting the whole line,
including the \n but, weirdly, not the preceding \r). Then, also
weirdly, the next readline would read the same \n again, interpreting
that as the entirety of a phantom second line. So each input file line
ended up producing two output lines.
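
One plausible reading of this symptom, assuming the "U" translation was
applied to the raw bytes before the UTF-16 decoding (which is what the
plain open(..., 'rU') output in John's quoted session suggests): in
UTF-16LE, \r\n is stored as the four bytes 0D 00 0A 00, so the \r and \n
bytes are never adjacent and the lone 0D byte gets rewritten to 0A. The
sketch below emulates that translation by hand. The helper name
byte_level_universal_newlines is hypothetical, and this is an
illustration of the effect, not a description of how codecs.open is
implemented.

```python
def byte_level_universal_newlines(raw):
    """Hypothetical helper: mimic "U" mode applied to undecoded bytes."""
    # Adjacent CR LF bytes collapse to LF; any remaining lone CR becomes LF.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

# Two rows terminated by \r\n, encoded as UTF-16LE (no BOM, for brevity).
raw = u"row one\r\nrow two\r\n".encode("utf-16-le")

translated = byte_level_universal_newlines(raw)
text = translated.decode("utf-16-le")

# Each \r\n has turned into \n\n, so every row is followed by an empty line.
lines = text.splitlines(True)
print(lines)
```

Under that assumption, every real row is followed by a line holding only
\n, which matches the doubled readlines described above.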

Once the mode string "rU" was dropped, as in

in_file = codecs.open("c:\\pythonapps\\in-graf2.my",encoding="utf-16LE")

all suddenly became well: no more doubled readlines, and one could see
the \r\n termination of each line.
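
For the record, a minimal sketch of the working approach. It combines
dropping "rU" with John's suggestion of using the plain utf-16 codec so
that the BOM is consumed. The function name read_utf16_lines and the
sample file are made up for illustration; they are not from my actual
program.

```python
import codecs
import os
import tempfile

def read_utf16_lines(path):
    """Hypothetical helper: return the rows of a UTF-16 file, BOM and EOL removed."""
    # No "U" in the mode; the 'utf-16' codec reads the BOM and discards it.
    with codecs.open(path, "r", encoding="utf-16") as f:
        text = f.read()
    # splitlines() treats \r\n as a single line break.
    return text.splitlines()

# Build a small sample file the way Notepad would: BOM plus \r\n line ends.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.my")
with open(sample_path, "wb") as out:
    out.write(codecs.BOM_UTF16_LE + u"row one\r\nrow two\r\n".encode("utf-16-le"))

lines = read_utf16_lines(sample_path)
print(lines)
```

Splitting after decoding means the \r\n pair is only ever interpreted as
text, never as raw bytes.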

This behavior of "rU" was not at all what I had expected from the
brief discussion of it in _Python Cookbook_. Which all goes to point
out how difficult it is to cook challenging dishes with sketchy
recipes alone. There is no substitute for the helpful advice of an
experienced chef.

-Richard Schulman
 (remove "xx" for email reply)

On 5 Sep 2006 22:29:59 -0700, "John Machin" <sjmachin at lexicon.net>
wrote:

>Richard Schulman wrote:
>[big snip]
>>
>> The BOM is little-endian, I believe.
>Correct.
>
>> >in_file = codecs.open(filepath, mode, encoding="utf16???????")
>>
>> Right you are. Here is the output produced by so doing:
>
>You don't say which encoding you used, but I guess that you used
>utf_16_le.
>
>>
>> <type 'unicode'>
>> u'\ufeffINSERT INTO [...] VALUES\n'
>
>Use utf_16 -- it will strip off the BOM for you.
>
>> <type 'unicode'>
>> u'\n'
>> 0   [The counter value]
>>
>[snip]
>> Yes, it did. Many thanks! Now I've got to figure out the best way to
>> handle that \n\n at the end of each row, which the program is
>> interpreting as two rows.
>
>Well we don't know yet exactly what you have there. We need a byte dump
>of the first few bytes of your file. Get into the interactive
>interpreter and do this:
>
>open('yourfile', 'rb').read(200)
>(the 'b' is for binary, in case you are on Windows)
>That will show us exactly what's there, without *any* EOL
>interpretation at all.
>
>
>> That represents two surprises: first, I
>> thought that Microsoft files ended as \n\r ;
>
>Nah. Wrong on two counts. In text mode, Microsoft *lines* end in \r\n
>(not \n\r); *files* may end in ctrl-Z aka chr(26) -- an inheritance
>from CP/M.
>
>Ummmm ... are you saying the file has \n\r at the end of each row?? How
>did you know that if you didn't know what if any BOM it had??? Who
>created the file????
>
>> second, I thought that
>> Python mode "rU" was supposed to be the universal eol handler and
>> would handle the \n\r as one mark.
>
>Nah again. It contemplates only \n, \r, and \r\n as end of line. See
>the docs. Thus \n\r becomes *two* newlines when read with "rU".
>
>Having "\n\r" at the end of each row does fit with your symptoms:
>
>| >>> bom = u"\ufeff"
>| >>> guff = '\n\r'.join(['abc', 'def', 'ghi'])
>| >>> guffu = unicode(guff)
>| >>> import codecs
>| >>> f = codecs.open('guff.utf16le', 'wb', encoding='utf_16_le')
>| >>> f.write(bom+guffu)
>| >>> f.close()
>
>| >>> open('guff.utf16le', 'rb').read() #### see exactly what we've got
>| '\xff\xfea\x00b\x00c\x00\n\x00\r\x00d\x00e\x00f\x00\n\x00\r\x00g\x00h\x00i\x00'
>
>| >>> codecs.open('guff.utf16le', 'r', encoding='utf_16').read()
>| u'abc\n\rdef\n\rghi' ######### Look, Mom, no BOM!
>
>| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16').read()
>| u'abc\n\ndef\n\nghi' #### U means \r -> \n
>
>| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16_le').read()
>| u'\ufeffabc\n\ndef\n\nghi' ######### reproduces your second experience
>
>| >>> open('guff.utf16le', 'rU').readlines()
>| ['\xff\xfea\x00b\x00c\x00\n', '\x00\n', '\x00d\x00e\x00f\x00\n',
>| '\x00\n', '\x00g\x00h\x00i\x00']
>| >>> f = open('guff.utf16le', 'rU')
>| >>> f.readline()
>| '\xff\xfea\x00b\x00c\x00\n'
>| >>> f.readline()
>| '\x00\n' ######### reproduces your first experience
>| >>> f.readline()
>| '\x00d\x00e\x00f\x00\n'
>| >>>
>
>If that file is a one-off, you can obviously fix it by
>throwing away every second line. Otherwise, if it's an ongoing
>exercise, you need to talk sternly to the file's creator :-)
>
>HTH,
>John
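
The byte-dump advice in the quoted message can be turned into a small
check. What follows is only a sketch: the helper name
describe_utf16_sample is hypothetical, it assumes the file really is
UTF-16, it guesses little-endian when no BOM is present, and the counts
are rough (a run such as \r\n\r\n also contains one \n\r).

```python
import codecs

def describe_utf16_sample(raw):
    """Hypothetical helper: report the BOM and EOL style of raw UTF-16 bytes."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        bom, encoding = "little-endian", "utf-16-le"
    elif raw.startswith(codecs.BOM_UTF16_BE):
        bom, encoding = "big-endian", "utf-16-be"
    else:
        # Assumption: with no BOM, guess little-endian.
        bom, encoding = None, "utf-16-le"
    body = raw[2:] if bom else raw
    # "replace" keeps a sample that was cut mid-character from raising.
    text = body.decode(encoding, "replace")
    return {
        "bom": bom,
        "crlf": text.count(u"\r\n"),  # normal Windows line ends
        "lfcr": text.count(u"\n\r"),  # the reversed form discussed above
    }

# Same bytes as in the quoted session: BOM, then rows joined by \n\r.
rows = [u"abc", u"def", u"ghi"]
raw = codecs.BOM_UTF16_LE + u"\n\r".join(rows).encode("utf-16-le")
report = describe_utf16_sample(raw[:200])
print(report)
```

On a real file, the argument would be open('yourfile', 'rb').read(200),
as in the quoted advice, so that no EOL interpretation happens before
the bytes are examined.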