Unicode string handling problem

Wed Sep 6 01:29:59 EDT 2006

Richard Schulman wrote:
[big snip]
>
> The BOM is little-endian, I believe.
Correct.

> >in_file = codecs.open(filepath, mode, encoding="utf16???????")
>
> Right you are. Here is the output produced by so doing:

You don't say which encoding you used, but I guess that you used
utf_16_le.

>
> <type 'unicode'>
> u'\ufeffINSERT INTO [...] VALUES\n'

Use utf_16 -- it will use the BOM to work out the byte order and then
strip it off for you.

> <type 'unicode'>
> u'\n'
> 0   [The counter value]
>
[snip]
> Yes, it did. Many thanks! Now I've got to figure out the best way to
> handle that \n\n at the end of each row, which the program is
> interpreting as two rows.

Well, we don't yet know exactly what you have there. We need a byte dump
of the first few bytes of your file. Get into the interactive
interpreter and do this:

open('yourfile', 'rb').read(200)
(the 'b' is for binary, in case you are on Windows)
That will show us exactly what's there, without *any* EOL
interpretation at all.
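
If you want to check the dump for a BOM without eyeballing it, here is a
rough sketch. sniff_bom is a name I made up for illustration, not
something in the standard library, and the sample bytes are invented to
look like the start of a little-endian UTF-16 file:

```python
import codecs

def sniff_bom(raw):
    """Return a codec name for the BOM that `raw` starts with, or None."""
    # Hypothetical helper. It checks only the UTF-8 and UTF-16 BOMs;
    # a UTF-32-LE file also starts with '\xff\xfe' and would be
    # misreported here as utf_16_le.
    if raw.startswith(codecs.BOM_UTF8):
        return 'utf_8'
    if raw.startswith(codecs.BOM_UTF16_LE):
        return 'utf_16_le'
    if raw.startswith(codecs.BOM_UTF16_BE):
        return 'utf_16_be'
    return None

# Invented sample: LE BOM, then "ab", then \n and \r as 2-byte units.
sample = b'\xff\xfea\x00b\x00\n\x00\r\x00'
print(repr(sample))
print(sniff_bom(sample))
```

On the real file you would pass in open('yourfile', 'rb').read(200)
instead of the invented sample.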

> That represents two surprises: first, I
> thought that Microsoft files ended as \n\r ;

Nah. Wrong on two counts. In text mode, Microsoft *lines* end in \r\n
(not \n\r); *files* may end in ctrl-Z aka chr(26) -- an inheritance
from CP/M.

Ummmm ... are you saying the file has \n\r at the end of each row?? How
did you know that if you didn't know what if any BOM it had??? Who
created the file????

> second, I thought that
> Python mode "rU" was supposed to be the universal eol handler and
> would handle the \n\r as one mark.

Nah again. It recognizes only \n, \r, and \r\n as end of line. See
the docs. Thus \n\r is read as a \n followed by a separate \r, which
makes *two* newlines when read with "rU".
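
You can see the same rule without touching a file. This is only a
sketch: it uses splitlines() on already-decoded text rather than "rU"
itself, on the assumption that the two treat \n, \r, and \r\n the same
way for this input:

```python
# Two rows joined by different terminators.
crlf = u'abc\r\ndef'   # Windows convention: \r\n counts as one line break
lfcr = u'abc\n\rdef'   # reversed order: \n ends a line, then \r ends another

print(crlf.splitlines())
print(lfcr.splitlines())
```

The first call gives two rows; the second gives three, with an empty
string in the middle where the stray \r was counted as its own line end.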

Having "\n\r" at the end of each row does fit with your symptoms:

| >>> bom = u"\ufeff"
| >>> guff = '\n\r'.join(['abc', 'def', 'ghi'])
| >>> guffu = unicode(guff)
| >>> import codecs
| >>> f = codecs.open('guff.utf16le', 'wb', encoding='utf_16_le')
| >>> f.write(bom+guffu)
| >>> f.close()

| >>> open('guff.utf16le', 'rb').read() #### see exactly what we've got
| '\xff\xfea\x00b\x00c\x00\n\x00\r\x00d\x00e\x00f\x00\n\x00\r\x00g\x00h\x00i\x00'

| >>> codecs.open('guff.utf16le', 'r', encoding='utf_16').read()
| u'abc\n\rdef\n\rghi' ######### Look, Mom, no BOM!

| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16').read()
| u'abc\n\ndef\n\nghi' #### U means \r -> \n

| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16_le').read()
| u'\ufeffabc\n\ndef\n\nghi' ######### reproduces your second experience

| >>> open('guff.utf16le', 'rU').readlines()
| ['\xff\xfea\x00b\x00c\x00\n', '\x00\n', '\x00d\x00e\x00f\x00\n', '\x00\n', '\x00g\x00h\x00i\x00']
| >>> f = open('guff.utf16le', 'rU')
| >>> f.readline()
| '\xff\xfea\x00b\x00c\x00\n'
| >>> f.readline()
| '\x00\n' ######### reproduces your first experience
| >>> f.readline()
| '\x00d\x00e\x00f\x00\n'
| >>>

If that file is a one-off, you can obviously fix it by
throwing away every second line. Otherwise, if it's an ongoing
exercise, you need to talk sternly to the file's creator :-)
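
If you would rather repair the text than discard lines, one possible
approach is to decode first and only then deal with the row breaks. This
is a sketch under assumptions, not a tested fix for your file: read_rows
is a made-up name, it assumes every row break is exactly \n\r as in the
guff example, and it uses io.open, so it needs a Python that has the io
module:

```python
import io
import os
import tempfile

def read_rows(path):
    """Decode a BOM-prefixed UTF-16 file and split it into rows on "\\n\\r"."""
    # Hypothetical helper. newline='' switches off newline translation,
    # so the "\n\r" pairs arrive exactly as they are in the file.
    with io.open(path, 'r', encoding='utf_16', newline='') as f:
        text = f.read()
    # Collapse each "\n\r" pair into a single "\n", then split on it.
    return text.replace(u'\n\r', u'\n').split(u'\n')

# Usage: build a small file shaped like guff.utf16le, then read it back.
fd, sample_path = tempfile.mkstemp(suffix='.utf16le')
os.close(fd)
with io.open(sample_path, 'wb') as f:
    f.write(b'\xff\xfe' + u'abc\n\rdef\n\rghi'.encode('utf_16_le'))
rows = read_rows(sample_path)
os.remove(sample_path)
print(rows)
```

Because the utf_16 codec consumes the BOM, the first row comes back
without a leading u'\ufeff'.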

HTH,
John