Unicode string handling problem

John Machin sjmachin at lexicon.net
Thu Sep 7 18:37:02 EDT 2006


Richard Schulman wrote:
> It turns out that the Unicode input files I was working with (from MS
> Word and MS Notepad) were indeed creating eol sequences of \r\n, not
> \n\n as I had originally thought. The file reading statement that I
> was using, with unpredictable results, was
>
> #in_file = codecs.open("c:\\pythonapps\\in-graf2.my","rU",encoding="utf-16LE")
>
> This was reading to the \n on first read (outputting the whole line,
> including the \n but, weirdly, not the preceding \r). Then, also
> weirdly, the next readline would read the same \n again, interpreting
> that as the entirety of a phantom second line. So each input file line
> ended up producing two output lines.
>
> Once the mode string "rU" was dropped, as in
>
> in_file = codecs.open("c:\\pythonapps\\in-graf2.my",encoding="utf-16LE")
>
> all suddenly became well: no more doubled readlines, and one could see
> the \r\n termination of each line.

You are on Windows. I would *not* describe as "well" lines read in (the
default) text mode ending in u"\r\n". I would expect text mode to convert
the line endings to u"\n". At best, this should be documented. Perhaps
someone with some knowledge of the intended treatment of line endings
by codecs.open() in text mode could comment? The two problems are
succinctly described below:

File created in Windows Notepad and saved with "Unicode" encoding.
This results in UTF-16LE encoding; the line terminator is CR LF, and
there is a BOM (LE) at the front -- as shown below.

| Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
| Type "help", "copyright", "credits" or "license" for more information.
| >>> open('notepad_uc.txt', 'rb').read()
| '\xff\xfea\x00b\x00c\x00\r\x00\n\x00d\x00e\x00f\x00\r\x00\n\x00g\x00h\x00i\x00\r\x00\n\x00'
| >>> import codecs
| >>> codecs.open('notepad_uc.txt', 'r', encoding='utf_16_le').readlines()
| [u'\ufeffabc\r\n', u'def\r\n', u'ghi\r\n']
| >>> codecs.open('notepad_uc.txt', 'r', encoding='utf_16').readlines()
| [u'abc\r\n', u'def\r\n', u'ghi\r\n']
### presence of u'\r' was *not* expected
| >>> codecs.open('notepad_uc.txt', 'rU', encoding='utf_16_le').readlines()
| [u'\ufeffabc\n', u'\n', u'def\n', u'\n', u'ghi\n', u'\n']
| >>> codecs.open('notepad_uc.txt', 'rU', encoding='utf_16').readlines()
| [u'abc\n', u'\n', u'def\n', u'\n', u'ghi\n', u'\n']
### 'U' flag does change the behaviour, but *not* as expected.
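
One reading consistent with the session above (an assumption, not
something confirmed here) is that with 'U' the newline translation is
applied to the raw bytes before decoding, so the 0x0D byte of each
UTF-16LE u'\r' becomes 0x0A and every CR LF decodes as two line feeds.
The sketch below, written for Python 3 rather than the 2.4.3 shown
above, reproduces the doubled lines that way and contrasts it with
decoding first and converting line endings afterwards. Both helper
functions are hypothetical names introduced only for illustration.

```python
# Bytes of the Notepad "Unicode" file, copied from the session above.
raw = (b'\xff\xfe'
       b'a\x00b\x00c\x00\r\x00\n\x00'
       b'd\x00e\x00f\x00\r\x00\n\x00'
       b'g\x00h\x00i\x00\r\x00\n\x00')


def translate_newlines_bytewise(data):
    """Hypothetical helper: universal-newline translation on raw bytes.

    This is an assumed model of what 'U' does here, not a confirmed one.
    """
    return data.replace(b'\r\n', b'\n').replace(b'\r', b'\n')


def decode_then_normalize(data, encoding='utf_16'):
    """Hypothetical helper: decode first, then convert line endings."""
    text = data.decode(encoding)
    return text.replace('\r\n', '\n').replace('\r', '\n')


# Translating bytes first turns '\r\x00\n\x00' into '\n\x00\n\x00'.
doubled = translate_newlines_bytewise(raw).decode('utf_16').splitlines(True)

# Decoding first keeps each CR LF pair together until it is converted.
clean = decode_then_normalize(raw).splitlines(True)

print(doubled)  # ['abc\n', '\n', 'def\n', '\n', 'ghi\n', '\n']
print(clean)    # ['abc\n', 'def\n', 'ghi\n']
```

If that model is right, converting line endings only after decoding
avoids both the stray u'\r' and the phantom empty lines.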

Cheers,
John



