[I18n-sig] codecs module, readlines and xreadlines

Martin v. Löwis martin@v.loewis.de
16 Jan 2003 17:08:28 +0100


Poor Yorick <gp@pooryorick.com> writes:

> The thing is that I AM processing text data.  It just happens to be
> Unicode text data.  The example I used turns into perfectly legible
> Chinese characters once it's decoded in Python.  I think that people
> using the codecs module on Windows to read Unicode text files would
> expect codecs.open.readlines to behave exactly like the builtin
> open.readlines.  

Would you like to work on a patch to fix this problem?

> open.readlines automatically removes the "\r" character on Windows
> systems when the file is opened and read in text mode, and inserts a
> \r character when a \n is written to a file, so to be consistent,
> codecs.open.readlines should do the same thing and remove \x00\r
> when the file is opened in text mode.

It is not Python code that does this, though: the Microsoft C library
performs the removal and insertion of \r. For Unicode data this does
not help, because we cannot open the file in text mode: the C library
would *still* remove only the \r byte, leaving us with an extra null byte.
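
To illustrate the stray null byte, here is a minimal sketch in present-day
Python. The helper strip_cr_bytes is hypothetical: it is a simplified
stand-in for a byte-oriented filter that knows nothing about the encoding,
not the actual C library code, and UTF-16-LE is assumed as the file encoding.

```python
# "a\r\nb" in UTF-16-LE is the byte sequence 61 00 0d 00 0a 00 62 00.
raw = "a\r\nb".encode("utf-16-le")


def strip_cr_bytes(data):
    """Hypothetical stand-in for a byte-level filter: drop every 0x0D byte."""
    return data.replace(b"\r", b"")


# Only the 0d byte is removed; the 00 that belonged to it stays behind,
# giving 61 00 00 0a 00 62 00 (seven bytes, no longer aligned on code units).
stripped = strip_cr_bytes(raw)

try:
    stripped.decode("utf-16-le")
    decodable = True
except UnicodeDecodeError:
    # The odd number of bytes cannot be decoded as UTF-16.
    decodable = False
```

Any newline handling therefore has to happen on the decoded characters,
not on the raw bytes.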

Note that a similar problem exists on the Mac, where \r should be
replaced by \n.
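
One way to handle both cases is to read the bytes untranslated, decode
them, and only then normalize the line endings. The following is a sketch
under assumptions, not a proposed patch: decode_and_normalize is a
hypothetical helper, and the encodings are chosen only for the example.

```python
def decode_and_normalize(data, encoding):
    """Decode raw bytes first, then map \\r\\n and bare \\r to \\n."""
    text = data.decode(encoding)
    # Order matters: handle the Windows pair before the bare Mac \r,
    # otherwise each \r\n would turn into two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Bytes as they would come from a file opened in binary mode.
windows_bytes = "one\r\ntwo\r\n".encode("utf-16-le")
mac_bytes = "one\rtwo\r".encode("utf-16-be")

windows_lines = decode_and_normalize(windows_bytes, "utf-16-le").splitlines(True)
mac_lines = decode_and_normalize(mac_bytes, "utf-16-be").splitlines(True)
```

Both inputs end up as the same list of lines, each terminated by a
single \n.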

Regards,
Martin