detecting newline character

Chris Rebert clp2 at rebertia.com
Sat Apr 23 15:12:20 EDT 2011


On Sat, Apr 23, 2011 at 11:09 AM, Daniel Geržo <danger at rulez.sk> wrote:
> Hello guys,
>
> I need to detect the newline characters used in the file I am reading. For
> this purpose I am using the following code:
>
> def _read_lines(self):
>    with contextlib.closing(codecs.open(self.path, "rU")) as fobj:
>        fobj.readlines()
>        if isinstance(fobj.newlines, tuple):
>            self.newline = fobj.newlines[0]
>        else:
>            self.newline = fobj.newlines
>
> This works fine if I call codecs.open() without an encoding argument; I am
> testing with an ASCII English text file, and in that case fobj.newlines
> is correctly detected as '\r\n'. However, when I call codecs.open()
> with the encoding='ascii' argument, fobj.newlines is None and I can't figure
> out why that is the case. Reading the PEP at
> http://www.python.org/dev/peps/pep-0278/ I don't see any reason why I would
> end up with newlines being None after I call readlines().
>
> Does anyone have an idea?

I would hypothesize that it's an interaction bug between universal
newlines and codecs.open().

http://docs.python.org/library/codecs.html#codecs.open :
"Note: Files are always opened in binary mode, even if no binary mode
was specified. This is done to avoid data loss due to encodings using
8-bit values. This means that no automatic conversion of '\n' is done
on reading and writing."
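
Since no newline conversion happens in binary mode, one option is to
inspect the raw bytes directly. The following is only a minimal sketch:
the function name is made up for illustration, and it assumes an
ASCII-compatible encoding (it would not be right for UTF-16, for
example).

```python
def detect_newlines_in_bytes(data):
    """Return the set of newline byte sequences found in `data`.

    Hypothetical helper; assumes an ASCII-compatible encoding.
    """
    found = set()
    # Count CRLF pairs first so that their bytes are not also
    # reported as a bare CR or a bare LF.
    crlf = data.count(b"\r\n")
    if crlf:
        found.add(b"\r\n")
    # Any CR beyond those belonging to a CRLF pair is a bare CR.
    if data.count(b"\r") > crlf:
        found.add(b"\r")
    # Likewise for LF.
    if data.count(b"\n") > crlf:
        found.add(b"\n")
    return found
```

The data would come from reading the file in "rb" mode; the result is a
set because a file can mix several newline styles.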

Meanwhile, the docs for the built-in open(), at least as I interpret
them, say that "U" and "rU" (which mean the same thing) are the only
sensible `mode` values involving universal newlines.

I would speculate that the upshot of this is that codecs.open() ends
up calling built-in open() with a nonsense `mode` of "rUb" or similar,
resulting in strange behavior.

If this explanation is correct, then there are two bugs:
1. Built-in open() should treat "b" and "U" as mutually exclusive and
reject mode strings which involve both.
2. codecs.open() should either reject modes involving "U", or be fixed
so that they work as expected.
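
In the meantime, a possible workaround (a sketch only, assuming Python
2.6 or later, where the io module is available) is to use io.open()
instead of codecs.open(). In text mode with the default newline=None it
decodes the bytes and then applies universal-newline handling, and the
file object's `newlines` attribute records what was seen. The function
name and the default encoding below are just for illustration.

```python
import io


def detect_newline(path, encoding="ascii"):
    """Return the first newline style seen in the file, or None.

    Hypothetical helper mirroring the logic of the original
    _read_lines() method.
    """
    # Text mode with newline=None (the default) enables universal
    # newlines on the decoded text and tracks the styles encountered.
    with io.open(path, "r", encoding=encoding) as fobj:
        fobj.read()
        newlines = fobj.newlines
    # `newlines` is a tuple when several styles were seen, a single
    # string when only one was seen, and None when there were none.
    if isinstance(newlines, tuple):
        return newlines[0]
    return newlines
```

As with the original code, only the first entry is kept when the file
mixes newline styles.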

Cheers,
Chris
--
http://blog.rebertia.com


