[Python-Dev] PEP 385: the eol-type issue

Thu Aug 6 12:46:44 CEST 2009

M.-A. Lemburg wrote:
> Nick Coghlan wrote:
>> Antoine Pitrou wrote:
>>> M.-A. Lemburg <mal <at> egenix.com> writes:
>>>> Please file a bug report for this. f.readlines() (or rather
>>>> the io layer) should be using Py_UNICODE_ISLINEBREAK(ch)
>>>> for detecting line break characters.
>>>
>>> Actually, no. It has been designed from the start to only recognize the
>>> "standard" line break representations found in common formats/protocols (CR, LF
>>> and CR+LF).
>>> People wanting to split on arbitrary unicode line breaks should use
>>> str.splitlines().
>>
>> The fairly long-standing RFE relating to an arbitrarily selectable
>> newline separator seems relevant here:
>> http://bugs.python.org/issue1152248
>>
>> As with the discussion there, the problem with using str.splitlines is
>> that it prevents pipelining approaches that avoid reading a whole file
>> into memory.
>>
>> While removing the validity check from readlines() completely is
>> questionable (the readrecords() approach mentioned in the tracker issue
>> would still be better there), loosening the validity check to be based
>> on Py_UNICODE_ISLINEBREAK seems a bit more feasible. (I'd still call it
>> a feature request rather than a bug, though.)
> 
> I've had a look at the io implementation: this appears to be
> based on the universal newline support idea which addresses
> only a fixed set of "new line" character combinations and is
> not as straightforward to extend to support all Unicode
> line break characters as I thought.
> 
> What I don't understand is why the io layer tries to reinvent
> the wheel here instead of just using the codec's .readline()
> method - which *does* use .splitlines() and has full support
> for all Unicode line break characters (including the CRLF
> combination).

... and because of this, the feature is already available if
you use codecs.open() instead of the built-in open():

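import codecs

# Write a file that mixes LF with U+2029 (PARAGRAPH SEPARATOR).
with codecs.open("x.txt", "w", encoding="utf-8") as f:
    f.write("a\nb\u2029c\n")

# The codec's readlines() splits on every Unicode line break.
with codecs.open("x.txt", "r", encoding="utf-8") as f:
    for n, l in enumerate(f.readlines(), 1):
        print(n, repr(l))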

This prints:

1 'a\n'
2 'b\u2029'
3 'c\n'
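
For comparison, here is a small sketch of what the built-in open() does
with the same text, next to str.splitlines(). The temporary file and the
variable names are only illustrative; the point is that the io layer
splits on CR, LF and CR+LF only, as Antoine describes above.

```python
import os
import tempfile

text = "a\nb\u2029c\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "x.txt")  # illustrative file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    # Built-in open(): universal newlines only (CR, LF, CR+LF).
    with open(path, "r", encoding="utf-8") as f:
        builtin_lines = f.readlines()

# str.splitlines() recognizes all Unicode line breaks, including U+2029.
unicode_lines = text.splitlines(keepends=True)

print(builtin_lines)   # ['a\n', 'b\u2029c\n']
print(unicode_lines)   # ['a\n', 'b\u2029', 'c\n']
```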

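On Nick's point about pipelining: a rough sketch of how Unicode line
splitting could be done chunk by chunk, without reading the whole file
into memory. The helper name iter_unicode_lines and its chunk_size
parameter are made up for illustration; this is not how the io layer or
the codecs module implement it, just one way to combine read() with
str.splitlines().

```python
import io


def iter_unicode_lines(stream, chunk_size=8192):
    """Yield lines from a text stream, splitting on all Unicode line breaks.

    Hypothetical helper: `stream` is any object whose read(n) returns str.
    """
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        lines = pending.splitlines(keepends=True)
        # Hold back the last piece: it may be an incomplete line, or end
        # in "\r" that is completed by "\n" at the start of the next chunk.
        pending = lines.pop()
        yield from lines
    if pending:
        # At most one line is left over at end of input.
        yield pending


# A tiny chunk size forces line breaks to straddle chunk boundaries.
sample = io.StringIO("a\nb\u2029c\n")
print(list(iter_unicode_lines(sample, chunk_size=2)))
# ['a\n', 'b\u2029', 'c\n']
```

The sketch re-splits the held-back text on every chunk, so it is best
suited to input with reasonably short lines.
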
-- 
Marc-Andre Lemburg
eGenix.com
