[issue22232] str.splitlines splitting on non-\r\n characters

Terry J. Reedy report at bugs.python.org
Sun Aug 24 00:01:19 CEST 2014


Terry J. Reedy added the comment:

I was not aware of the remainder of the undocumented behavior. Thanks for the code that makes it clear .

linebreak (or linebreaks)=True means that splitting occurs on some (approximation?*) of unicode mandatory linebreaks, as opposed to just the ascii 'universal newline' sequences, as defined in our glossary. Possible alternative: restrict=False (restrict to u. newlines?)

*I did not read the annex in enough detail to know either way.

The following pair of experiments, which I should have run before, show that there has been no real change of behavior from 2.x to 3.x.
# 2.7.8
>>> u'a\x0ab\x0bc\x0cd\x0dda\x0d\x0a1c\x1c1d\x1d1e\x1e85\x852028\u20282029\u2029end'.splitlines()
[u'a', u'b', u'c', u'd', u'da', u'1c', u'1d', u'1e', u'85', u'2028', u'2029', u'end']
# 3.4.1
b'a\x0ab\x0bc\x0cd\x0dda\x0d\x0a1c\x1c1d\x1d1e\x1e85\x85end'.splitlines()
[b'a', b'b\x0bc\x0cd', b'da', b'1c\x1c1d\x1d1e\x1e85\x85end']

Given this, I am a bit dubious about adding a new parameter in 3.5 to make the unicode method act like the bytes method. Part of my support for that was thinking that it would help porting code. But that is not true. In both 2 and 3, there is the possibility to latin-1 encode, split, and latin-1 decode the pieces.

The doc correction clearly needed is that the 3.4+ universal newlines glossary entry needs to be updated from 'str.splitlines' to 'bytes.newlines'. I will try to do this.

A second doc problem is that the docstrings (given by help(x.splitlines) are exactly the same for bytes.splitlines and unicode.splitlines, in both 2.x and 3.x, even though the behavior is different even for ascii. I think each should list what they split on.  Ditto for the doc is not already.  This should have a patch posted for review.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue22232>
_______________________________________


More information about the Python-bugs-list mailing list