[Python-Dev] Python and the Unicode Character Database

Thu Dec 2 22:34:34 CET 2010

On Thu, Dec 2, 2010 at 1:55 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
..
> I don't think so.  str.split() and str.splitlines() are also defined in
> conformance to the SPEC, AFAIK.  They certainly try to.

You are joking, right?  Where exactly does Unicode specify something like this:

>>> ''.join('𐌀𐌁𐌂'.split('\udf00\ud800'))
'𐌁𐌂'
?
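
The result above comes from a narrow (UTF-16) build, where each of these
letters is stored as a surrogate pair and the separator happens to match
the tail of one pair plus the head of the next.  A minimal sketch that
reproduces it on any build by working on UTF-16 code units directly (the
variable names are illustrative only, not anything from the interpreter):

```python
# Simulate a narrow build: treat the string as a sequence of UTF-16 code units.
text = '\U00010300\U00010301\U00010302'   # the three letters used above
units = text.encode('utf-16-be')          # D800 DF00  D800 DF01  D800 DF02
sep = bytes.fromhex('df00d800')           # low surrogate of letter 1 + high surrogate of letter 2

# Splitting on the straddling pair leaves a lone D800 in front, which then
# recombines with the DF01 that follows it once the pieces are joined.
parts = units.split(sep)
rejoined = b''.join(parts).decode('utf-16-be')
print(ascii(rejoined))

# Caveat: bytes.split() could also match at an odd byte offset in general;
# for this particular input the only match is aligned on a code unit.

# On a build that stores whole code points, the separator never occurs in
# the string, so split() returns it unchanged as a single piece.
wide_parts = text.split('\udf00\ud800')
print(len(wide_parts))
```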

OK, splitting on a given separator has very little to do with Unicode
or the UCD, but str.splitlines() makes absolutely no attempt to conform
to Unicode Standard Annex #14 ("Unicode Line Breaking Algorithm").
Wait, UAX #14 is actually relevant to the textwrap module, which has
seen very little change since the 2.x days.  So, what exactly does
str.splitlines() do?  And which part of the Unicode standard defines
how it differs from str.split(.., '\n')?  The reference manual does not
help me here either:

"""
str.splitlines([keepends])

Return a list of the lines in the string, breaking at line boundaries.
Line breaks are not included in the resulting list unless keepends is
given and true.
""" http://docs.python.org/dev/library/stdtypes.html#str.splitlines
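
For comparison, a small sketch of how the two methods differ in practice.
It assumes a current Python 3 interpreter, where str.splitlines() treats
'\r\n' and U+2028 (among others) as line boundaries; the sample string is
made up for illustration:

```python
# One sample string mixing three different line terminators.
text = 'one\ntwo\r\nthree\u2028four\n'

# split('\n') only knows about '\n': it leaves the '\r' attached, ignores
# U+2028, and yields a trailing empty string after the final newline.
by_newline = text.split('\n')

# splitlines() recognises all three terminators and adds no trailing
# empty string.
by_lines = text.splitlines()

# With keepends true, each line keeps the terminator that ended it.
with_ends = text.splitlines(True)

for name, value in [('split', by_newline),
                    ('splitlines', by_lines),
                    ('keepends', with_ends)]:
    print(name, ascii(value))
```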