[Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky
alexander.belopolsky at gmail.com
Thu Dec 2 22:34:34 CET 2010
On Thu, Dec 2, 2010 at 1:55 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
..
> I don't think so. str.split() and str.splitlines() are also defined in
> conformance to the SPEC, AFAIK. They certainly try to.
You are joking, right? Where exactly does Unicode specify something like this:
>>> ''.join('𐌀𐌁𐌂'.split('\udf00\ud800'))
'𐌁𐌂'
?
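For context, that result only appears on a "narrow" build, where a str is stored as UTF-16 code units and each of the three Old Italic letters occupies a surrogate pair. The sketch below reproduces the effect on any Python 3 by working on the UTF-16 code units explicitly. The names units, sep and result are illustrative only, and the code-unit split is an emulation, not what str.split() itself runs.

```python
import struct

# The three Old Italic letters from the session above (U+10300..U+10302).
s = "\U00010300\U00010301\U00010302"

# View the string as UTF-16 code units, the way a narrow build stores it.
raw = s.encode("utf-16-le")
units = list(struct.unpack("<%dH" % (len(raw) // 2), raw))
print([hex(u) for u in units])
# ['0xd800', '0xdf00', '0xd800', '0xdf01', '0xd800', '0xdf02']

# Emulate splitting on the separator '\udf00\ud800' at the code-unit level:
# it matches the low surrogate of the first letter plus the high surrogate
# of the second, and joining the pieces simply drops those two units.
sep = [0xDF00, 0xD800]
i = next(k for k in range(len(units) - 1) if units[k:k + 2] == sep)
joined = units[:i] + units[i + 2:]
result = struct.pack("<%dH" % len(joined), *joined).decode("utf-16-le")
print(result == "\U00010301\U00010302")  # True

# On CPython 3.3 and later (PEP 393) a str holds code points, so the
# surrogate separator never matches and the string comes back unchanged.
modern = "".join(s.split("\udf00\ud800"))
```

The leftover high surrogate D800 pairs up with the low surrogate DF01, which is why the first letter disappears and the second one survives.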
OK, splitting on a given separator has very little to do with Unicode
or UCD, but str.splitlines() makes absolutely no attempt to conform
to Unicode Standard Annex #14 ("Unicode line breaking algorithm").
Wait, UAX #14 is actually relevant to the textwrap module, which has seen
very little change since the 2.x days. So, what exactly does
str.splitlines() do? And which part of the Unicode standard defines how
it is different from str.split(.., '\n')? The reference manual does not
help me here either:
"""
str.splitlines([keepends])
Return a list of the lines in the string, breaking at line boundaries.
Line breaks are not included in the resulting list unless keepends is
given and true.
""" http://docs.python.org/dev/library/stdtypes.html#str.splitlines