[issue7643] What is an ASCII linebreak?
Marc-Andre Lemburg
report at bugs.python.org
Fri Jan 8 22:08:22 CET 2010
Marc-Andre Lemburg <mal at egenix.com> added the comment:
Florent Xicluna wrote:
>
> Florent Xicluna <laxyf at yahoo.fr> added the comment:
>
> It's confusing.
>
> There's a specific annex UAX #14 which defines "Line Breaking Properties".
> Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
> BK, CR, LF, NL
Note that a line breaking algorithm is different from a line
splitting algorithm. The latter separates lines at pre-defined
positions in the text; the former formats a piece of text to fit,
e.g., into a certain width of available character positions.
.splitlines() implements a line splitting algorithm, not a line
breaking one.
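To illustrate the distinction, here is a minimal sketch that uses str.splitlines() for splitting and the standard textwrap module as a stand-in for a line breaking algorithm. The sample text and the width of 20 are arbitrary choices for the example, and textwrap is a simple wrapper, not an implementation of UAX #14.

```python
import textwrap

text = "first line\nsecond line that is quite a bit longer than the first"

# Line splitting: cut only at the line boundaries already present in the text.
split_result = text.splitlines()

# Line breaking: choose break positions so that each line fits a given width.
# (textwrap treats the existing newline as ordinary whitespace by default.)
broken_result = textwrap.wrap(text, width=20)

print(split_result)
print(broken_result)
```

The first result always has one entry per existing line, however long; the second depends only on the requested width.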
> And the resulting list is different:
> CODE  ABBR  NAME                 CAT  BIDI  BRK
> ------------------------------------------------------------------------
> 000A  LF    LINE FEED            Cc   B     LF
> 000B  VT    LINE TABULATION      Cc   S     BK  (since Unicode 5.0)
> 000C  FF    FORM FEED            Cc   WS    BK
> 000D  CR    CARRIAGE RETURN      Cc   B     CR
> 0085  NEL   NEXT LINE            Cc   B     NL  (C1 Control Code)
> 2028  LS    LINE SEPARATOR       Zl   WS    BK
> 2029  PS    PARAGRAPH SEPARATOR  Zp   B     BK
> ------------------------------------------------------------------------
>
> Differences:
> - VT and FF are mandatory breaks (even if “implementations are not
> required to support the VT character”)
> - FS, GS, US are combining marks (CM): “Prohibit a line break between
> the character and the preceding character”
>
> According to this Annex, the current splitlines() implementation violates the Unicode standard.
It appears so, and I guess that's an oversight on my part when
writing the code: in Unicode 2.1 (the version I started with),
FF was marked as "B". Later, Unicode 3.0 was published and
the new LineBreak.txt file was added to the standard. FF was
changed to "WS" and instead marked as "BK" in that new LineBreak.txt
file.
Since we only used the main UnicodeData.txt file as basis for
the type database, the "FF" code point dropped out of the
line break code point set.
I guess we'll have to add FF and VT to the generator makeunicodedata.py
to remedy this.
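One way to check a given interpreter against the table quoted above is to probe str.splitlines() directly. This is only a sketch: the MANDATORY_BREAKS mapping is copied by hand from that table, splits_on() is a hypothetical helper name, the probe range is an arbitrary cut-off that covers all listed code points, and the result depends on the Python version being run.

```python
# Mandatory break code points, copied from the table quoted above
# (UAX #14 classes BK, CR, LF, NL).
MANDATORY_BREAKS = {
    0x000A: "LF",
    0x000B: "VT",
    0x000C: "FF",
    0x000D: "CR",
    0x0085: "NEL",
    0x2028: "LS",
    0x2029: "PS",
}


def splits_on(cp):
    """Return True if str.splitlines() treats code point cp as a line boundary."""
    return len(("a" + chr(cp) + "b").splitlines()) == 2


# Probe every code point below U+3000 (arbitrary upper bound for this sketch).
split_points = {cp for cp in range(0x3000) if splits_on(cp)}

# Code points the table lists but splitlines() ignores, and vice versa.
missing = sorted(set(MANDATORY_BREAKS) - split_points)
extra = sorted(split_points - set(MANDATORY_BREAKS))

print("not split on:", ["%04X %s" % (cp, MANDATORY_BREAKS[cp]) for cp in missing])
print("split on but not in the table:", ["%04X" % cp for cp in extra])
```

Any entry in the first list is a mandatory break that the type database does not know about; any entry in the second is a character such as FS, GS or US that splitlines() treats as a boundary although the annex does not.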
> References:
> - Unicode Standard Annex #14 - Line Breaking Algorithm
> http://www.unicode.org/reports/tr14/
> - UCD LineBreak.txt
> http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt
Thanks,
--
Marc-Andre Lemburg
eGenix.com
----------
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue7643>
_______________________________________