[issue7643] What is an ASCII linebreak?

Fri Jan 8 22:08:22 CET 2010

Marc-Andre Lemburg <mal at egenix.com> added the comment:

Florent Xicluna wrote:
> 
> Florent Xicluna <laxyf at yahoo.fr> added the comment:
> 
> It's confusing.
> 
> There's a specific annex UAX #14 which defines "Line Breaking Properties".
> Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
>   BK, CR, LF, NL

Note that a line breaking algorithm is not the same thing as
a line splitting algorithm. The latter separates lines at
pre-defined positions in the text; the former formats
a piece of text to fit, e.g., into a certain width of available
character positions.

.splitlines() implements a line splitting algorithm, not a line
breaking one.
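
To illustrate the distinction, here is a minimal sketch. It uses
textwrap from the standard library as a stand-in for a line breaking
step; textwrap only breaks at whitespace and is not an implementation
of UAX #14. The sample text and the width of 16 are made up for the
example.

```python
import textwrap

text = "first line\nsecond line that is somewhat longer\n"

# Line splitting: cut only at line boundary characters already in the text.
split_lines = text.splitlines()

# Line breaking (wrapping): choose new break positions so that each line
# fits into a given width. textwrap breaks at whitespace only.
wrapped = textwrap.wrap("second line that is somewhat longer", width=16)

print(split_lines)
print(wrapped)
```

The first result has exactly as many entries as the text has lines;
the second depends on the chosen width, not on any line boundary
characters.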

> And the resulting list is different:
>                                        CAT BIDI BRK
> ------------------------------------------------------------------------
> 000A    LF  LINE FEED                   Cc  B   LF
> 000B    VT  LINE TABULATION             Cc  S   BK (since Unicode 5.0) 
> 000C    FF  FORM FEED                   Cc  WS  BK
> 000D    CR  CARRIAGE RETURN             Cc  B   CR
> 0085    NEL NEXT LINE                   Cc  B   NL (C1 Control Code)
> 2028    LS  LINE SEPARATOR              Zl  WS  BK
> 2029    PS  PARAGRAPH SEPARATOR         Zp  B   BK
> ------------------------------------------------------------------------
>
> Differences:
>  - VT and FF are mandatory breaks (even if “implementations are not
>    required to support the VT character”)
>  - FS, GS, US are combining marks (CM): “Prohibit a line break between
>    the character and the preceding character”
> 
> According to this Annex, the current splitlines() implementation violates the Unicode standard.

It appears so and I guess that's an oversight on my part when
writing the code: in Unicode 2.1 (the version I started with),
FF was marked as "B", later on Unicode 3.0 was published and
the new LineBreak.txt file was added to the standard. FF was
changed to "WS" and instead marked as "BK" in that new LineBreak.txt
file.

Since we only used the main UnicodeData.txt file as basis for
the type database, the "FF" code point dropped out of the
line break code point set.
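
A rough sketch of how this can happen, using the unicodedata module.
The rule in looks_like_linebreak() is an assumption made for
illustration (bidirectional class "B" or category "Zl"); it is not
copied from makeunicodedata.py. The candidate list is the one from
the table quoted above.

```python
import unicodedata

# Code points from the table quoted above.
CANDIDATES = [
    ("LF", "\x0a"),
    ("VT", "\x0b"),
    ("FF", "\x0c"),
    ("CR", "\x0d"),
    ("NEL", "\x85"),
    ("LS", "\u2028"),
    ("PS", "\u2029"),
]

def looks_like_linebreak(ch):
    # Hypothetical rule based only on UnicodeData.txt properties:
    # bidirectional class "B" (paragraph separator) or category "Zl".
    return unicodedata.bidirectional(ch) == "B" or unicodedata.category(ch) == "Zl"

# Candidates that such a rule would not classify as line breaks.
missed = [name for name, ch in CANDIDATES if not looks_like_linebreak(ch)]

for name, ch in CANDIDATES:
    print(name, unicodedata.category(ch), unicodedata.bidirectional(ch),
          looks_like_linebreak(ch))
print("missed:", missed)
```

Under this assumed rule, VT (bidirectional class "S") and FF ("WS")
are the two code points that drop out, because their BK property
is only recorded in LineBreak.txt.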

I guess we'll have to add FF and VT to the generator makeunicodedata.py
to remedy this.
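
Whether a given interpreter already splits on these code points can
be checked directly. This is only a small probe; the helper name
splits_on() is made up for the example, and the result for VT and FF
depends on the Python version and string type in use.

```python
def splits_on(ch):
    # True if splitlines() treats ch as a line boundary between "a" and "b".
    return len(("a" + ch + "b").splitlines()) == 2

PROBES = [
    ("LF", "\x0a"),
    ("VT", "\x0b"),
    ("FF", "\x0c"),
    ("CR", "\x0d"),
    ("NEL", "\x85"),
    ("LS", "\u2028"),
    ("PS", "\u2029"),
]

for name, ch in PROBES:
    print(name, splits_on(ch))
```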

> References:
>  - Unicode Standard Annex #14 - Line Breaking Algorithm
>    http://www.unicode.org/reports/tr14/
>  - UCD LineBreak.txt
>    http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue7643>
_______________________________________