[issue7643] What is an ASCII linebreak?

Fri Jan 8 21:18:23 CET 2010

Marc-Andre Lemburg <mal at egenix.com> added the comment:

Florent Xicluna wrote:
> 
> Florent Xicluna <laxyf at yahoo.fr> added the comment:
> 
> Some technical background.
> 
> == Unicode ==
> 
> According to the Unicode Standard Annex #9, a character with
> bidirectional class B is a "Paragraph Separator". And “Because a
> Paragraph Separator breaks lines, there will be at most one per line,
> at the end of that line.”
> 
> As a consequence, there are 3 reasons to identify a character as a
> linebreak:
>  - General Category Zl "Line Separator"
>  - General Category Zp "Paragraph Separator"
>  - Bidirectional Class B "Paragraph Separator"

This definition is what we use in Python for Py_UNICODE_ISLINEBREAK(ch).

> There are 8 linebreaks in the current Unicode Database (5.2):
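> ------------------------------------------------------------------------
> Code    Abb Name                        GC  Bidi (notes)
> ------------------------------------------------------------------------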
> 000A    LF  LINE FEED                   Cc  B
> 000D    CR  CARRIAGE RETURN             Cc  B
> 001C    FS  INFORMATION SEPARATOR FOUR  Cc  B (UCD 3.1 FILE SEPARATOR)
> 001D    GS  INFORMATION SEPARATOR THREE Cc  B (UCD 3.1 GROUP SEPARATOR)
> 001E    RS  INFORMATION SEPARATOR TWO   Cc  B (UCD 3.1 RECORD SEPARATOR)
> 0085    NEL NEXT LINE                   Cc  B (C1 Control Code)
> 2028    LS  LINE SEPARATOR              Zl  WS  (Unicode)
> 2029    PS  PARAGRAPH SEPARATOR         Zp  B   (Unicode)
> ------------------------------------------------------------------------

And that's the list we're currently using.
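
A minimal sketch of how that list can be recomputed from the three
criteria quoted above, using the standard unicodedata module. The helper
name is_linebreak is only illustrative; it is not the C macro itself, and
the result depends on the Unicode database shipped with the interpreter
that runs it.

```python
import sys
import unicodedata


def is_linebreak(ch):
    """Hypothetical helper: apply the three criteria quoted above."""
    return (unicodedata.category(ch) in ("Zl", "Zp")
            or unicodedata.bidirectional(ch) == "B")


# Scan every code point known to this interpreter's Unicode database.
linebreaks = [cp for cp in range(sys.maxunicode + 1)
              if is_linebreak(chr(cp))]

for cp in linebreaks:
    ch = chr(cp)
    # Control characters have no name in the database, hence the default.
    print("%04X  %-2s  %-2s  %s" % (cp,
                                   unicodedata.category(ch),
                                   unicodedata.bidirectional(ch),
                                   unicodedata.name(ch, "<control>")))
```

With the database bundled in recent CPython releases this should print
the same eight code points as the table; a future Unicode version could
add more.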

> == ASCII ==
> 
> The standard ASCII control codes (C0) are in the range 00-1F,
> which limits the list to LF, CR, FS, GS and RS.
> The last three are not considered linebreaks:
> “The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to
> structure data, usually on a tape, in order to simulate punched cards. End of
> medium (EM) warns that the tape (or whatever) is ending. While many systems use
> CR/LF and TAB for structuring data, it is possible to encounter the separator
> control characters in data that needs to be structured. The separator control
> characters are not overloaded; there is no general use of them except to
> separate data into structured groupings. Their numeric values are contiguous
> with the space character, which can be considered a member of the group, as a
> word separator.”
> (Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)
> 
> In conclusion, it may be better to keep things unchanged.

Agreed.

> We may add some words to the documentation for str.splitlines() and
> bytes.splitlines() to explain what is considered a line break character.

For ASCII we should make the list of characters explicit.
For Unicode, we should mention the above definition and give
the table as an example list (the Unicode database may add more
such characters in the future).
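
For the documentation, a short example along these lines might help show
the difference between the two methods. This is only a sketch of the
behaviour as I understand it in CPython 3; the variable names are
illustrative, and the separators tested are just the ones discussed in
this thread.

```python
# Separators from the Unicode table discussed above.
str_seps = ["\n", "\r", "\x1c", "\x1d", "\x1e", "\x85", "\u2028", "\u2029"]
# The ASCII (C0) candidates from the same discussion.
bytes_seps = [b"\n", b"\r", b"\x1c", b"\x1d", b"\x1e"]

# For each separator, record whether splitlines() treats it as a line break.
str_splits = {sep: len(("left" + sep + "right").splitlines()) == 2
              for sep in str_seps}
bytes_splits = {sep: len((b"left" + sep + b"right").splitlines()) == 2
                for sep in bytes_seps}

for sep, did_split in str_splits.items():
    print("str  ", repr(sep), "splits" if did_split else "does not split")
for sep, did_split in bytes_splits.items():
    print("bytes", repr(sep), "splits" if did_split else "does not split")
```

The str method splits on all eight characters, while the bytes method
splits only on LF and CR and leaves FS, GS and RS in place.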

> References:
>  - The Unicode Character Database (UCD): http://www.unicode.org/ucd/
>  - UCD Property Values: http://unicode.org/reports/tr44/#Property_Values
>  - The Bidirectional Algorithm: http://www.unicode.org/reports/tr9/
>  - C0 and C1 Control Codes:
>      http://en.wikipedia.org/wiki/C0_and_C1_control_codes

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue7643>
_______________________________________