tail

Chris Angelico rosuav at gmail.com
Sun May 1 22:44:19 EDT 2022


On Mon, 2 May 2022 at 11:54, Cameron Simpson <cs at cskk.id.au> wrote:
>
> On 01May2022 23:30, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
> >Dan Stromberg <drsalists at gmail.com> writes:
> >>But what about Unicode?  Are all bytes with value 10 newlines in Unicode encodings?
> >  It seems in UTF-8, when a value is above U+007F, it will be
> >  encoded with bytes that always have their high bit set.
>
> Aye. Design feature enabling easy resync-to-char-boundary at an
> arbitrary point in the file.

Yep - and there's also a distinction between "first byte of multi-byte
character" and "continuation byte, keep scanning backwards". So you're
guaranteed to be able to resynchronize.
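
A rough sketch of that backwards scan, in case it helps. The function
name resync_utf8 is made up for illustration, and it assumes the data
is valid UTF-8 and that pos is a valid index into it:

```python
def resync_utf8(data: bytes, pos: int) -> int:
    """Return the index of the first byte of the character containing data[pos]."""
    # Continuation bytes have the bit pattern 10xxxxxx. ASCII bytes (0xxxxxxx)
    # and lead bytes (11xxxxxx) both stop the scan.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


data = "na\u00efve \u20ac5\n".encode("utf-8")  # i-diaeresis is 2 bytes, euro sign is 3
start = resync_utf8(data, 9)                   # index 9 is the last byte of the euro sign
print(start, repr(data[start:].decode("utf-8")))
```

Since every byte of a multi-byte character has its high bit set, a
byte with value 0x0A found while scanning can only be a real newline.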

(If you know whether it's little-endian or big-endian, UTF-16 can also
resync like that, since "high surrogate" and "low surrogate" look
different.)
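
Something like this would do it for UTF-16. Again just a sketch:
resync_utf16 is a hypothetical name, and besides the byte order it
assumes the buffer starts on a code-unit boundary, so that even
offsets line up with 16-bit units:

```python
def resync_utf16(data: bytes, pos: int, byteorder: str = "little") -> int:
    """Return the byte index where the character containing data[pos] starts."""
    pos -= pos % 2  # snap back to the start of the 16-bit code unit
    unit = int.from_bytes(data[pos:pos + 2], byteorder)
    # A low (trailing) surrogate is in 0xDC00..0xDFFF; its high surrogate
    # is the code unit immediately before it.
    if 0xDC00 <= unit <= 0xDFFF and pos >= 2:
        pos -= 2
    return pos


data16 = "a\U0001F600b".encode("utf-16-le")  # the emoji needs a surrogate pair
start16 = resync_utf16(data16, 5)            # index 5 is inside the low surrogate
print(start16, repr(data16[start16:].decode("utf-16-le")))
```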

> >  But Unicode has NEL "Next Line" U+0085 and other values that
> >  conforming applications should recognize as line terminators.
>
> I disagree. Maybe for printing things. But textual data records? I would
> hope to end them with NL, and only NL (code 10).
>

I'm with you on that - textual data records should end with 0x0A only.
But if there are text entities in there, they should be allowed to
include any Unicode characters, potentially including other types of
whitespace.
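
To illustrate the difference in Python: str.splitlines() treats NEL
(U+0085), U+2028 and friends as line boundaries, whereas splitting
the raw bytes on b"\n" only breaks at 0x0A. The helper name
split_records is made up for this sketch, and it assumes the data is
UTF-8:

```python
def split_records(data: bytes):
    """Split UTF-8 data into text records terminated by 0x0A only."""
    records = data.split(b"\n")
    # A trailing newline leaves one empty piece at the end; drop it.
    if records and records[-1] == b"":
        records.pop()
    return [record.decode("utf-8") for record in records]


raw = "first\u0085still first\u2028also first\nsecond\n".encode("utf-8")
print(split_records(raw))                 # two records, other whitespace kept intact
print(raw.decode("utf-8").splitlines())   # splits at NEL and U+2028 as well
```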

ChrisA

