tail

Chris Angelico rosuav at gmail.com
Sun May 1 19:41:43 EDT 2022


On Mon, 2 May 2022 at 09:19, Dan Stromberg <drsalists at gmail.com> wrote:
>
> On Sun, May 1, 2022 at 3:19 PM Cameron Simpson <cs at cskk.id.au> wrote:
>
> > On 01May2022 18:55, Marco Sulla <Marco.Sulla.Python at gmail.com> wrote:
> > >Something like this is OK?
> >
>
> Scanning backward for a byte == 10 in ASCII or ISO-8859 seems fine.
>
> But what about Unicode?  Are all 10 bytes newlines in Unicode encodings?

Most absolutely not. "Unicode" isn't an encoding, but of the Unicode
Transformation Formats and Universal Character Set encodings, most
don't make that guarantee:

* UTF-8 does, as mentioned. It sacrifices some efficiency and
consistency for a guarantee that ASCII characters are represented by
ASCII bytes, and ASCII bytes only ever represent ASCII characters.
* UCS-2 and UTF-16 will both represent BMP characters with two bytes.
Any character U+xx0A or U+0Axx will include an 0x0A in its
representation.
* UTF-16 will also encode any supplementary-plane character whose
code point ends in 0A (U+xxx0A or U+10xx0A) with an 0x0A in its low
surrogate. (And I don't think any codepoints have been allocated that
would trigger this, but UTF-16 can also use 0x0A in the high
surrogate.)
* UTF-32 and UCS-4 will use 0x0A for any character U+xx0A, U+0Axx, and
U+Axxxx (though that plane has no characters on it either).

So, of all the available Unicode standard encodings, only UTF-8 makes
this guarantee.
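
A quick way to see this for yourself (just a sketch; the three sample
characters are my own arbitrary picks, and none of them is a newline):

```python
# Sample characters chosen because their code points contain 0A somewhere.
samples = {
    "U+010A": "\u010a",       # LATIN CAPITAL LETTER C WITH DOT ABOVE
    "U+0A05": "\u0a05",       # GURMUKHI LETTER A
    "U+1F60A": "\U0001f60a",  # a supplementary-plane character
}

# Explicit byte orders are used so that no BOM is added to the output.
encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]


def has_stray_lf(text, encoding):
    """Return True if the encoded bytes contain 0x0A.

    The text passed in contains no newline, so any 0x0A found here is
    part of some other character's representation.
    """
    return b"\n" in text.encode(encoding)


results = {
    (name, enc): has_stray_lf(char, enc)
    for name, char in samples.items()
    for enc in encodings
}

for (name, enc), stray in sorted(results.items()):
    print(f"{name:8} {enc:10} stray 0x0A: {stray}")
```

Every UTF-16 and UTF-32 row reports a stray 0x0A, and none of the UTF-8
rows do.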

Of course, if you look at documents available on the internet, UTF-8
is the encoding used by the vast majority of them (especially if you
include seven-bit files, which can equally be considered ASCII,
ISO-8859-x, and UTF-8), so while it might only be one encoding out of
many, it's probably the most important :)

In general, you can *only* make this parsing assumption IF you know
for sure that your file's encoding is UTF-8, ISO-8859-x, some OEM
eight-bit encoding (e.g. Windows-125x), or one of a handful of other
compatible encodings. But it probably will be.
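
For what it's worth, here is roughly what the backward scan looks like
when that assumption holds. This is only a sketch, not the code posted
earlier in the thread; the name tail_bytes and the block_size parameter
are made up for illustration. It works on raw bytes, so the caller
decodes the lines afterwards.

```python
import os


def tail_bytes(path, n, block_size=4096):
    """Return the last n lines of a file as bytes objects.

    Only valid for encodings where byte 0x0A always means a newline
    (UTF-8, ISO-8859-x and similar). Do not use it on UTF-16 or UTF-32.
    """
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        # Read blocks from the end until more than n newlines have been
        # seen (so the first kept line is complete) or the start of the
        # file is reached.
        while pos > 0 and data.count(b"\n") <= n:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
    return data.splitlines(keepends=True)[-n:]
```

Usage would be something like
[line.decode("utf-8") for line in tail_bytes("some.log", 10)], where
"some.log" is a placeholder path. Blocks are joined before splitting, so
a multi-byte character that straddles a block boundary is not a problem.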

> If not, and you have a huge file to reverse, it might be better to use a
> temporary file.

Yeah, or an in-memory deque if you know how many lines you want.
Either way, you can read the file forwards, guaranteeing correct
decoding even of a shifted character set (where a byte value can
change in meaning based on arbitrarily distant context).
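
The deque version is about as short as it gets. A sketch, with
tail_lines being a name I've made up and the encoding default an
assumption you'd adjust to suit your file:

```python
from collections import deque


def tail_lines(path, n, encoding="utf-8"):
    """Return the last n decoded lines of a text file.

    The file is read forwards in text mode, so the codec handles line
    boundaries correctly whatever the encoding is. The deque discards
    older lines as it goes, so memory use is bounded by n lines.
    """
    with open(path, encoding=encoding) as f:
        return list(deque(f, maxlen=n))
```

The cost is that the whole file gets read and decoded, but it works
just as well for UTF-16 as for UTF-8.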

ChrisA

