tail

Sat May 7 16:12:01 EDT 2022

On Sun, 8 May 2022 at 04:37, Marco Sulla <Marco.Sulla.Python at gmail.com> wrote:
>
> On Sat, 7 May 2022 at 19:02, MRAB <python at mrabarnett.plus.com> wrote:
> >
> > On 2022-05-07 17:28, Marco Sulla wrote:
> > > On Sat, 7 May 2022 at 16:08, Barry <barry at barrys-emacs.org> wrote:
> > >> You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.
> > >
> > > >>> "\n".encode("utf-16")
> > > b'\xff\xfe\n\x00'
> > > >>> "".encode("utf-16")
> > > b'\xff\xfe'
> > > >>> "a\nb".encode("utf-16")
> > > b'\xff\xfea\x00\n\x00b\x00'
> > > >>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
> > > b'\n\x00'
> > >
> > > Can I use the last trick to get the encoding of a LF or a CR in any encoding?
> >
> > In the case of UTF-16, it's 2 bytes per code unit, but those 2 bytes
> > could be little-endian or big-endian.
> >
> > As you didn't specify which you wanted, it defaulted to little-endian
> > and added a BOM (U+FEFF).
> >
> > If you specify which endianness you want with "utf-16le" or "utf-16be",
> > it won't add the BOM:
> >
> >  >>> # Little-endian.
> >  >>> "\n".encode("utf-16le")
> > b'\n\x00'
> >  >>> # Big-endian.
> >  >>> "\n".encode("utf-16be")
> > b'\x00\n'
>
> Well, ok, but I need a generic method to get LF and CR for any
> encoding a user can input.
> Do you think that
>
> "\n".encode(encoding).lstrip("".encode(encoding))
>
> is good for any encoding?

No, because it is only useful for stateless encodings. Any encoding
which uses "shift bytes" that cause subsequent bytes to be interpreted
differently will simply not work with this naive technique. Also,
you're assuming that the byte(s) you get from encoding LF will *only*
represent LF, which is also not true for a number of other encodings -
they might always encode LF to the same byte sequence, but could use
that same byte sequence as part of a multi-byte encoding. So, no, for
arbitrarily chosen encodings, this is not dependable.
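
A minimal sketch of both failure modes. The encodings (UTF-16-LE and
UTF-7) and the sample strings are picked purely for illustration and are
not from the thread above; other encodings fail in their own ways.

```python
# Sketch: searching raw bytes for the "encoded LF" can give wrong answers.

# 1. UTF-16-LE: the LF byte pair can also appear across a character boundary.
lf = "\n".encode("utf-16-le")        # b'\n\x00'
text = "\u0a41\u0100"                # two characters, neither is a newline
data = text.encode("utf-16-le")      # b'A\n\x00\x01'
false_hit = data.find(lf)            # found at an odd offset, straddling both characters

# 2. UTF-7 (stateful): inside a shifted "+...-" run, LF can be spelled
#    without the byte b'\n' appearing anywhere.
shifted = b"+AAo-"
decoded = shifted.decode("utf-7")    # a single LF character
missed = b"\n" in shifted            # a plain byte search sees no newline
```

In the first case the byte search reports a newline that is not there;
in the second it misses one that is.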

> Furthermore, is there a way to get the
> encoding of an opened file object?

Nope. That's fundamentally not possible. Unless you mean in the
trivial sense of "what was the parameter passed to the open() call?",
in which case f.encoding will give it to you; but to find out the
actual encoding, no, you can't.
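
To illustrate the "trivial sense": in the sketch below (the temp file and
sample text are made up for the example), f.encoding only echoes what
open() was told, even when the bytes on disk are something else.

```python
import os
import tempfile

# Write UTF-16 bytes, then open the file while claiming it is Latin-1.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as raw:
    raw.write("héllo\n".encode("utf-16"))

with open(path, encoding="latin-1") as f:
    reported = f.encoding    # reflects the open() argument, not the file's bytes
    content = f.read()       # decodes without error, but not to the original text

os.remove(path)
```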

The ONLY way to 100% reliably decode arbitrary text is to know, from
external information, what encoding it is in. Every other scheme
imposes restrictions. Trying to do something that works for absolutely
any encoding is a doomed project.
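
A quick illustration of why the bytes alone cannot settle it (the byte
string and the two codecs are arbitrary choices for the example): the
same bytes are valid, but different, text under two encodings.

```python
# The same three bytes decode cleanly under both codecs, to different text.
data = b"\xe9t\xe9"
as_latin1 = data.decode("latin-1")   # Latin letters with acute accents
as_cp1251 = data.decode("cp1251")    # Cyrillic letters around a Latin 't'
```

Neither result raises an error, so nothing in the data says which one
was intended.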

ChrisA