tail

Sat May 7 17:31:48 EDT 2022

On Sun, 8 May 2022 at 07:19, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
>
> MRAB <python at mrabarnett.plus.com> writes:
> >On 2022-05-07 19:47, Stefan Ram wrote:
> ...
> >>def encoding( name ):
> >>    path = pathlib.Path( name )
> >>    for encoding in( "utf_8", "latin_1", "cp1252" ):
> >>        try:
> >>            with path.open( encoding=encoding, errors="strict" )as file:
> >>                text = file.read()
> >>            return encoding
> >>        except UnicodeDecodeError:
> >>            pass
> >>    return "ascii"
> >>Yes, it's potentially slow and might be wrong.
> >>The result "ascii" might mean it's a binary file.
> >"latin-1" will decode any sequence of bytes, so it'll never try
> >"cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
> >anyway because the file could contain 0x80..0xFF, which aren't supported
> >by that encoding.
>
>   Thank you! It's working for my specific application where
>   I'm reading from a collection of text files that should be
>   encoded in either utf_8, latin_1, or ascii.
>

In that case, I'd exclude ASCII from the check, and just check UTF-8,
and if that fails, decode as Latin-1. Any ASCII files will decode
correctly as UTF-8, and any file will decode as Latin-1.
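
For what it's worth, a minimal sketch of that two-step check might look
like the following. The function name read_text_with_fallback is made up
for illustration, and it assumes each file is small enough to read into
memory in one go.

```python
from pathlib import Path


def read_text_with_fallback(name):
    """Return (text, encoding_used) for the file at `name`.

    Hypothetical helper: tries strict UTF-8 first, then Latin-1.
    """
    data = Path(name).read_bytes()
    try:
        # Pure-ASCII files are valid UTF-8, so no separate ASCII check.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte 0x00..0xFF to a character, so this
        # decode cannot fail (though the result may not be what the
        # author of the file intended).
        return data.decode("latin-1"), "latin-1"
```

Reading the bytes once and decoding twice avoids reopening the file for
each candidate encoding.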

I've used this exact fallback system when decoding raw data from
Unicode-naive servers - they accept and share bytes, so it's entirely
possible to have a mix of encodings in a single stream. As long as you
can define the span of a single "unit" (say, a line, or a chunk in
some form), you can read as bytes and do the exact same "decode as
UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
perfectly ideal, but it's about as good as you'll get with a lot of
US-based servers. (Depending on context, you might use CP-1252 instead
of Latin-1, but you might need errors="replace" there, since Python's
cp1252 codec leaves five byte values undefined: 0x81, 0x8D, 0x8F, 0x90,
and 0x9D.)
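
As a rough sketch of the per-unit version, assuming the "unit" is a
newline-delimited line and the whole stream is already in memory as
bytes (decode_unit and decode_lines are hypothetical names, not from
any library):

```python
def decode_unit(raw, fallback="latin-1"):
    """Decode one unit of bytes: strict UTF-8 first, then the fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" is a no-op for Latin-1, but with "cp1252" it
        # turns the undefined byte values into U+FFFD instead of raising.
        return raw.decode(fallback, errors="replace")


def decode_lines(stream_bytes, fallback="latin-1"):
    """Split on b"\\n" and decode each line independently."""
    return [decode_unit(line, fallback) for line in stream_bytes.split(b"\n")]
```

Because each line is decoded on its own, one Latin-1 line in the middle
of an otherwise UTF-8 stream only affects that line.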

ChrisA