tail

Sun May 8 14:15:09 EDT 2022

> On 7 May 2022, at 22:31, Chris Angelico <rosuav at gmail.com> wrote:
> 
> On Sun, 8 May 2022 at 07:19, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
>> 
>> MRAB <python at mrabarnett.plus.com> writes:
>>> On 2022-05-07 19:47, Stefan Ram wrote:
>> ...
>>>> def encoding( name ):
>>>>   path = pathlib.Path( name )
>>>>   for encoding in( "utf_8", "latin_1", "cp1252" ):
>>>>       try:
>>>>           with path.open( encoding=encoding, errors="strict" )as file:
>>>>               text = file.read()
>>>>           return encoding
>>>>       except UnicodeDecodeError:
>>>>           pass
>>>>   return "ascii"
>>>> Yes, it's potentially slow and might be wrong.
>>>> The result "ascii" might mean it's a binary file.
>>> "latin-1" will decode any sequence of bytes, so it'll never try
>>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
>>> anyway because the file could contain 0x80..0xFF, which aren't supported
>>> by that encoding.
>> 
>>  Thank you! It's working for my specific application where
>>  I'm reading from a collection of text files that should be
>>  encoded in either utf_8, latin_1, or ascii.
>> 
> 
> In that case, I'd exclude ASCII from the check, and just check UTF-8,
> and if that fails, decode as Latin-1. Any ASCII files will decode
> correctly as UTF-8, and any file will decode as Latin-1.
> 
> I've used this exact fallback system when decoding raw data from
> Unicode-naive servers - they accept and share bytes, so it's entirely
> possible to have a mix of encodings in a single stream. As long as you
> can define the span of a single "unit" (say, a line, or a chunk in
> some form), you can read as bytes and do the exact same "decode as
> UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> perfectly ideal, but it's about as good as you'll get with a lot of
> US-based servers. (Depending on context, you might use CP-1252 instead
> of Latin-1, but you might need errors="replace" there, since
> Windows-1252 has some undefined byte values.)

There is a very common error on Windows that files and especially web pages that
claim to be utf-8 are in fact CP-1252.

There is logic in the HTML standards to try utf-8 and if it fails fall back to CP-1252.

Its usually the left and "smart" quote chars that cause the issue as they code
as an invalid utf-8.

Barry

> 
> ChrisA
> -- 
> https://mail.python.org/mailman/listinfo/python-list
>