tail

MRAB python at mrabarnett.plus.com
Sun May 8 14:50:21 EDT 2022


On 2022-05-08 19:15, Barry Scott wrote:
> 
> 
>> On 7 May 2022, at 22:31, Chris Angelico <rosuav at gmail.com> wrote:
>> 
>> On Sun, 8 May 2022 at 07:19, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
>>> 
>>> MRAB <python at mrabarnett.plus.com> writes:
>>>> On 2022-05-07 19:47, Stefan Ram wrote:
>>> ...
>>>>> def encoding( name ):
>>>>>   path = pathlib.Path( name )
>>>>>   for encoding in( "utf_8", "latin_1", "cp1252" ):
>>>>>       try:
>>>>>           with path.open( encoding=encoding, errors="strict" )as file:
>>>>>               text = file.read()
>>>>>           return encoding
>>>>>       except UnicodeDecodeError:
>>>>>           pass
>>>>>   return "ascii"
>>>>> Yes, it's potentially slow and might be wrong.
>>>>> The result "ascii" might mean it's a binary file.
>>>> "latin-1" will decode any sequence of bytes, so it'll never try
>>>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
>>>> anyway because the file could contain 0x80..0xFF, which aren't supported
>>>> by that encoding.
>>> 
>>>  Thank you! It's working for my specific application where
>>>  I'm reading from a collection of text files that should be
>>>  encoded in either utf_8, latin_1, or ascii.
>>> 
>> 
>> In that case, I'd exclude ASCII from the check, and just check UTF-8,
>> and if that fails, decode as Latin-1. Any ASCII files will decode
>> correctly as UTF-8, and any file will decode as Latin-1.
>> 
>> I've used this exact fallback system when decoding raw data from
>> Unicode-naive servers - they accept and share bytes, so it's entirely
>> possible to have a mix of encodings in a single stream. As long as you
>> can define the span of a single "unit" (say, a line, or a chunk in
>> some form), you can read as bytes and do the exact same "decode as
>> UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
>> perfectly ideal, but it's about as good as you'll get with a lot of
>> US-based servers. (Depending on context, you might use CP-1252 instead
>> of Latin-1, but you might need errors="replace" there, since
>> Windows-1252 has some undefined byte values.)
> 
> A very common error on Windows is that files, and especially web pages, that
> claim to be UTF-8 are in fact CP-1252.
> 
> There is logic in the HTML standards to try UTF-8 and, if that fails, fall back to CP-1252.
> 
> It's usually the left and right "smart" quote chars that cause the issue, as
> their CP-1252 bytes are invalid UTF-8.
> 
Is it CP-1252 or ISO-8859-1 (Latin-1)?
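
The distinction matters for exactly those quote characters: in CP-1252 the
bytes 0x93 and 0x94 are the left and right double quotes, whereas in Latin-1
they are C1 control characters. A rough sketch of the "UTF-8 if possible,
otherwise a single-byte encoding" fallback discussed above, working on bytes
rather than on an open file. The function name decode_with_fallback is made
up for illustration, and trying CP-1252 before Latin-1 is an assumption about
what the data is likely to contain, not something the thread settled:

```python
def decode_with_fallback(data: bytes):
    """Return (text, encoding_used) for one chunk of bytes.

    Hypothetical helper: tries strict UTF-8 first, then strict CP-1252,
    and finally Latin-1, which accepts every possible byte value.
    """
    # Pure ASCII is valid UTF-8, so no separate ASCII check is needed.
    try:
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
    # undefined, so a strict decode can still fail here.
    try:
        return data.decode("cp1252", errors="strict"), "cp1252"
    except UnicodeDecodeError:
        pass

    # Latin-1 maps each byte 0x00..0xFF to the code point of the same
    # value, so this last step cannot raise.
    return data.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    samples = [
        b"plain ASCII",
        "caf\u00e9".encode("utf-8"),
        b"\x93smart quotes\x94",   # CP-1252 quotes, invalid as UTF-8
        b"\x81",                   # undefined in CP-1252
    ]
    for raw in samples:
        text, used = decode_with_fallback(raw)
        print(f"{raw!r} -> {text!r} via {used}")
```

Note that any byte sequence that happens to be valid UTF-8 will be reported
as UTF-8, so this is a heuristic and not a reliable detector.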

