UTF-8 and latin1

Wed Aug 17 20:20:53 EDT 2022

On 2022-08-17, Barry <barry at barrys-emacs.org> wrote:
>> On 17 Aug 2022, at 18:30, Jon Ribbens via Python-list <python-list at python.org> wrote:
>> On 2022-08-17, Tobiah <toby at tobiah.org> wrote:
>>> I get data from various sources; client emails, spreadsheets, and
>>> data from web applications.  I find that I can do some_string.decode('latin1')
>>> to get unicode that I can use with xlsxwriter,
>>> or put <meta charset="latin1"> in the header of a web page to display
>>> European characters correctly.  But normally UTF-8 is recommended as
>>> the encoding to use today.  latin1 works correctly more often when I
>>> am using data from the wild.  It's frustrating that I have to play
>>> a guessing game to figure out how to use incoming text.   I'm just wondering
>>> if there are any thoughts.  What if we just globally decided to use utf-8?
>>> Could that ever happen?
>> 
>> That has already been decided, as much as it ever can be. UTF-8 is
>> essentially always the correct encoding to use on output, and almost
>> always the correct encoding to assume on input absent any explicit
>> indication of another encoding. (e.g. the HTML "standard" says that
>> all HTML files must be UTF-8.)
>> 
>> If you are finding that your specific sources are often encoded with
>> latin-1 instead then you could always try something like:
>> 
>>    try:
>>        text = data.decode('utf-8')
>>    except UnicodeDecodeError:
>>        text = data.decode('latin-1')
>> 
>> (I think latin-1 text will almost always fail to be decoded as utf-8,
>> so this would work fairly reliably assuming those are the only two
>> encodings you see.)
>
> Only if a reserved byte is used in the string.
> It will often work in either.

Because it's actually ASCII and hence there's no difference between
interpreting it as utf-8 or iso-8859-1? In which case, who cares?
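
To make that concrete, here is a minimal sketch of the fallback quoted
above, wrapped in a helper. The name decode_guess and the sample byte
strings are mine, purely for illustration. It shows that pure-ASCII
input decodes identically either way, that typical latin-1 text is
rejected by the utf-8 decoder, and the rare ambiguous case where latin-1
bytes happen to form valid utf-8.

```python
def decode_guess(data: bytes) -> str:
    """Try utf-8 first, then fall back to latin-1 (hypothetical helper)."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this cannot fail.
        return data.decode('latin-1')


# Pure ASCII: both interpretations give the same string.
ascii_bytes = b'plain text'
assert ascii_bytes.decode('utf-8') == ascii_bytes.decode('latin-1')

# Typical latin-1 text: 0xe9 followed by nothing is not valid utf-8,
# so the fallback branch is taken.
latin1_bytes = 'caf\u00e9'.encode('latin-1')   # b'caf\xe9'
assert decode_guess(latin1_bytes) == 'caf\u00e9'

# Genuine utf-8 is decoded by the first branch.
utf8_bytes = 'caf\u00e9'.encode('utf-8')       # b'caf\xc3\xa9'
assert decode_guess(utf8_bytes) == 'caf\u00e9'

# The rare ambiguous case: the latin-1 encoding of the two characters
# U+00C3 U+00A9 is also valid utf-8, so it is read as a single U+00E9.
ambiguous = '\u00c3\u00a9'.encode('latin-1')   # b'\xc3\xa9'
assert decode_guess(ambiguous) == '\u00e9'
```

The last case is the only kind of input where the guess can go wrong,
and it needs a sequence of accented characters that is unlikely in real
latin-1 text.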

> For web pages it cannot be assumed that markup saying it’s utf-8 is
> correct. Many pages are in fact cp1252. Usually you find out because
> of a smart quote that is 0xa0 in cp1252 and illegal in utf-8.

Hence what I said above. But if a source explicitly states an encoding
and it's false then these days I see little need for sympathy.
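
If you do want to be forgiving about mislabelled pages, the same
approach extends to cp1252. This is only a sketch; the helper name
decode_web and the sample bytes are made up for the example. Note that
in Python's cp1252 codec the curly double quotes are bytes 0x93 and
0x94 (0xa0 is a no-break space), and that a handful of bytes such as
0x81 are undefined in that codec, so latin-1 is kept as a last resort
that always succeeds.

```python
def decode_web(data: bytes) -> str:
    """Try utf-8, then cp1252, then latin-1 (hypothetical helper)."""
    for encoding in ('utf-8', 'cp1252'):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 accepts every byte value, so this final step cannot fail.
    return data.decode('latin-1')


# cp1252 curly quotes: 0x93 and 0x94 are bare continuation bytes in
# utf-8, so the utf-8 attempt fails and cp1252 is used instead.
quoted = b'\x93hi\x94'
assert decode_web(quoted) == '\u201chi\u201d'

# Correctly encoded utf-8 is still decoded by the first attempt.
assert decode_web('\u201chi\u201d'.encode('utf-8')) == '\u201chi\u201d'

# 0x81 is invalid as utf-8 and undefined in Python's cp1252 codec,
# so only the latin-1 step accepts it (as a C1 control character).
assert decode_web(b'\x81') == '\x81'
```

Trying utf-8 first matters: it is the strictest of the three, so a
successful utf-8 decode is good evidence that the data really is utf-8.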