UTF-8 and latin1

Wed Aug 17 11:48:53 EDT 2022

On 2022-08-17, Tobiah <toby at tobiah.org> wrote:
> I get data from various sources; client emails, spreadsheets, and
> data from web applications.  I find that I can do some_string.decode('latin1')
> to get unicode that I can use with xlsxwriter,
> or put <meta charset="latin1"> in the header of a web page to display
> European characters correctly.  But normally UTF-8 is recommended as
> the encoding to use today.  latin1 works correctly more often when I
> am using data from the wild.  It's frustrating that I have to play
> a guessing game to figure out how to use incoming text.   I'm just wondering
> if there are any thoughts.  What if we just globally decided to use utf-8?
> Could that ever happen?

That has already been decided, as much as it ever can be. UTF-8 is
essentially always the correct encoding to use on output, and almost
always the correct encoding to assume on input absent any explicit
indication of another encoding. (For example, the HTML "standard" says
that all HTML files must be UTF-8.)

If you are finding that your specific sources are often encoded with
latin-1 instead, then you could always try something like:

    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to latin-1, which accepts any byte.
        text = data.decode('latin-1')

(I think latin-1 text that contains any non-ASCII characters will almost
always fail to be decoded as utf-8, so this would work fairly reliably
assuming those are the only two encodings you see. Pure-ASCII text
decodes identically either way.)
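
To make that concrete, here is a minimal sketch of the same fallback
wrapped in a function, assuming Python 3 and that the input is a bytes
object. The name decode_with_fallback is made up for illustration; it
also reports which encoding was used, so you can see how often each
case comes up in your data.

```python
def decode_with_fallback(data: bytes) -> tuple:
    """Return (text, encoding_used), trying UTF-8 first, then latin-1."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this never raises.
        return data.decode("latin-1"), "latin-1"


utf8_bytes = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9'

print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(latin1_bytes))
```

The "almost always" matters: a latin-1 byte sequence that happens to
form valid UTF-8 (for example b'\xc3\xa9', which is two characters in
latin-1) will be decoded as UTF-8 without any error.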

Or you could use something fancy like https://pypi.org/project/chardet/