UTF-8 and latin1

Wed Aug 17 16:53:17 EDT 2022

On 18/08/2022 03.33, Stefan Ram wrote:
> Tobiah <toby at tobiah.org> writes:
>> I get data from various sources; client emails, spreadsheets, and
>> data from web applications.  I find that I can do some_string.decode('latin1')
> 
>   Strings have no "decode" method. ("bytes" objects do.)
> 
>> to get unicode that I can use with xlsxwriter,
>> or put <meta charset="latin1"> in the header of a web page to display
>> European characters correctly.
> 
> |You should always use the UTF-8 character encoding. (Remember
> |that this means you also need to save your content as UTF-8.)
> World Wide Web Consortium (W3C) (2014)
> 
>> am using data from the wild.  It's frustrating that I have to play
>> a guessing game to figure out how to use incoming text.   I'm just wondering
> 
>   You can let Python guess the encoding of a file.
> 
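> import pathlib
>
> def encoding_of( name ):
>     """Return the first encoding that decodes the whole file, else None."""
>     path = pathlib.Path( name )
>     # latin_1 accepts every byte value, so it has to be tried last.
>     for encoding in ( "utf_8", "cp1252", "latin_1" ):
>         try:
>             with path.open( encoding=encoding, errors="strict" ) as file:
>                 file.read()
>             return encoding
>         except UnicodeDecodeError:
>             pass
>     return None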
> 
>> if there are any thoughts.  What if we just globally decided to use utf-8?
>> Could that ever happen?
> 
>   That decision was made long ago.

Unfortunately, much of our data was collected long before then, and as
we've discovered, the OP is still living in the Python 2 era.

What if the path "name" (above) is itself not in UTF-8?
E.g. the OP's Montréal in Latin-1, as Montréal.txt or Montréal.rpt.
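
A rough sketch of one way to look at that, not a definitive answer: treat
the file name as raw bytes and apply the same try-each-codec idea to it,
but only to obtain something printable. The helpers guess_encoding() and
display_name() are made-up names for illustration, and the candidate list
is simply the one from the quoted function.

```python
import os

# Same candidate order as the quoted function; latin_1 never fails, so it is last.
CANDIDATES = ("utf_8", "cp1252", "latin_1")

def guess_encoding(data, candidates=CANDIDATES):
    """Return the first candidate codec that decodes `data` strictly, else None."""
    for encoding in candidates:
        try:
            data.decode(encoding, errors="strict")
            return encoding
        except UnicodeDecodeError:
            continue
    return None

def display_name(raw_name):
    """Decode a raw (bytes) file name for display only.

    Hypothetical helper: keep `raw_name` itself for opening the file.
    """
    encoding = guess_encoding(raw_name)
    if encoding is None:
        # Fall back to Python's own file-name decoding.
        return os.fsdecode(raw_name)
    return raw_name.decode(encoding)

# "Montr\u00e9al.txt" is Montréal.txt, written with an escape to keep this ASCII.
latin1_name = "Montr\u00e9al.txt".encode("latin_1")  # one byte for the e-acute
utf8_name = "Montr\u00e9al.txt".encode("utf_8")      # two bytes for the e-acute

print(guess_encoding(latin1_name), display_name(latin1_name))
print(guess_encoding(utf8_name), display_name(utf8_name))
```

Note that the Latin-1 name is reported as cp1252, because that codec is
tried first and the two agree on this byte. On a POSIX system the raw
bytes name can be passed straight to open(), and os.listdir(b".") returns
names as bytes, so the guess is only ever needed for showing the name to
a human.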
-- 
Regards,
=dn