Newbie question about text encoding

Laura Creighton lac at openend.se
Tue Feb 24 10:24:00 EST 2015


In a message of Wed, 25 Feb 2015 02:10:42 +1100, Chris Angelico writes:
>On Wed, Feb 25, 2015 at 2:07 AM, Laura Creighton <lac at openend.se> wrote:
>>>Can you be sure it's Latin-1? I'm not certain of that. In any case, I
>>>never advocate fixing encoding problems by "just do this and it'll all
>>>go away"; you have to understand your data before you can decode it.
>>>
>>>ChrisA
>>
>> I can, I speak French and I recognise the data.  It's French place names,
>> places where sporting events are held. :)
>
>Ah, okay. :) But even with that level of confidence, you still have to
>pick between Latin-1 and CP-1252, which you can't tell based on this
>one snippet. Welcome to untagged encodings.
>
>ChrisA

Ah, yes, you are right about that.  I see CP-1252 about twice every 10
years, and Latin-1 every minute of my life, so I am biased to assume I
know what I am seeing.
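
To make the Latin-1 versus CP-1252 point concrete, here is a minimal
sketch, assuming Python 3.  The two encodings differ only for bytes in
the range 0x80-0x9F, so a snippet with no bytes in that range decodes
identically under both and cannot settle the question.  The helper name
and the byte strings are invented for illustration; they are not the
data discussed in this thread.

```python
def latin1_and_cp1252_agree(data: bytes) -> bool:
    """Return True when no byte falls in 0x80-0x9F.

    That range is the only place where Latin-1 and CP-1252 assign
    different characters, so outside it both decodings are identical.
    """
    return not any(0x80 <= b <= 0x9F for b in data)


# Hypothetical sample bytes (not taken from the original data).
etienne = b"Saint-\xc9tienne"           # 0xC9 is outside the ambiguous range
vandoeuvre = b"Vand\x9cuvre-l\xe8s-Nancy"  # 0x9C is inside it

print(latin1_and_cp1252_agree(etienne))     # True: both codecs give the same text
print(etienne.decode("latin-1"))            # Saint-Étienne

print(latin1_and_cp1252_agree(vandoeuvre))  # False: the codecs disagree here
print(vandoeuvre.decode("cp1252"))          # Vandœuvre-lès-Nancy
# Under Latin-1 the same byte becomes an invisible control character.
print(vandoeuvre.decode("latin-1") == vandoeuvre.decode("cp1252"))  # False
```

A check like this only says whether the choice matters for a given
snippet; it does not say which encoding the sender actually used.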

ChrisA, you come from an English speaking country, right?

For those of us who come from countries whose language doesn't fit in
ASCII, the notion of 'understand the data' doesn't work very well.  We
already understand the data -- it's a set of words in our native language.
The hard part isn't understanding the data, but rather understanding how
the hell Python could be so stupid as to not understand it. :)  The
notion that Python normally only understands the subset of the
characters in your native language that English speakers use in their
language is not the most obvious thing.
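
A minimal sketch of what that looks like in practice, assuming Python 3
and a made-up place name stored as Latin-1 bytes (not the data from this
thread).  Decoding as ASCII fails on the first accented letter, while
naming the right encoding just works:

```python
# Hypothetical input: a French place name as Latin-1 encoded bytes.
raw = "Besançon".encode("latin-1")   # b'Besan\xe7on'

# ASCII only covers bytes 0x00-0x7F, so the byte for "ç" is rejected.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Telling Python which encoding the bytes use recovers the name.
decoded = raw.decode("latin-1")

print(ascii_ok)   # False
print(decoded)    # Besançon
```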

And having taught countless European kids how to write their very first
program in Python, I can tell you for certain that a deep understanding
of encodings is not what 10-year-olds want.  They just want to print out
the names of their friends, their favourite music titles, and their
favourite musicians. :)

Laura


