Newbie question about text encoding

Laura Creighton lac at openend.se
Tue Feb 24 11:20:41 EST 2015


In a message of Wed, 25 Feb 2015 02:33:30 +1100, Chris Angelico writes:
>Also a reasonable baseline assumption; but the trouble is that if you
>automatically assume that text is encoded in your favourite eight-bit
>system, you're taking a huge risk.

But, you know, I wasn't assuming this.  I actually read latin1.  I
could read it in ascii, know that \xe9 means 'é', a letter that we
also have in Swedish, so I am rather used to reading it.  So I could
read all of his strings, know they were in French, and know that
latin1 was what he needed them to be decoded as.
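
To make that concrete, here is a minimal sketch, assuming Python 3 and
a made-up place name (the OP's actual strings are not reproduced
here).  It only shows what decoding a byte string as latin1 does:

```python
# Hypothetical byte string standing in for one of the OP's place names.
raw = b"Stade de la Lib\xe9ration"

# Latin-1 maps each byte 0x00-0xFF to the code point with the same number,
# so the byte 0xE9 becomes U+00E9, which is the letter e with acute accent.
name = raw.decode("latin-1")

print(name)
```

Decoding as latin1 never raises an error, because every byte value is
valid in it.  That is convenient, but it also means the result is only
right when the text really is latin1.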

>Now, you have a huge leg up on me, in that you actually recognize the
>*words* in that piece of text. That means you can have MUCH greater
>confidence in stating that it's Latin-1 than I can. But that's
>precisely what I mean by "understand the data". If you, being a native
>French speaker, pick up a file written in (say) Polish, and encoded
>Latin-2, you'll recognize by the ASCII characters that it's not French
>text, and probably you'd be able to spot that it ought to be Latin-2
>rather than Latin-1. That's understanding the data, that's having more
>information than just the byte patterns. A computer can't reliably do
>that (just look up the "Bush hid the facts" bug if you don't believe
>me), but a human often can.

Absolutely correct.  But you must not require that all of the speakers
of non-English languages think about their languages as 'special
encodings'.  Only the monoglot ever think of a foreign language as
a code.
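
The Latin-1 versus Latin-2 ambiguity is easy to see in a few lines.
This is only a sketch, assuming Python 3; the byte string is an
invented example, not anything from the OP's data:

```python
# The same four bytes, read under two different eight-bit encodings.
raw = b"\xb3\xf3d\xbc"

as_latin2 = raw.decode("iso-8859-2")  # Central European (e.g. Polish) reading
as_latin1 = raw.decode("iso-8859-1")  # Western European reading

# Both calls succeed; nothing in the bytes says which reading is intended.
print(as_latin2)
print(as_latin1)
```

Only the Latin-2 reading gives a Polish word.  A person who knows the
language sees that at once; the bytes alone do not tell a program.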

That poor guy, the original poster, just wants to have a nice string
holding the place name of his sporting event.  We should tell him how
to get that,
not how to be an expert in all the encodings on the face of this earth.
Chances are, the only thing he needs to talk about are French words.

If not, well, he will come back when things stop working, and have lots
more data to give us.  If, instead, this makes him go away happy, then
this was the very best thing to do.

>> And having taught countless European kids how to write their very first
>> program in Python, I can tell you for certain that the sort of deep
>> understanding of encoding methods is not what 10 year olds who just
>> want to print out the names of their friends, and their favourite
>> music titles, and their favourite musicians want to know. :)
>
>Right, so you should be teaching them to use Python 3, and always
>saving everything in UTF-8, and basically ignoring the whole mess of
>eight-bit encodings :)

Of course this makes sense.  But you seem to be missing the point.
People who are asking for help in getting things to work in their
native language need a 'do this quick' sort of answer.  The deeper
problems of supporting all languages and language encodings can very
much wait.  The OP has a hunk of bytes that happens to mean
something in French but is not encodable in ASCII, the limited
English character set, and wants it to work just like a different
hunk of bytes that means something in French but is encodable.

Don't overburden them.

>ChrisA

Laura



