Newbie question about text encoding

Chris Angelico rosuav at gmail.com
Tue Feb 24 10:33:30 EST 2015


On Wed, Feb 25, 2015 at 2:24 AM, Laura Creighton <lac at openend.se> wrote:
> Ah, yes, you are right about that.  I see CP-1252 about 2 times every 10
> years, and latin1 every minute of my life, so I am biased to assume I
> know what I am seeing.

Fair enough. CP-1252 is still a possibility, but the difference can be
dealt with later.
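
As a rough illustration of that difference: Latin-1 and CP-1252 agree
everywhere except the byte range 0x80-0x9F, where Latin-1 has control
characters and CP-1252 has printable ones such as curly quotes. The
sample bytes below are made up for the example, not taken from the
original poster's file.

```python
# Hypothetical sample: two bytes in the 0x80-0x9F range plus one
# accented letter that both encodings agree on.
data = b"\x93quoted\x94 caf\xe9"

as_latin1 = data.decode("latin-1")   # 0x93/0x94 become C1 control characters
as_cp1252 = data.decode("cp1252")    # 0x93/0x94 become curly double quotes

# Both decodes succeed; only the characters from 0x80-0x9F differ.
same_tail = as_latin1[-4:] == as_cp1252[-4:]   # "café" in both
differs = as_latin1 != as_cp1252
```

If the decoded text shows invisible control characters where quotes or
dashes ought to be, that is a hint the data was really CP-1252.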

> ChrisA, you come from an English speaking country, right?

Yes (Australia, to be specific).

> For those of us who come from countries whose language doesn't fit in
> ASCII, the notion of 'understand the data' doesn't work very well.  We
> already understand the data -- it's a set of words in our native language.
> The hard part isn't understanding the data, but rather understanding how
> the hell Python could be so stupid as to not understand it. :)  The
> notion that Python normally only understands the subset of the
> characters in your native language that English speakers use in their
> language is not the most obvious thing.

Also a reasonable baseline assumption; but the trouble is that if you
automatically assume that text is encoded in your favourite eight-bit
system, you're taking a huge risk.
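
A small sketch of why that is risky, assuming the file was actually
saved as UTF-8: Latin-1 assigns a character to every possible byte, so
decoding with it never raises an error. You get wrong text instead of a
warning. The name used here is just an example.

```python
# Hypothetical example: text that was really saved as UTF-8.
original = "Zoë"
raw = original.encode("utf-8")        # b'Zo\xc3\xab' (two bytes for the e-diaeresis)

# Decoding with the wrong eight-bit codec succeeds silently.
wrong = raw.decode("latin-1")         # 'ZoÃ«', one character per byte

# Decoding with the right codec recovers the original.
right = raw.decode("utf-8")
```

No exception is raised at any point, which is the problem: the mistake
only shows up when a human looks at the output.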

Now, you have a huge leg up on me, in that you actually recognize the
*words* in that piece of text. That means you can have MUCH greater
confidence in stating that it's Latin-1 than I can. But that's
precisely what I mean by "understand the data". If you, being a native
French speaker, pick up a file written in (say) Polish and encoded in
Latin-2, you'll recognize by the ASCII characters that it's not French
text, and probably you'd be able to spot that it ought to be Latin-2
rather than Latin-1. That's understanding the data, that's having more
information than just the byte patterns. A computer can't reliably do
that (just look up the "Bush hid the facts" bug if you don't believe
me), but a human often can.
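
To make the Latin-1 versus Latin-2 case concrete, here is a minimal
sketch using a Polish word chosen for the example. The same bytes decode
without error under both codecs, so nothing in the byte patterns says
which one is right; only a reader who knows what the text should look
like can tell.

```python
# Hypothetical sample: a Polish place name encoded as Latin-2 (ISO 8859-2).
word = "łódź"
raw = word.encode("iso-8859-2")           # b'\xb3\xf3d\xbc'

as_latin2 = raw.decode("iso-8859-2")      # 'łódź', the intended text
as_latin1 = raw.decode("latin-1")         # '³ód¼', valid characters, wrong word

# Neither decode raises; the program cannot tell which result is correct.
both_decoded = isinstance(as_latin1, str) and isinstance(as_latin2, str)
```

A human sees a superscript three and a fraction in the middle of a word
and knows something is off; the computer sees four valid characters
either way.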

> And having taught countless European kids how to write their very first
> program in Python, I can tell you for certain that the sort of deep
> understanding of encoding methods is not what 10 year olds who just
> want to print out the names of their friends, and their favourite
> music titles, and their favourite musicians want to know. :)

Right, so you should be teaching them to use Python 3, and always
saving everything in UTF-8, and basically ignoring the whole mess of
eight-bit encodings :)
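
In Python 3 that advice might look like the sketch below. The file name
and the list of names are invented for the example; the point is only
that the encoding is stated explicitly on both write and read rather
than left to the platform default.

```python
import os
import tempfile

# Hypothetical data: friends' names with non-ASCII letters.
names = ["Zoë", "Łukasz", "Björk"]

# A throwaway location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "friends.txt")

# Always name the encoding when writing...
with open(path, "w", encoding="utf-8") as f:
    for name in names:
        f.write(name + "\n")

# ...and name the same encoding when reading back.
with open(path, encoding="utf-8") as f:
    loaded = [line.rstrip("\n") for line in f]
```

With the encoding pinned to UTF-8 on both sides, the names round-trip
unchanged regardless of the machine's locale settings.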

ChrisA


