Newbie question about text encoding

Tue Feb 24 12:13:24 EST 2015

On 02/24/2015 11:20 AM, Laura Creighton wrote:
> In a message of Wed, 25 Feb 2015 02:33:30 +1100, Chris Angelico writes:
>> Also a reasonable baseline assumption; but the trouble is that if you
>> automatically assume that text is encoded in your favourite eight-bit
>> system, you're taking a huge risk.
>
> But, you know, I wasn't assuming this.  I actually read latin1.  I
> could read it in ascii, know that \xe9  means 'é', a letter combination
> that we have in Swedish, so I am rather used to reading, and then
> well, I could read all of his strings, know they were in French,
> and know that latin1 was what he needed things to be decoded to.

With a sample of one string, how did you read "all his strings"?  And 
with one non-ASCII byte in that single string, how did you know that 
'latin1' was the only encoding with a plausible character at that 
byte value?
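
To make that concrete, here is a small sketch (not anything from the 
OP's code) that decodes the single byte \xe9 with several single-byte 
codecs shipped with CPython.  The list of candidates is an arbitrary 
choice made for illustration.

```python
# One byte, several candidate single-byte codecs (all ship with CPython).
raw = b"\xe9"
candidates = ["latin-1", "cp1252", "iso8859_2", "iso8859_15", "cp437", "koi8_r"]

# Decode the same byte under every candidate codec.
decoded = {name: raw.decode(name) for name in candidates}
for name, char in decoded.items():
    print(f"{name:12} {char!r}")

# Codecs that agree with the latin-1 reading of this byte.
same_as_latin1 = [n for n, c in decoded.items() if c == decoded["latin-1"]]
print(same_as_latin1)
```

Several of these codecs give the same letter as latin-1, so the byte 
alone cannot single out latin-1; the surrounding French words do that.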

>
>> Now, you have a huge leg up on me, in that you actually recognize the
>> *words* in that piece of text. That means you can have MUCH greater
>> confidence in stating that it's Latin-1 than I can. But that's
>> precisely what I mean by "understand the data". If you, being a native
>> French speaker, pick up a file written in (say) Polish, and encoded
>> Latin-2, you'll recognize by the ASCII characters that it's not French
>> text, and probably you'd be able to spot that it ought to be Latin-2
>> rather than Latin-1. That's understanding the data, that's having more
>> information than just the byte patterns. A computer can't reliably do
>> that (just look up the "Bush hid the facts" bug if you don't believe
>> me), but a human often can.
>
> Absolutely correct.  But you must not require that all of the speakers
> of non-English languages think about their languages as 'special
> encodings'.  Only the monoglot ever think of a foreign language as
> a code.

All languages are foreign.  All that can be written to a disk file are 
bytes.  Those bytes are an encoding of some abstraction, namely a 
string of characters drawn from a character set.  The question is 
whether the encoding method is specified for the file type, or for the 
particular file.
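
As a minimal illustration of "only bytes reach the disk", the same 
one-character string becomes different bytes under different 
encodings.  The variable names here are just for the example.

```python
text = "é"

# The same abstract character, encoded two different ways.
as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")
print(as_latin1, as_utf8)

# Reading UTF-8 bytes with the wrong decoder does not fail; it silently
# yields different characters.
misread = as_utf8.decode("latin-1")
print(misread)
```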

See http://support.esri.com/cn/knowledgebase/techarticles/detail/21106

According to that page, starting with ArcGIS 10.2.1, the default sets 
the code page to UTF-8 (Unicode) in the shapefile (.DBF).

But in earlier versions, there's supposed to be a reference to the 
code page used.  From that, one can presumably derive which decoder to 
use.
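
A rough sketch of that last step, assuming the code page reference has 
already been extracted as a text label such as "1252" or "UTF-8".  The 
label format and the function name codec_for_label are assumptions 
made for illustration; how the label is actually stored in a given 
shapefile is not covered here.

```python
import codecs


def codec_for_label(label):
    """Map a code page label to a Python codec name, or None if unknown.

    Hypothetical helper: assumes a bare number means a Windows/DOS code
    page (e.g. "1252" -> "cp1252") and anything else is a codec alias.
    """
    label = label.strip()
    if label.isdigit():
        label = "cp" + label
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


print(codec_for_label("1252"))
print(codec_for_label("UTF-8"))
print(codec_for_label("no-such-codepage"))
```

Returning None rather than guessing keeps the "unknown encoding" case 
visible to the caller.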

>
> That poor guy the original poster just wants to have a nice string
> of his sporting event place name.  We should tell him how to get that,
> not how to be an expert in all the encodings on the face of this earth.
> Chances are, the only thing he needs to talk about are French words.
>
> If not, well, he will come back when things stop working, and have lots
> more data to give him.  If, instead, this makes him go away happy, then
> this was the very best thing to do.
>
>>> And having taught countless European kids how to write their very first
>>> program in Python, I can tell you for certain that the sort of deep
>>> understanding of encoding methods is not what 10 year olds who just
>>> want to print out the names of their friends, and their favourite
>>> music titles, and their favourite musicians want to know. :)
>>
>> Right, so you should be teaching them to use Python 3, and always
>> saving everything in UTF-8, and basically ignoring the whole mess of
>> eight-bit encodings :)
>
> Of course this makes sense.  But you seem to be missing the point.
> People who are asking for help in getting things to work in their
> native language need a 'do this quick' sort of answer.  The deeper
> problems of supporting all languages and language encodings can very
> much wait.  The OP wants a hunk of bytes that happens to mean
> something in French, and is not encodable in the limited English
> language to work like a different hunk of bytes that means something
> in French but is encodable.
>
> Don't overburden them.

My guess is that this is only appropriate for users who use only locally 
created data.  Since the OP's data is apparently old (if it had been 
written by a current version, it would have been UTF-8), who knows how 
consistent the encoding is?
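
One hedged way to cope with data of uncertain consistency is to try a 
strict UTF-8 decode first and fall back to a single-byte codec only 
when that fails, recording which one was used.  The function name 
decode_field and the latin-1 fallback are assumptions for this sketch, 
not something known about the OP's files.

```python
def decode_field(raw, fallback="latin-1"):
    """Return (text, codec_used) for one field's raw bytes.

    Strict UTF-8 rejects most non-UTF-8 byte sequences, so a successful
    decode is decent evidence; the fallback is only a guess.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


print(decode_field("Montréal".encode("utf-8")))
print(decode_field(b"Montr\xe9al"))
```

Counting how often the fallback is used across a file gives a rough 
idea of how consistent its encoding really is.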

-- 
DaveA