UnicodeDecodeError issue

Wed Sep 4 23:07:43 EDT 2013

On Thu, 05 Sep 2013 00:17:36 +0000, Dave Angel wrote:

> On 4/9/2013 10:29, Ferrous Cranus wrote:
> 
>> On 4/9/2013 3:38 PM, Dave Angel wrote:
>>> 'file' isn't magic.  And again, it doesn't look at the filename, it
>>> looks at the content.
>> So, you are saying that it looks at the content of the file and not at
>> what encoding we used to save the file into?
> 
> That's right.  There's no place where your text editor stores the
> encoding it used, so 'file' has to guess, based only on the content.

Correct. The thing that people often fail to understand is that there is 
no *reliable* way to store the encoding used for a text file in the text 
file itself. The encoding is *metadata*, not data: it is data about the 
data, and consequently it has to be stored "out of band". It has to be 
stored somewhere else, outside of the file.
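
To make the "out of band" point concrete, here is a minimal sketch. The
sample text and the temporary file are my own illustration, not anything
from the original script. The file holds only bytes; the encoding has to be
handed to open() separately, and a wrong guess either fails or silently
produces different text.

```python
import os
import tempfile

text = "Άγνωστο όνομα"          # sample Greek text (illustrative only)
raw = text.encode("iso8859_7")  # the bytes that actually land on disk

# Write the bytes; nothing in the file records which encoding produced them.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(raw)
    path = f.name

try:
    # The reader has to be told the encoding from outside the file.
    with open(path, encoding="iso8859_7") as f:
        right = f.read()

    # A wrong guess can succeed silently and yield different text...
    with open(path, encoding="iso8859_5") as f:
        wrong = f.read()

    # ...or fail outright, as a UTF-8 guess does for these bytes.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True
finally:
    os.remove(path)

print(repr(right), repr(wrong), utf8_failed)
```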

In the case of text files, it is usually not stored anywhere at all. IBM 
mainframes assume that text files are using EBCDIC; modern Linux systems 
assume text files are UTF-8; old DOS applications assume text files are 
ASCII. Some text editors will try to guess the encoding, using various 
heuristics such as "if the file starts with \xFE\xFF it is UTF-16" but 
none of them are foolproof:

http://blogs.msdn.com/b/oldnewthing/archive/2004/03/24/95235.aspx

sometimes with amusing consequences:

http://www.hoax-slayer.com/bush-hid-the-facts-notepad.html
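
A rough sketch of such a heuristic, assuming nothing more than a check for
a leading byte-order mark. The function name sniff_bom is hypothetical, and
real editors use more elaborate rules than this. It shows both failure
modes: a false positive on Latin-1 text that happens to start with the same
two bytes, and no answer at all for a file without a BOM.

```python
import codecs

def sniff_bom(data):
    """Guess an encoding from a leading byte-order mark, or return None."""
    # Check the three-byte UTF-8 BOM before the two-byte UTF-16 ones.
    for bom, name in (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ):
        if data.startswith(bom):
            return name
    return None

# A genuine big-endian UTF-16 file with a BOM is detected...
utf16 = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
# ...but so is Latin-1 text that merely begins with the bytes \xFE\xFF.
latin1 = "þÿ is not a BOM".encode("latin-1")
# A file with no BOM gives the heuristic nothing to go on.
greek = "Άγνωστο".encode("iso8859_7")

for sample in (utf16, latin1, greek):
    print(sniff_bom(sample))
```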

>> But the contents have within:
>>
>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1
>> \xf3\xf\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>>
>> so it should have said greek-iso and not ascii.

But the above byte string is also valid ISO-8859-5 (Cyrillic):

'Жуэљѓєяќэяьсѓ\x0fѓєоьсєяђ\n'

ISO-8859-2 (Central European):

'śăíůóôďüíďěáó\x0fóôŢěáôďň\n'

and ISO-8859-4 (Baltic):

'ļãíųķôīüíīėáķ\x0fķôŪėáôīō\n'

Surely you don't expect the file utility to actually recognise that 
'Άγνωστοόνομασ\x0fστήματος\n' makes a valid Greek phrase while the others 
are not meaningful?
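
The decodings above can be reproduced with a few lines of Python 3. One
assumption to flag: the quoted line contains the malformed escape \xf,
which I have taken to mean \x0f, as in the strings shown above. Every
decode succeeds, which is exactly the problem: the bytes alone do not say
which result was intended.

```python
# Assumption: the malformed `\xf` in the quoted line is read as \x0f.
data = (b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1'
        b'\xf3\x0f\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')

# Greek, Cyrillic, Central European, Baltic.
encodings = ["iso8859_7", "iso8859_5", "iso8859_2", "iso8859_4"]

# None of these raises: each byte is a legal character in all four charsets.
decoded = {enc: data.decode(enc) for enc in encodings}

for enc, text in decoded.items():
    print(enc, repr(text))
```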

> No, that line is totally ASCII.  Only when it's EXECUTED by Python will
> a non ASCII byte string object be created.  Like I said, 'file' doesn't
> know the first thing about Python syntax, nor should it.

Technically, it's not ASCII, since ASCII only knows about bytes \x00 
through \x7F (decimal 0 through 127). That's why it isn't correct to 
describe Python byte strings as "ASCII strings". They're byte strings 
that happen to be displayed as ASCII-plus-other-stuff.
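
A small sketch of that distinction, using a made-up byte string of my own:
the repr shows ASCII characters where it can and \x escapes elsewhere, but
asking Python to actually decode the bytes as ASCII raises
UnicodeDecodeError at the first byte above \x7F.

```python
data = b'\xb6\xe3\xed abc'  # illustrative bytes: three above 0x7F, then ASCII

# Bytes above 0x7F have no meaning in ASCII, so decoding fails.
try:
    data.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# The repr displays printable ASCII bytes as characters and the rest as \x escapes.
shown = repr(data)

# True only if every byte falls in the ASCII range 0 through 127.
is_ascii = all(byte < 0x80 for byte in data)

print(shown, is_ascii, decode_error)
```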

-- 
Steven