UnicodeDecodeError issue

Dave Angel davea at davea.name
Wed Sep 4 07:26:18 EDT 2013


On 4/9/2013 04:35, Ferrous Cranus wrote:

> On Monday, 2 September 2013 9:28:36 PM UTC+3, Dave Angel wrote:
>> On 2/9/2013 11:05, Ferrous Cranus wrote:
>>
>> > On 2/9/2013 3:21 PM, Dave Angel wrote:
>> >> Starting with the byte string in the error message:
>> >>>>> f = open("junk.txt", "w")
>> >>>>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>> >>>>> f.close()
>> >
>> > Indeed, but yet again, file checks out the encoding of the filename that
>> > consists of these lines above, not of the actual strings.
>>
>> 'file' does nothing interesting with the filename, it just opens it and
>> examines the contents.  For example,
>>
>> file www/cgi-bin/files.py
>>
>> will examine the Python source file, not run it.
>>
>> So first in the interpreter, I ran
>>
>> >>>> f = open("junk.txt", "w")
>> >>>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>> >>>> f.close()
>>
>> then at the bash prompt, I ran:
>>
>> davea@think2:~$ file junk.txt
>> junk.txt: ISO-8859 text
>
>
> That is one Clever Idea Dave.
>
> I take it that the charset of the file 'junk.txt' gets identified by the encoding of the characters that are read from within the file?

'file' only guesses the most likely encoding for 'junk.txt'.  But at
least it can know it's not utf-8, since that would give a decoding
error.

That's why, whenever 'file' makes its verdict, it's up to you to check
it by displaying the data after decoding it with that tentative
encoding.
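
To make that check concrete, here is a minimal sketch using the byte
string from the error message.  'file' only reported "ISO-8859 text"
without saying which part of ISO-8859, so the choice of iso-8859-7
(Greek) below is an assumption based on the context of this thread,
not something 'file' said.

```python
# Bytes copied from the error message quoted earlier in this thread.
raw = (b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 '
       b'\xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')

# utf-8 can be ruled out: decoding these bytes raises an error.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("not utf-8:", exc.reason)

# Tentative encoding (an assumption): Greek, ISO-8859-7.
text = raw.decode("iso-8859-7")

# Display the result and judge by eye whether it is readable Greek.
print(text)
```

Note that decoding with iso-8859-1 would also succeed without any
error, because every byte value maps to some character there.  A
successful decode alone proves little, which is why looking at the
decoded text is the real test.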

>
> But wait a minute: What editor do you use to write these 3 lines?
> I mean, I am a bit confused.

As I said right above, "in the interpreter, I ran"...
And if that's not clear enough, you can see the >>>> prompts that the
Python interpreter uses.  By interpreter, I mean I ran Python with no
parameters.  I did not run IDLE or any other IDE that might take it
upon itself to interfere.


>
> I, for example, run 'nano tets.py', which has within:
>
> f = open("junk.txt", "w") 
> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n') 
> f.close() 
>
> then when I save the file within nano, for example, by default in the utf-8 charset

That's the encoding for the file tets.py, and you'll notice that it's
actually ASCII.  Notice that the string I copied from the error message
uses escape sequences for all non-ASCII bytes.
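
A small sketch of that distinction, assuming Python 3.  The variable
names are mine, for illustration only.  It checks that the text of the
f.write() line is pure ASCII, while the bytes that the literal stands
for are not.

```python
import ast

# The text of the f.write() line, exactly as it would sit in tets.py.
source_line = r"f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')"

# Every character of the source text is plain ASCII ...
source_is_ascii = all(ord(ch) < 128 for ch in source_line)

# ... but the bytes literal it spells out holds non-ASCII byte values.
literal = source_line[len("f.write("):-1]   # just the b'...' part
data = ast.literal_eval(literal)            # evaluate the literal safely
data_is_ascii = all(b < 128 for b in data)

print(source_is_ascii, data_is_ascii)
```

So whatever encoding nano uses to save tets.py, ASCII-only text comes
out as the same bytes in utf-8, and the escape sequences are only
turned into the actual byte values when Python runs the script.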

>
> how would it be able to detect that the bytestring within is supposed to be in the Greek ISO encoding?

I wouldn't be running 'file' on the tets.py file, but on the junk.txt
file created when you run
    python tets.py

So since the tets.py file was a sidetrack, I just ran those three lines
in the interpreter.
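
For anyone repeating those three lines on Python 3, a hedged variant:
there a file opened with mode "w" is a text file, and writing a bytes
object to it raises TypeError, so the sketch below opens the file in
binary mode instead.  The temporary directory is my own addition, only
so the example does not overwrite an existing junk.txt.

```python
import os
import tempfile

# Bytes copied from the error message quoted earlier in this thread.
raw = (b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 '
       b'\xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')

# Work in a temporary directory (an assumption, to avoid clobbering files).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "junk.txt")

    # "wb" writes the bytes to disk unchanged, with no encoding step.
    with open(path, "wb") as f:
        f.write(raw)

    # Read the file back in binary mode to see exactly what is on disk.
    with open(path, "rb") as f:
        on_disk = f.read()

print(on_disk == raw)
```

Written to a permanent location instead, the resulting file is what
you would then hand to 'file junk.txt' at the shell prompt.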

-- 
DaveA




