UnicodeDecodeError issue

Ferrous Cranus nikos at superhost.gr
Wed Sep 4 07:38:41 EDT 2013


On 4/9/2013 2:26 PM, Dave Angel wrote:
> On 4/9/2013 04:35, Ferrous Cranus wrote:
>
>> On Monday, 2 September 2013 9:28:36 PM UTC+3, Dave Angel wrote:
>>> On 2/9/2013 11:05, Ferrous Cranus wrote:
>>>
>>>> On 2/9/2013 3:21 PM, Dave Angel wrote:
>>>>> Starting with the byte string in the error message:
>>>>>
>>>>> >>> f = open("junk.txt", "w")
>>>>> >>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>>>>> >>> f.close()
>>>>
>>>> Indeed, but yet again, 'file' checks the encoding of the filename that
>>>> consists of these lines above, not of the actual strings.
>>>
>>> 'file' does nothing interesting with the filename, it just opens it and
>>> examines the contents.  For example,
>>>
>>> file www/cgi-bin/files.py
>>>
>>> will examine the Python source file, not run it.
>>>
>>> So first in the interpreter, I ran
>>>
>>> >>> f = open("junk.txt", "w")
>>> >>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>>> >>> f.close()
>>>
>>> then at the bash prompt, I ran:
>>>
>>> davea at think2:~$ file junk.txt
>>> junk.txt: ISO-8859 text
>>
>>
>> That is one Clever Idea Dave.
>>
>> I take it that the charset of the file 'junk.txt' gets identified from the encoded characters that are read from within the file?
>
> 'file' only guesses the most likely encoding for 'junk.txt'.  But at
> least it can know it's not utf-8, since that would give a decoding
> error.
>
> That's why, whenever 'file' makes its verdict, it's up to you to check
> it by displaying the data after decoding it with that tentative
> encoding.
>
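A minimal sketch of that check, assuming Python 3: decode the bytes from the error message with a few tentative encodings and look at each result. The helper name try_decode and the list of candidate encodings are illustrative assumptions, not something taken from this thread.

```python
# Bytes copied from the error message quoted in this thread.
data = b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n'


def try_decode(raw, encodings):
    """Map each encoding name to the decoded text, or None if decoding fails."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# The candidate list is an assumption chosen for illustration.
results = try_decode(data, ["utf-8", "iso-8859-7", "iso-8859-1"])
for name, text in results.items():
    print(name, "->", repr(text))
```

Only utf-8 fails outright. Both ISO-8859 candidates decode without error, so the only way to choose between them is to read the output and see which one gives sensible Greek.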
>>
>> But wait a minute: What editor do you use to write these 3 lines?
>> I mean, I am a bit confused.
>
> As I said right above, "in the interpreter, I ran"...
> And if that's not clear enough, you can see the >>> prompts that the
> Python interpreter uses.  By interpreter, I mean I ran Python with no
> parameters.  I did not run IDLE or any other IDE, that might take it
> upon itself to interfere.
>
>
>>
>> I, for example, run 'nano tets.py', which has within:
>>
>> f = open("junk.txt", "w")
>> f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
>> f.close()
>>
>> then when I save the file within nano, for example, by default in the utf-8 charset
>
> That's the encoding for the file tets.py, and you'll notice that it's
> actually ASCII.  Notice that the string I copied from the error message
> uses escape sequences for all non-ASCII bytes.
>
>>
>> how would it be able to detect that the bytestring within is supposed to be in the Greek ISO encoding?
>
> I wouldn't be running 'file' on the tets.py file, but on the junk.txt
> file created when you run
>      python tets.py
>
> So since the tets.py file was a sidetrack, I just ran those three lines
> in the interpreter.
>
I'm still confused about this.

Say we save those 3 lines inside junk.txt, and we save it by default as utf-8.

When we run 'file junk.txt', what will file respond with?

The filename's charset?

Or will it look at the bytestring within to decide what encoding it uses?
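
One way to see that only the contents matter is the rough sketch below, assuming Python 3. It stores two different contents under the same name, junk.txt, and classifies each by its bytes alone. rough_guess is a hypothetical, much-simplified stand-in, not the actual logic of the 'file' command.

```python
import os
import tempfile


def rough_guess(path):
    """Hypothetical content check: looks only at the bytes, never at the name."""
    with open(path, "rb") as handle:  # binary mode: raw bytes, no decoding
        raw = handle.read()
    if all(byte < 128 for byte in raw):
        return "ascii"
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "some 8-bit encoding"


# The three lines as typed into an editor; the raw string keeps the
# backslash escapes as literal ASCII characters.
script_text = r"""f = open("junk.txt", "w")
f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
f.close()
"""

# The bytes that the f.write() call in those lines is meant to produce.
payload = b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n'

guesses = {}
with tempfile.TemporaryDirectory() as folder:
    cases = [("script text", script_text.encode("utf-8")),
             ("script output", payload)]
    for label, content in cases:
        subdir = os.path.join(folder, label.replace(" ", "_"))
        os.makedirs(subdir)
        path = os.path.join(subdir, "junk.txt")  # same file name both times
        # "wb" because Python 3 rejects bytes written to a text-mode file.
        with open(path, "wb") as handle:
            handle.write(content)
        guesses[label] = rough_guess(path)

print(guesses)
```

The three lines saved as text contain only ASCII characters, and ASCII text is byte-for-byte the same when saved as utf-8. The bytes written by running them are a different matter: they are not valid utf-8.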


-- 
Webhost <http://superhost.gr>


