Python 3.0 automatic decoding of UTF16

Joe Strout joe at strout.net
Fri Dec 5 14:00:59 EST 2008


On Dec 5, 2008, at 11:36 AM, Johannes Bauer wrote:

>> I suspect that the '?' after \n (\u0a00) indicates not 'question-mark'
>> but 'uninterpretable as a utf16 character'.  The traceback below
>> confirms that.  It should be an end-of-file marker and should not be
>> passed to Python.  I strongly suspect that whatever wrote the file
>> screwed up the (OS-specific) end-of-file marker.  I have seen this
>> occasionally on Dos/Windows with ascii byte files, with the same
>> symptom of reading random garbage past the end of the file.  Or perhaps
>> end-of-file does not work right with utf16.
>
> So UTF-16 has an explicit EOF marker within the text?

No, it does not.  I don't know what Terry's thinking of there, but  
text files do not have any EOF marker.  They start at the beginning  
(sometimes including a byte-order mark), and go till the end of the  
file, period.

> I cannot find one in original file, only some kind of starting
> sequence I suppose (0xfeff).

That's your byte-order mark (BOM).
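
For what it's worth, here is a minimal sketch of how the BOM behaves when decoding in Python. The sample bytes are made up for illustration (a big-endian BOM followed by "hi\r\n"); they are not taken from your file.

```python
import codecs

# Hypothetical sample data: a big-endian BOM followed by "hi\r\n".
data = codecs.BOM_UTF16_BE + "hi\r\n".encode("utf-16-be")

# The generic "utf-16" codec uses the BOM to pick the byte order
# and does not include it in the decoded text.
text = data.decode("utf-16")

# Naming the byte order explicitly leaves the BOM in the result
# as the character U+FEFF.
text_with_bom = data.decode("utf-16-be")

print(repr(text))
print(repr(text_with_bom))
```

So the BOM is only a hint about byte order at the start of the data; it says nothing about where the file ends.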

> The last characters of the file are 0x00 0x0d 0x00 0x0a,
> simple \r\n line ending.

Sounds like a perfectly normal file to me.
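
As a quick check, and assuming the four bytes appear in the file in exactly the order you quoted them (which would mean big-endian data; that is my assumption, not something I have verified), they decode cleanly to a line ending with nothing left over:

```python
# The last four bytes as quoted, taken literally in that order (assumption).
tail = bytes([0x00, 0x0D, 0x00, 0x0A])

# Read as big-endian UTF-16, these are exactly two code units: CR and LF.
decoded = tail.decode("utf-16-be")

# Read with the opposite byte order, the same bytes give two unrelated
# characters instead, so getting the byte order right matters.
swapped = tail.decode("utf-16-le")

print(repr(decoded))
print(repr(swapped))
```

Either way the data is a whole number of 16-bit code units, so there are no stray trailing bytes that a decoder would need to treat specially.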

It's hard to imagine, but it looks to me like you've found a bug.

Best,
- Joe



