read a unicode file
Alan Kennedy
alanmk at hotmail.com
Tue Jun 10 06:53:02 EDT 2003
Alan Kennedy:
>> Why would the 'utf-16' codec not support readline?
Martin v. Löwis:
> The problem is that the codec's .readline usually invokes the
> .readline of the underlying stream. For UTF-16, this fails, since
> .readline of the stream sometimes will break at the next \n character,
> which means that there might be a dangling second byte (which might
> not be NUL, in which case .readline has misinterpreted the \n byte).
> On other systems, .readline may fail to find a \r\n sequence (since
> there are interspersed NUL bytes), or it may find a \r\n sequence,
> but that would not be a line break, but the character U+0A0D (or U+0D0A, depending on byte order).
Thanks Martin.
But I'm still confused. I understand why the underlying "readline" cannot be
relied upon: it doesn't know that it's dealing with 2-byte chars, so it would
misread any character whose high (or low) byte happens to be 0x0A ("\n").
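To make that failure concrete, here is a minimal Python sketch. The sample string and the names (text, naive_pieces, try_decode) are made up for illustration, and little-endian UTF-16 without a BOM is assumed:

```python
# Illustrative only: sample text chosen so that one character contains the byte 0x0A.
text = "a\u010a\nb"                  # U+010A encodes as the bytes 0A 01 in UTF-16-LE
data = text.encode("utf-16-le")      # 61 00 | 0A 01 | 0A 00 | 62 00

# A byte-oriented readline effectively splits on the single byte 0x0A.
naive_pieces = data.split(b"\n")

def try_decode(piece):
    """Return the decoded text, or None if the piece is not valid UTF-16-LE."""
    try:
        return piece.decode("utf-16-le")
    except UnicodeDecodeError:
        return None

# Pieces cut in the middle of a 16-bit unit no longer decode cleanly.
decoded = [try_decode(p) for p in naive_pieces]

# The byte pair 0D 0A is not a line break either when read as one 16-bit unit.
crlf_as_le = b"\r\n".decode("utf-16-le")
crlf_as_be = b"\r\n".decode("utf-16-be")
```

The split lands inside U+010A and again on the real line feed, so two of the three pieces have an odd number of bytes and cannot be decoded.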
But once you know that you're dealing with 2-byte chars, should it not be a lot
easier? I thought that UTF-16 represented every single character as 16 bits,
which would mean that line endings should be easy to recognise:
\n == U+000A
\r == U+000D
I know that Unicode is much more complex than a simple integer -> glyph mapping.
However, it seems to me that in this situation, it should be straightforward.
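The idea of looking for U+000A only after decoding could be sketched roughly as follows. This is an assumption-laden illustration, not how the codec's own readline works: iter_utf16_lines and chunk_size are hypothetical names, and the code relies on codecs.getincrementaldecoder from a current Python, which buffers incomplete 16-bit units between chunks:

```python
import codecs
import io

def iter_utf16_lines(stream, encoding="utf-16", chunk_size=4096):
    """Yield decoded lines from a binary stream holding UTF-16 data.

    Hypothetical helper: bytes are decoded first, and only the decoded
    text is searched for U+000A. A "\r" before "\n" is left in the line.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # The decoder holds back any incomplete trailing bytes until more arrive.
        pending += decoder.decode(chunk)
        lines = pending.split("\n")
        pending = lines.pop()          # last item is an unfinished line (may be "")
        for line in lines:
            yield line + "\n"
    # Flush the decoder; this raises if the data ended mid-character.
    pending += decoder.decode(b"", final=True)
    if pending:
        yield pending

# Usage with an in-memory stream and a deliberately awkward chunk size.
sample = "a\u010a\nb\r\nlast"
stream = io.BytesIO(sample.encode("utf-16"))
lines = list(iter_utf16_lines(stream, chunk_size=3))
```

In a current Python 3, opening the file in text mode with open(path, encoding="utf-16") and iterating over it gives line-by-line reading of the decoded text without any helper like this.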
--
alan kennedy
-----------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/mailto/alan