read a unicode file

Alan Kennedy alanmk at hotmail.com
Tue Jun 10 06:53:02 EDT 2003


Alan Kennedy:

>> Why would the 'utf-16' codec not support readline?

Martin v. Löwis:

> The problem is that the codec's .readline usually invokes the
> .readline of the underlying stream. For UTF-16, this fails, since
> .readline of the stream sometimes will break at the next \n character,
> which means that there might be a dangling second byte (which might
> not be NUL, in which case .readline has misinterpreted the \n byte).
> On other systems, .readline may fail to find a \r\n sequence (since
> there are interspersed NUL bytes), or it may find a \r\n sequence,
> but that would not be a line break, but the character U+0D0A (or
> U+0A0D, depending on byte order).

Thanks Martin.

But I'm still confused. I understand why the underlying "readline" cannot be
relied upon: it doesn't know that it's dealing with 2-byte chars, so it would
mistake any character whose high (or low) byte is 0x0A ("\n") for a line
break.
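
To make that failure concrete, here is a minimal sketch. It assumes modern
Python 3 syntax and little-endian UTF-16 without a BOM; the sample strings are
made up purely for illustration.

```python
# Encode two short lines as little-endian UTF-16 (no BOM).
text = "ab\ncd\n"
raw = text.encode("utf-16-le")

# Splitting the raw bytes on b"\n", as a byte-oriented readline would,
# cuts each U+000A in half: its NUL high byte is left dangling at the
# start of the next chunk.
chunks = raw.split(b"\n")

# U+010A has 0x0A as its low byte, so a byte-oriented scan sees a
# line feed inside a character that is not a line break at all.
false_hit = "\u010a".encode("utf-16-le")
has_fake_newline = b"\n" in false_hit
```

The second chunk comes out with an odd number of bytes, so it can no longer be
decoded as UTF-16 on its own.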

But once you know that you're dealing with 2-byte chars, shouldn't it be a lot
easier? I thought that UTF-16 represented every single character as 16 bits,
which would mean that line endings should be easy to recognise:

\n == U+000A
\r == U+000D

I know that Unicode is much more complex than a simple integer -> glyph mapping.
However, it seems to me that in this situation, it should be straightforward.
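
What I have in mind can be sketched roughly as follows. This is only an
illustration under stated assumptions, not how the codec is implemented: the
function name split_utf16_lines and its byteorder parameter are hypothetical,
the input is assumed to carry no BOM, and only U+000A is treated as a line
terminator (a preceding U+000D simply stays in the line).

```python
import struct


def split_utf16_lines(raw, byteorder="little"):
    """Split BOM-less UTF-16 bytes into lines by scanning 16-bit units.

    Hypothetical helper for illustration only.
    """
    fmt = "<H" if byteorder == "little" else ">H"
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    lines = []
    start = 0
    # Step two bytes at a time so that 0x0A is only ever compared
    # against a whole code unit, never against half of one.
    for pos in range(0, len(raw) - 1, 2):
        (unit,) = struct.unpack_from(fmt, raw, pos)
        if unit == 0x000A:
            lines.append(raw[start:pos + 2].decode(codec))
            start = pos + 2
    # Keep any trailing text that has no final line feed.
    if start < len(raw):
        lines.append(raw[start:].decode(codec))
    return lines
```

One caveat: UTF-16 does use two 16-bit units (a surrogate pair) for characters
outside the Basic Multilingual Plane. Those units always lie in the range
0xD800 to 0xDFFF, so none of them can equal 0x000A and the scan above still
finds the right line breaks.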

-- 
alan kennedy
-----------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan:              http://xhaus.com/mailto/alan



