I've got the unicode blues

Martin von Loewis loewis at informatik.hu-berlin.de
Tue Mar 5 04:36:26 EST 2002


gerson.kurz at t-online.de (Gerson Kurz) writes:

> Coming from a lowlevel assembler/C background, the most intuitive way
> of understanding the whole messy thing is this: ASCII characters are
> one byte each, UNICODE characters are two byte each. That is not
> correct, but its a pragmatic way of viewing things that has worked so
> far on the Windows implementation of UNICODE. 

Notice, however, that for Unicode there can be a difference between
the on-disk representation and the in-memory representation. Also
notice that several different on-disk representations are in use
(UTF-8, UTF-16, UTF-32, ...).
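
For example, the same six-character string comes out at a different
size in each of these encodings (a quick interactive sketch, Python
2.x syntax):

  >>> s = u'M\xfcller'           # u"Müller": six characters in memory
  >>> len(s.encode('utf-8'))     # the umlaut takes two bytes in UTF-8
  7
  >>> len(s.encode('utf-16'))    # two bytes per character, plus a two-byte BOM
  14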

> The first two initial bytes look suspicious, but all bytes after that
> are like expected: two bytes per character, and the latin letters look
> like their ASCII counterpart. Some searching at http://www.unicode.org
> reveals that the first two bytes are identifiers for UNICODE files
> like this, called "BOM". 

Correct.

> Back then, in the times of 2.1, two solutions were proposed, neither
> of which work:
> 
> >>> unicode(open('test.reg').read(), 'utf-8')
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
> UnicodeError: UTF-8 decoding error: unexpected code byte

That can't work; the file is not encoded in UTF-8. UTF-8 is a Unicode
encoding in which characters do *not* all take two bytes: characters
from the ASCII subrange need only one byte, umlauts need two bytes,
and Chinese characters even need three bytes.
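
For instance (an interactive sketch):

  >>> u'a'.encode('utf-8')        # ASCII subrange: one byte
  'a'
  >>> u'\xe4'.encode('utf-8')     # a-umlaut: two bytes
  '\xc3\xa4'
  >>> u'\u4e2d'.encode('utf-8')   # a Chinese character: three bytes
  '\xe4\xb8\xad'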

> OK, after a bit of searching I suspect I might have to go for utf-16,
> because that seems (to my limited UNICODE knowledge) like its the
> two-byte-codec I was looking for:
> 
> >>> encode, decode, reader, writer = codecs.lookup('utf-16')
> >>> f = reader(open('test.reg'))
> >>> print f.readlines()
> []

That should work, in theory. However, a better way to spell it is

f = codecs.open('test.reg', encoding='utf-16')
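
With that, reading the whole file might look like this (a sketch,
assuming Python 2.2):

  import codecs
  f = codecs.open('test.reg', encoding='utf-16')
  lines = f.readlines()   # a list of Unicode strings; the codec consumes the BOM
  f.close()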

> Those are unexpected results in my view. I suspect the reason is the
> BOM is not handled by those functions. 

There was a bug in the UTF-16 stream reader of Python 2.1, where it
would not remember the byte order across .readline invocations. That
bug is fixed in Python 2.2.

> My take on this is, that the builtin file-readlines() *should* really
> know about BOM and return UNICODE strings if the file has a BOM. 

They do now.

> I volunteer to patch the readlines function, if some other people
> out there feel that this is right, too.

Please try Python 2.2 first; if there is any remaining problem,
report it to SF.

> My first foolish attempt:
> 
> lines = ReadLinesFromAnything("test.reg")
> file = open("test.out","wb")
> assert(type(lines[0])==types.UnicodeType)
> file.write(lines[0])
> file.close()
> 
> gives me the first line, all right, but in ASCII, not UNICODE! 

Yes, a Unicode string must be converted to a byte string before you
can write it to a file. To do that, Python uses the encoding returned
by sys.getdefaultencoding(); in the standard installation, that is
"ascii". It was considered the most useful value for a general-purpose
encoding, since it is also used when you do things like

  unistring += r"HKLM\foo\bar"

Here, the byte string on the right-hand-side must be converted to a
Unicode string before the two can be added - this is another place
where the default encoding is used.

Normally, you'll get an error if the default encoding is used in the
"wrong" place; this appears to be one of the few cases where it
silently does the wrong thing.
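
You can see the same mechanism interactively (a sketch; the exact
wording of the error differs between versions):

  >>> u'abc' + 'def'   # byte string implicitly decoded with the default "ascii"
  u'abcdef'
  >>> u'abc' + '\xe4'  # a non-ASCII byte: the implicit decoding fails
  Traceback (most recent call last):
    ...
  UnicodeError: ASCII decoding error: ordinal not in range(128)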

> file.write(unicode("test3","utf-16"))
> 
> which raises the following exception:
> 
> File "D:\Scripts\2002\02\read-unicode-lines.py", line 20, in ?
>     file.write(unicode("test3","utf-16"))
> UnicodeError: UTF-16 decoding error: truncated data

No surprise: the byte sequence '\x74\x65\x73\x74\x33' ("test3") is
five bytes long, an odd number, so it cannot be a valid UTF-16
encoding.
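
By contrast, a byte string that really was produced by the UTF-16
codec decodes cleanly (sketch):

  >>> unicode(u'test3'.encode('utf-16'), 'utf-16')
  u'test3'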

> when I desperately try this:
> 
> data = unicode(lines[0],"utf-16")
> file.write(data)
> 
> I get the exception
> 
> File "D:\Scripts\2002\02\read-unicode-lines.py", line 20, in ?
>     data = unicode(lines[0],"utf-16")
> TypeError: decoding Unicode is not supported

No surprise either: unicode() is a function that constructs a Unicode
string, given a byte string (similar to int(), list(), or
tuple()). What you want to do is to create a byte string, given a
Unicode string; to do that, invoke the string's .encode method

  file.write(lines[0].encode("utf-16"))
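
In short (a sketch): unicode() and .decode() go from bytes to
Unicode, .encode() goes from Unicode to bytes:

  u = unicode('\xc3\xa4', 'utf-8')   # bytes -> Unicode: u'\xe4'
  s = u.encode('utf-16')             # Unicode -> bytes: BOM plus the UTF-16 data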

> The hexdump looks OK, too. But, when I try to write multiple strings,
> I run into trouble again, because each string is prefixed with the
> BOM, and not the file only:
> 
> lines = ReadLinesFromAnything("test.reg")
> file = open("test.out","wb")
> for line in lines:
>     file.write(line.encode("utf-16"))
> file.close()

In Python 2.2, this can be rewritten as

lines = codecs.open('test.reg', encoding='utf-16').readlines()
file = codecs.open('test.out', 'w', encoding='utf-16')
file.writelines(lines)

Due to the bug in Python 2.1, you do indeed get multiple copies of the
BOM; to work around this, you can do

lines = ReadLinesFromAnything("test.reg")
file = open("test.out","wb")
file.write(codecs.BOM_LE)
for line in lines:
    file.write(line.encode("utf-16le"))
file.close()

The LE/BE codecs don't write a BOM, so they work nicely with
sequential writes. In fact, this is what the UTF-16 codec does
internally: it first writes the BOM, then writes the data in the
chosen endianness.
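
To convince yourself that this produces exactly one BOM, a quick
check could look like this (a sketch; good enough for ASCII-range
content):

  import codecs
  data = open('test.out', 'rb').read()
  assert data[:2] == codecs.BOM_LE            # one BOM at the very start
  assert data.find(codecs.BOM_LE, 2) == -1    # and no further BOMs in the body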

> - *much* better support for UNICODE Textfiles in python
> - a *much* better documentation on this in python.
> 
> So now I already feel much better :)

I hope I could answer some of your concerns. Contributions of
documentation are welcome. To the original authors of the Python
Unicode infrastructure, it is just not clear what problems people run
into, as it is not clear how people *expect* these things to work.

Regards,
Martin


