I've got the unicode blues

Gerson Kurz gerson.kurz at t-online.de
Tue Mar 5 02:25:41 EST 2002


I have a C++ tool that takes two Windows .REG files and computes the
differences between them. The tool needs some reworking, so I thought:
why not rewrite it in Python? After all, Python Is Cool (TM).

The trouble starts with Windows 2000: REGEDIT files are now UNICODE.
Now, I have never had to dig very deep into UNICODE, so let me first
recap my knowledge of that. 

Coming from a low-level assembler/C background, the most intuitive way
of understanding the whole messy thing is this: ASCII characters are
one byte each, UNICODE characters are two bytes each. That is not
correct, but it's a pragmatic way of viewing things that has worked so
far with the Windows implementation of UNICODE. (There are several
variants of UNICODE strings in use in Windows, the most notable
difference being that NT kernel strings have their length encoded in
the first word; however, all UNICODE characters seem to be two bytes
each, probably because that is the easiest and fastest way to process
them.)

Let's take a look at a hexdump of such a REGEDIT-generated file.

000000: FFFE5700 69006E00 64006F00 77007300 ..W.i.n.d.o.w.s.
000010: 20005200 65006700 69007300 74007200  .R.e.g.i.s.t.r.
... (and so on)...

The first two bytes look suspicious, but everything after that is as
expected: two bytes per character, and the latin letters look like
their ASCII counterparts. Some searching at http://www.unicode.org
reveals that those first two bytes identify UNICODE files like this;
they are called a "BOM" (see
http://www.unicode.org/unicode/uni2book/u2.html, Chapter 2.7, "Byte
Order Mark").

So, the "old" C++ programm, when it sees that the first two bytes are
0xFFFE, converts the rest of the file to ASCII characters, using a
Windows function (WideCharToMultiByte), and then uses the normal C
string functions on ASCII characters. Of course, the international
characters UNICODE was invented for in the first place will get lost
in such a situation, but it seemed to work fine for my needs. You can
see the original C++ solution including the sourcecode here: 

http://www.p-nand-q.com/tools/regdiff.htm

OK, that is basically my knowledge on the subject. Limited, granted,
but it has sufficed so far in my pythonless past. 

Task 1: Reading in (such) a UNICODE file. 

I asked about this before, a long time ago, see
http://groups.google.de/groups?selm=ku4rzpsb67.fsf%40lasipalatsi.fi&output=gplain.
Back then, in the days of 2.1, two solutions were proposed, neither
of which works:

>>> unicode(open('test.reg').read(), 'utf-8')
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte

and

>>> import codecs
>>> encode, decode, reader, writer = codecs.lookup('utf-8')
>>> f = reader(open('test.reg'))
>>> print f.readlines()
[]

OK, after a bit of searching I suspect I might have to go for utf-16,
because that seems (to my limited UNICODE knowledge) like it's the
two-byte codec I was looking for:

>>> encode, decode, reader, writer = codecs.lookup('utf-16')
>>> f = reader(open('test.reg'))
>>> print f.readlines()
[]

Those are unexpected results, in my view. I suspect the reason is that
the BOM is not handled by those functions. Here is a working solution:

def ReadLinesFromAnything( filename ):
    # binary mode, so the raw bytes survive untouched on Windows
    file = open(filename, "rb")
    data = file.read()
    file.close()
    if data[:2] == '\xff\xfe':
        # little-endian BOM: strip it, decode the rest as UTF-16 LE
        return unicode(data[2:],"utf-16-le").split("\n")
    else:
        return data.split("\n")

print ReadLinesFromAnything("test.reg")
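
Maybe I was just holding those codecs wrong, by the way: perhaps
something like the following would work, assuming the utf-16
StreamReader strips the BOM on its own and the file is opened in
binary mode so the codec gets the raw bytes (an untested guess on my
part):

import codecs

def ReadUnicodeLines(filename):
    # untested guess: the utf-16 codec should detect and strip the BOM;
    # binary mode so the raw bytes reach the codec untouched
    f = codecs.open(filename, "rb", "utf-16")
    lines = f.readlines()
    f.close()
    return lines

If that really works, most of Task 1 goes away; I have not tried it on
a real .REG file, though.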

Still, my take on this is that the built-in file.readlines() really
*should* know about the BOM and return UNICODE strings if the file has
one. After all, if you call readlines() on a file, you expect it to
contain lines of strings (either oldskool or UNICODE); you don't call
readlines() on a binary file in the first place. I volunteer to patch
the readlines function if other people out there feel this is right,
too. (Of course, with my - thanks to c.l.p. - newly gained knowledge
about subclassing file objects, I can always use my own file class; so
I would like to argue that at the very least some such file class
should be part of the standard Python library.)
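
Something along these lines is what I have in mind - a rough sketch
that only handles the little-endian BOM REGEDIT writes (the class name
is just made up):

import codecs

class RegFile(file):
    # rough sketch: readlines() returns unicode lines if the file
    # starts with the little-endian BOM ('\xff\xfe'), plain strings
    # otherwise
    def readlines(self):
        data = self.read()
        if data[:2] == codecs.BOM_LE:
            return unicode(data[2:], "utf-16-le").split("\n")
        return data.split("\n")

f = RegFile("test.reg", "rb")
print f.readlines()
f.close()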

Task 2: Writing out (such) a unicode file. 

My first foolish attempt:

import types

lines = ReadLinesFromAnything("test.reg")
file = open("test.out","wb")
assert type(lines[0]) == types.UnicodeType
file.write(lines[0])
file.close()

gives me the first line, all right, but in ASCII, not UNICODE! The
same result for both

file.write(u"test1")

and 

file.write(unicode("test2","utf-8"))

The data is ASCII, not (the-two-byte-kind-of-)UNICODE. Again, this is
an unexpected result. Next, I tried

file.write(unicode("test3","utf-16"))

which raises the following exception:

File "D:\Scripts\2002\02\read-unicode-lines.py", line 20, in ?
    file.write(unicode("test3","utf-16"))
UnicodeError: UTF-16 decoding error: truncated data

In hindsight that is probably because "test3" is five bytes, an odd
number, which cannot be a valid UTF-16 sequence. When I desperately
try this:

data = unicode(lines[0],"utf-16")
file.write(data)

I get the exception

File "D:\Scripts\2002\02\read-unicode-lines.py", line 20, in ?
    data = unicode(lines[0],"utf-16")
TypeError: decoding Unicode is not supported

which a) supports my belief that exceptions suck, and b) is a stupid
error message, because "decoding Unicode is not supported" is simply
not true as a general statement about Python.

At this point, I'm quite frustrated with the joint union of (the way
Python handles UNICODE, the UNICODE standard, my knowledge of UNICODE,
and the documentation on this in the Python help). The tutorial has a
very brief section on UNICODE strings, but that is of no help. So I
look up the Python Unicode Tutorial at
http://www.reportlab.com/i18n/python_unicode_tutorial.html

Finally I get an idea that seems to work:

file.write(u"Hello".encode("utf-16"))

The hexdump looks OK, too. But when I try to write multiple strings,
I run into trouble again, because each string gets prefixed with the
BOM, not just the file as a whole:

lines = ReadLinesFromAnything("test.reg")
file = open("test.out","wb")
for line in lines:
    file.write(line.encode("utf-16"))
file.close()

gives me a 0xFFFE for each line, which, to me, is an unexpected result
again. The original REGEDIT-generated file had just one, at the
beginning of the file. Of course, I can always join the strings and
write them as one (or let a codecs stream writer handle it, see the
sketch below), but, to sum up my possibly self-righteous complaints,
I feel that there should be

- *much* better support for UNICODE text files in Python
- *much* better documentation on this in Python.
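
That said, for completeness: the least painful way I can think of to
get exactly one BOM per file is to let a codecs stream writer do the
encoding; as far as I can tell the utf-16 StreamWriter emits the BOM
only on its first write (again a sketch, not something I have tested
much):

import codecs

lines = ReadLinesFromAnything("test.reg")
out = codecs.open("test.out", "wb", "utf-16")
for line in lines:
    # the stream writer does the encoding; the BOM should show up
    # only once, at the very start of the file
    out.write(line + u"\n")
out.close()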

So now I already feel much better :)
