Text to unicode

Thu Oct 3 09:02:35 EDT 2002

Max M <maxm at mxm.dk> writes:

> What I mean is that an editor sees them as text files. But characters
> that really should be one special character are represented as two
> *wrong* characters.

No. If it is UTF-8, then those two are not characters, but bytes, and
they are not wrong, but correct. UTF-8 is a multi-byte encoding; there
is nothing wrong with it. Some characters take up to four bytes in
UTF-8.
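
A minimal sketch of the point, assuming a current Python 3 (where
str.encode and bytes.decode take the place of the unicode builtin);
the sample characters are arbitrary and not taken from the files in
question:

```python
# Each character encodes to between one and four bytes in UTF-8.
# Samples: "a", U+00E6 (ae ligature), U+20AC (euro sign), U+1D11E (G clef).
samples = ["a", "\u00e6", "\u20ac", "\U0001d11e"]
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

# An editor that assumes Latin-1 shows the two bytes of a single
# character as two separate characters. The bytes are still correct UTF-8.
misread = "\u00e6".encode("utf-8").decode("latin-1")

print(byte_lengths)
print(repr(misread))
```

The two "wrong" characters are what a single-byte encoding makes of a
correct two-byte UTF-8 sequence.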

> As far as I understand the first two bytes in a unicode file has a
> special value that tells what kind of encoding the text file has. 

Not really. A two-byte signature (the byte order mark, FE FF or FF FE)
indicates UTF-16. In UTF-8, if a signature is used at all, it is three
bytes (EF BB BF).
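
To check whether a file starts with such a signature, something like
the following would do. It is only a sketch: detect_signature is a
hypothetical helper name, the code assumes Python 3, and it ignores
UTF-32 (whose little-endian signature also begins with FF FE).

```python
import codecs

def detect_signature(data):
    """Return a label for a leading byte order mark, or None if absent."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF, three bytes
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE, two bytes
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF, two bytes
        return "utf-16-be"
    return None                                 # no signature present

print(detect_signature(b"\xef\xbb\xbfhello"))
print(detect_signature(b"hello"))
```

A result of None says nothing about the encoding; it only means the
file carries no signature.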

> My guess is that my files are missing these two bytes so that my
> editor and Python believe it to be a text file.

If, by "Python", you mean the unicode builtin, then no - the UTF-8
codec does not require a signature.
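
As an illustration, assuming Python 3 and made-up sample bytes:
decoding succeeds without any signature. If a signature is present,
the plain "utf-8" codec keeps it as a leading U+FEFF character, while
the "utf-8-sig" codec in current Python versions strips it.

```python
import codecs

plain = b"bl\xc3\xa5b\xc3\xa6r"         # UTF-8 bytes, no signature
text = plain.decode("utf-8")            # decodes fine without a signature

with_sig = codecs.BOM_UTF8 + plain      # same bytes with a UTF-8 signature
kept = with_sig.decode("utf-8")         # signature survives as U+FEFF
stripped = with_sig.decode("utf-8-sig") # signature removed while decoding

print(repr(text), repr(kept), repr(stripped))
```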

> content = unicode(f.read(), 'utf-8')
>  >>> UnicodeError: UTF-8 decoding error: unexpected code byte

That is supposed to work. If it doesn't, the file also contains data
that is not valid UTF-8, e.g. because it mixes encodings.

If you cannot find out what the error is, I recommend reading the file
line by line and converting each line with the unicode function. If a
line gives a UnicodeError, print a Python representation of that line,
and post it here.
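
The line-by-line check could be sketched as follows. This assumes
Python 3, where bytes.decode replaces the unicode function and the
error raised is UnicodeDecodeError (a subclass of UnicodeError);
find_undecodable_lines is a hypothetical name chosen for illustration.

```python
def find_undecodable_lines(lines, encoding="utf-8"):
    """Return (line number, raw bytes) for each line that fails to decode.

    `lines` is any iterable of bytes objects, for example a file
    opened in binary mode.
    """
    bad = []
    for lineno, raw in enumerate(lines, 1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            bad.append((lineno, raw))
    return bad

# Sample data: the second line is Latin-1, not UTF-8.
sample = [b"plain ascii\n", b"caf\xe9\n", b"bl\xc3\xa5b\xc3\xa6r\n"]
for lineno, raw in find_undecodable_lines(sample):
    print(lineno, repr(raw))    # repr shows the offending bytes exactly
```

For a real file, pass open(path, "rb") as the lines argument; the
printed repr of each failing line is what is worth posting.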

Regards,
Martin