Text to unicode
Martin v. Loewis
martin at v.loewis.de
Thu Oct 3 09:02:35 EDT 2002
Max M <maxm at mxm.dk> writes:
> What I mean is that an editor sees them as text files. But characters
> that really should be one special character are represented as two
> *wrong* characters.
No. If it is UTF-8, then those two are not characters, but bytes, and
they are not wrong, but correct. UTF-8 is a multi-byte encoding; there
is nothing wrong with it. Some characters take up to four bytes in
UTF-8.
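To make this concrete, here is a small sketch of my own. It assumes a current Python 3 interpreter, where bytes.decode takes the place of the unicode builtin, and the sample characters are picked only for illustration; they are not from your files. It shows one character becoming two bytes, and how those bytes appear as two "wrong" characters when something reads them as Latin-1:

```python
# Illustration only: the sample characters are arbitrary choices.
text = "\u00e9"                      # LATIN SMALL LETTER E WITH ACUTE
data = text.encode("utf-8")          # the bytes actually stored in the file
print(len(data), repr(data))         # one character, two bytes

# An editor that assumes Latin-1 shows each byte as its own character.
misread = data.decode("latin-1")
print(repr(misread))

# Decoding with the right codec gives the single character back.
roundtrip = data.decode("utf-8")
print(roundtrip == text)

# Other characters need more bytes.
euro = "\u20ac".encode("utf-8")          # EURO SIGN
beyond_bmp = "\U00010000".encode("utf-8")  # first code point outside the BMP
print(len(euro), len(beyond_bmp))
```

The bytes are the same in both cases; only the interpretation differs.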
> As far as I understand the first two bytes in a unicode file has a
> special value that tells what kind of encoding the text file has.
Not really. Those two bytes (FE FF or FF FE, the byte order mark) are
used to indicate UTF-16. In UTF-8, if a signature is used, it is three
bytes (EF BB BF).
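If you want to check which signature, if any, a file starts with, something like the following might do. This is only a sketch for a current Python 3; detect_signature is a name I made up, while the BOM constants come from the standard codecs module:

```python
import codecs


def detect_signature(data):
    """Return a label for a leading Unicode signature, or None.

    Hypothetical helper: it only knows the UTF-8 and UTF-16 signatures.
    A UTF-32 little-endian file also starts with FF FE and would be
    reported as UTF-16 here.
    """
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be"
    return None


print(detect_signature(b"\xef\xbb\xbfhello"))
print(detect_signature(b"hello"))
```

In practice you would pass it the first few bytes read from the file in binary mode.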
> My guess is that my files are missing these two bytes so that my
> editor and Python believe it to be a text file.
If, by "Python", you mean the unicode builtin, then no - the UTF-8
codec does not require a signature.
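A quick way to convince yourself, sketched for a current Python 3 (the sample word is my own). Note that the 'utf-8-sig' codec used at the end was added to Python after this discussion, so treat that line as an assumption about your version:

```python
# Illustration only: sample text chosen arbitrarily.
plain = "sm\u00f8rrebr\u00f8d".encode("utf-8")   # no signature
signed = b"\xef\xbb\xbf" + plain                 # same text with a signature

# The plain UTF-8 codec decodes unsigned data without complaint.
decoded_plain = plain.decode("utf-8")

# With a signature present, the plain codec keeps it as U+FEFF.
decoded_signed = signed.decode("utf-8")

# The 'utf-8-sig' codec strips the signature if there is one.
decoded_sig = signed.decode("utf-8-sig")

print(repr(decoded_plain))
print(repr(decoded_signed))
print(repr(decoded_sig))
```

So a missing signature cannot be the cause of a decoding error.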
> content = unicode(f.read(), 'utf-8')
> >>> UnicodeError: UTF-8 decoding error: unexpected code byte
That is supposed to work. If it doesn't, it means the file also
contains non-UTF-8 data, e.g. parts written in a different encoding.
If you cannot find out what the error is, I recommend reading the file
line by line and converting each line with the unicode function. If a
line raises a UnicodeError, print a Python representation (repr) of
that line, and post it here.
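A minimal sketch of that procedure, assuming a current Python 3 where each line is a bytes object and line.decode('utf-8') stands in for unicode(line, 'utf-8'). The names find_bad_lines and report, and the file name in the usage note, are made up for this example:

```python
def find_bad_lines(lines, encoding="utf-8"):
    """Return (line_number, raw_line) for every line that fails to decode.

    `lines` is any iterable of bytes objects, for example a file opened
    in binary mode. Hypothetical helper, not part of any library.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


def report(path, encoding="utf-8"):
    """Print the repr of each undecodable line in the file at `path`."""
    with open(path, "rb") as f:          # binary mode: we want raw bytes
        for number, line in find_bad_lines(f, encoding):
            print(number, repr(line))
```

Calling report("myfile.txt") (a placeholder name) would then print only the offending lines, in a form that is safe to paste into a mail.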
Regards,
Martin