Unicode conversion

Martin v. Löwis loewis at informatik.hu-berlin.de
Thu Oct 3 13:17:08 EDT 2002


"Edward K. Ream" <edream at tds.net> writes:

> # Tk always uses utf-8 encoding.

You may get that impression, but it is slightly wrong. It is more
reliable if you pass Unicode strings to Tk, instead of UTF-8 encoded
byte strings.

For a byte string, Tk will guess the encoding. If it looks like UTF-8,
it is taken treated UTF-8. Otherwise, it is treated as the locale's
encoding. Unfortunately, if you ever manage to mix the two, you get
byte salad that you can't ever chew. By using Unicode strings to
interface with Tk only, you can avoid those problems.

> print `s`,"tk"
> s = s.encode("utf-8") # result is a string.
> print `s`,"utf-8"
> s = s.decode(xml_encoding) # result is unicode.
> s = s.encode(xml_encoding) # result is a string.

Since xml_encoding is iso-8859-1, you are making a mistake here. You
have UTF-8 data, but you are decoding them as Latin-1. This will
succeed, but it will give an incorrect result. It will succeed since
iso-8859-1 is an single-byte code where every byte value is valid.
That means an arbitrary byte sequence can be interpreted as Latin-1,
but for many byte sequences, the resulting string is non-sense
(mojibake, as the Japanese say).

> BTW, with out the first encode/decode pair I can take exceptions in
> the last encode.

Nevertheless, this is the correct processing. If you have a Unicode
object, as originally obtained from Tk, you should encode as Latin-1
using

s = s.encode("iso-8859-1")

Now, for this specific string (u'a\u0102\xdf\xc9\n'), you will get a
Unicode error. The reason is that one character (\u0102) is not
supported in Latin-1 - this encoding supports only the first 256
Unicode characters.

So, when saving this as XML, the proper representation would be

'a&#x102;\xdf\xc9\n'

i.e. you'll have to use a character reference. Python 2.2 does not
support generating such text very well - you'll have to catch the
Unicode error yourself, find the offending character, encode it as a
character reference, and encode all other characters as requested.

Alternatively, you can refuse encoding a document in a certain
encoding (such as Latin-1), and fall back to UTF-8.

PEP 293 (http://www.python.org/peps/pep-0293.html) will provide a
mechanism to generate character references more conveniently - in
Python 2.3, you can specify

  s.encode('iso-8859-1',errors='xmlcharrefreplace')

HTH,
Martin



More information about the Python-list mailing list