inserting Unicode character in dictionary - Python

"Martin v. Löwis" martin at v.loewis.de
Sat Oct 18 03:20:37 EDT 2008


> 2. Exactly what Unicode you get would be dependent on Python properly
> interpreting the bytes in the source file -- which you can make it do by
> adding something like "-*- coding: utf-8 -*-" in a comment at the top of
> the file.

That depends on the Python version. Up to (and including) 2.4, the bytes
on disk were interpreted as Latin-1 in the absence of an encoding
declaration. In 2.5, non-ASCII bytes in a file without an encoding
declaration are an error. In 3.x, in the absence of an encoding
declaration, the bytes are interpreted as UTF-8 (giving an error when
ill-formed UTF-8 sequences are encountered).
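
A minimal sketch of why the default matters, written for Python 3. The
sample bytes are chosen purely for illustration; the same bytes on disk
give different text depending on which encoding is assumed:

```python
# Two bytes as they might sit on disk: the UTF-8 encoding of "ö".
raw = b"\xc3\xb6"

as_latin1 = raw.decode("latin-1")  # each byte becomes one character
as_utf8 = raw.decode("utf-8")      # the two bytes form one character

print(len(as_latin1), len(as_utf8))

# A lone 0xF6 byte is "ö" in Latin-1 but is ill-formed as UTF-8.
lone = b"\xf6"
try:
    lone.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(utf8_ok)
```

Latin-1 can never fail, since every byte value maps to a character, so
a wrong guess goes unnoticed; UTF-8 at least reports bytes that cannot
be UTF-8.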

> 3. Without the "u" prefix, you'll have some 8-bit string, whose
> interpretation is... er... here's where I get a bit fuzzy.  What if your
> source file is set to utf-8?

You also need to distinguish between the declared encoding and the
encoding the editor actually uses. Some editors (like Emacs or IDLE)
honor the declaration; others may not. What you see on the display is
the editor's interpretation; what Python uses is the declared encoding.
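
As a rough illustration of such a mismatch (Python 3; the file contents
below are hypothetical), suppose the editor saved "ö" as UTF-8 while the
declaration claims Latin-1. The standard tokenize.detect_encoding()
function reports the encoding Python would use:

```python
import io
import tokenize

# Hypothetical source file: "ö" was saved as UTF-8 (two bytes),
# but the declaration on the first line says Latin-1.
source = b"# -*- coding: latin-1 -*-\ns = '\xc3\xb6'\n"

# detect_encoding() reads the declaration the same way the compiler does.
declared, _ = tokenize.detect_encoding(io.BytesIO(source).readline)

# Decoding with the declared encoding yields two characters, not one.
text = source.decode(declared)

print(declared)
print(text.splitlines()[1])
```

The editor would display one character, but Python sees two, because it
trusts the declaration and not the editor.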

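However, Python 2.x uses the declared encoding only to decode Unicode
literals; the bytes of a plain (8-bit) string literal are kept exactly
as they appear in the source file.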

> Do you then have a proper UTF-8 string,
> but the problem is that none of the standard Python library methods know
> how to properly interpret UTF-8?

There is (probably) no such thing as a "proper UTF-8 string" (in the
sense in which you probably mean it). Python doesn't have a data type
for "UTF-8 string". It only has a data type "byte string". It's up to
the application whether it gets interpreted in a consistent manner.
Libraries are (typically) encoding-agnostic, i.e. they work for UTF-8
encoded strings the same way as for, say, Big-5 encoded strings.
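
A small sketch of what "encoding-agnostic" means in practice, using
Python 3's bytes type as the byte string. The sample character is an
arbitrary choice for illustration:

```python
text = "\u4e2d"  # the single character 中

as_utf8 = text.encode("utf-8")
as_big5 = text.encode("big5")

# len() counts bytes; it has no idea which encoding produced them.
print(len(text), len(as_utf8), len(as_big5))

# Only the application knows which codec turns the bytes back into text.
roundtrip_ok = (as_utf8.decode("utf-8") == text
                and as_big5.decode("big5") == text)
print(roundtrip_ok)
```

Neither byte string carries any record of its encoding, so keeping the
interpretation consistent is entirely the application's job.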

> 4. In Python 3.0, this silliness goes away, because all strings are
> Unicode by default.

You still need to make sure that the editor's encoding and the declared
encoding match.

Regards,
Martin


