Degree symbol (UTF-8 > ASCII)

Martin v. Löwis martin at v.loewis.de
Fri Apr 18 04:45:28 EDT 2003


pc451 at yahoo.com (Peter Clark) writes:

> > And scale is a Unicode string, right?
>     Only because the XML document has no specified encoding, so it
> defaults to UTF-8, yes. But all the text is straight ASCII, except of
> course for the inclusion of the degree symbol.

No. UTF-8 is *not* Unicode. In Python, there are two data types: <type
'string'>, and <type 'unicode'>. The type string represents bytes (8
bits per element), and the type unicode represents characters.

The string type can also be used to represent characters, but only if
you assume that you are using some encoding. UTF-8 is an encoding, and
so is Latin-1. A string encoded in UTF-8 is still a byte string, not a
character string. A Unicode object may contain characters that can be
encoded in ASCII, or it can contain characters that cannot be encoded
in ASCII.
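
To make the distinction concrete, here is a minimal sketch using the
degree symbol. It assumes a current Python 3 interpreter, where the two
types are named bytes and str rather than string and unicode; the helper
name is_ascii_encodable is made up for illustration.

```python
# Python 3 naming assumed: bytes is the byte string type,
# str is the character (Unicode) string type.
degree = "\u00b0"                    # DEGREE SIGN, one character

# The same character becomes different bytes under different encodings.
utf8 = degree.encode("utf-8")        # b'\xc2\xb0' (two bytes)
latin1 = degree.encode("latin-1")    # b'\xb0' (one byte)

assert isinstance(degree, str) and len(degree) == 1
assert isinstance(utf8, bytes) and len(utf8) == 2
assert isinstance(latin1, bytes) and len(latin1) == 1

# Bytes only turn back into characters if you name the encoding.
assert utf8.decode("utf-8") == degree
assert latin1.decode("latin-1") == degree


def is_ascii_encodable(text):
    """Hypothetical helper: True if every character fits in ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# A character string may or may not be representable in ASCII.
ascii_only = is_ascii_encodable("Scale: 20 C")                 # True
with_degree = is_ascii_encodable("Scale: 20" + degree + "C")   # False
```

The byte strings differ in length and content even though both stand for
the same single character, which is why a byte string by itself says
nothing about which characters it holds.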

If this is not the mental model that you have, you will have a hard
time understanding all the phenomena you observe, and I suggest
reading

http://manatee.mojam.com/~skip/unicode/unicode/

Regards,
Martin



