inserting Unicode character in dictionary - Python
Joe Strout
joe at strout.net
Sun Oct 19 08:57:43 EDT 2008
On Oct 18, 2008, at 1:20 AM, Martin v. Löwis wrote:
>> Do you then have a proper UTF-8 string,
>> but the problem is that none of the standard Python library methods know
>> how to properly interpret UTF-8?
>
> There is (probably) no such thing as a "proper UTF-8 string" (in the
> sense in which you probably mean it).
To be clear, I mean a string that is valid UTF-8 (not all strings of
bytes are, of course).
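
One way to test whether a byte string is valid UTF-8 is simply to try decoding it. The following is a minimal sketch, assuming Python 3 syntax; the helper name is_valid_utf8 is made up for illustration and is not part of the standard library.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


valid = "Löwis".encode("utf-8")   # the 'ö' becomes the two bytes C3 B6
invalid = b"L\xf6wis"             # Latin-1 encoding of the same name

print(is_valid_utf8(valid))
print(is_valid_utf8(invalid))
```

The second byte string is rejected because 0xF6 can never start a well-formed UTF-8 sequence.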
> Python doesn't have a data type
> for "UTF-8 string". It only has a data type "byte string". It's up to
> the application whether it gets interpreted in a consistent manner.
> Libraries are (typically) encoding-agnostic, i.e. they work for UTF-8
> encoded strings the same way as for, say, Big-5 encoded strings.
Oi -- so if I ask for length, I get the number of bytes, not the
number of characters. If I slice and dice, I could end up splitting
characters in half. It is, as you say, just a string of bytes, not a
string of characters.
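
Both effects are easy to reproduce. The sketch below assumes Python 3, where the bytes type plays the role that the plain byte string plays in Python 2.x; the variable names are illustrative only.

```python
text = "Löwis"                 # five characters of text
data = text.encode("utf-8")    # the same text as UTF-8 bytes

char_count = len(text)         # counts characters
byte_count = len(data)         # counts bytes; 'ö' occupies two of them

# Slicing the byte string after its second byte cuts 'ö' in half,
# leaving a fragment that is no longer valid UTF-8.
fragment = data[:2]
try:
    fragment.decode("utf-8")
    fragment_is_valid = True
except UnicodeDecodeError:
    fragment_is_valid = False

print(char_count, byte_count, fragment_is_valid)
```

Slicing the text object instead (text[:2]) always lands on a character boundary, which is the behaviour a string of characters should have.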
>> 4. In Python 3.0, this silliness goes away, because all strings are
>> Unicode by default.
>
> You still need to make sure that the editor's encoding and the declared
> encoding match.
Well, if no encoding is declared, it (quite sensibly) assumes
UTF-8, so for my purposes this boils down to using a UTF-8 editor --
which I always do anyway. But do I still have to put a "u" before my
string literals in order to have it treated as characters rather than
bytes?
I'm hoping that the answer is "no" -- most string literals in a source
file are text (which should be Unicode text, these days); a raw byte
string would be the exceptional case, and I'd be happy to use the "r"
prefix for those.
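
For reference, here is a short sketch of how the literal prefixes behave in Python 3. Unprefixed literals are already Unicode text, byte strings take a "b" prefix, and "r" marks a raw literal (backslashes are kept as-is) rather than a byte string. The sample values are chosen only for illustration.

```python
text = "Löwis"             # plain literal: Unicode text, no "u" needed
data = b"L\xc3\xb6wis"     # "b" prefix: a byte string (here, UTF-8 bytes)
raw = r"C:\new\table"      # "r" prefix: raw text, backslashes not interpreted

print(type(text).__name__)   # str
print(type(data).__name__)   # bytes
print(type(raw).__name__)    # str

# Decoding the bytes with the right encoding yields the same text.
print(data.decode("utf-8") == text)
```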
Best,
- Joe