How do I display unicode value stored in a string variable using ord()

Chris Angelico rosuav at gmail.com
Mon Aug 20 00:07:39 EDT 2012


On Mon, Aug 20, 2012 at 10:35 AM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 8/19/2012 6:42 PM, Chris Angelico wrote:
>> However, Python goes a bit further by making it VERY clear that this
>> is a mere optimization, and that Unicode strings and bytes strings are
>> completely different beasts. In Pike, it's possible to forget to
>> encode something before (say) writing it to a socket. Everything works
>> fine while you have only ASCII characters in the string, and then
>> breaks when you have a >255 codepoint - or perhaps worse, when you
>> have a 127<x<256, and the other end misinterprets it.
>
> Python writes strings to file objects, including open sockets, without
> creating a bytes object -- IF the file is opened in text mode, which always
> has an associated encoding, even if the default 'ascii'. From what you say,
> this is what Pike is missing.

In text mode, the library does the encoding, but an encoding still happens.
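
A minimal Python 3 sketch of that point, assuming an in-memory binary
stream as a stand-in for a real file or socket (the variable names are
only for illustration):

```python
import io

# BytesIO plays the role of the underlying binary file or socket.
raw = io.BytesIO()

# The text-mode wrapper is what carries the encoding.
text = io.TextIOWrapper(raw, encoding="utf-8")

text.write("\u20ac")   # the caller passes a str, never a bytes object
text.flush()           # push the encoded data down to the binary layer

written = raw.getvalue()
print(written)         # b'\xe2\x82\xac'
```

The caller never sees a bytes object, but three encoded bytes are what
actually reach the underlying stream.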

> I am pretty sure that the obvious optimization has already been done. The
> internal bytes of all-ascii text can safely be sent to a file with ascii (or
> ascii-compatible) encoding without intermediate 'decoding'. I remember
> several patches of that sort. If a string is internally ucs2 and the file is
> declared ucs2 or utf-16 encoding, then again, pairs of bytes can go directly
> (possibly with a byte swap).

Maybe nothing has to change in memory, but there is still a data type
change. A Unicode string cannot be sent over the network as-is; it has
to be encoded to bytes first.
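
As a rough illustration in Python 3, assuming a local socket pair in
place of a real network connection (the names a, b and error are just
for this sketch):

```python
import socket

# Two connected sockets; a local stand-in for a network connection.
a, b = socket.socketpair()
error = None
try:
    try:
        a.sendall("\u20ac")          # passing a str: rejected outright
    except TypeError as exc:
        error = exc

    a.sendall("\u20ac".encode("utf-8"))   # explicit encode: accepted
    received = b.recv(16)
finally:
    a.close()
    b.close()

print(type(error).__name__)   # TypeError
print(received)               # b'\xe2\x82\xac'
```

The socket API accepts only bytes-like objects, so the encoding step
cannot be forgotten silently.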

In Pike, I can take a string like "\x20AC" (or "\u20ac" or
"\U000020ac", same thing) and manipulate it as a one-character string,
but I cannot write it to a file or file-like object. I can, however,
pass it through a codec (and there's string_to_utf8() for the
convenience of the common case), and get back something like
"\xe2\x82\xac", which is a three-byte string. The thing is, though,
that this new string is of exactly the same data type as the original:
'string'. Which means that I could have a string containing Latin-1
but not ASCII characters, and Pike will happily write it to a socket
without raising a compile-time or run-time error. Python, under the
same circumstances, would either raise an error or quietly (and
correctly) encode the data.
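
The Python side of that comparison might look something like this (a
sketch only; the sample strings and variable names are made up for the
example):

```python
euro = "\u20ac"                  # one-character Unicode string
encoded = euro.encode("utf-8")   # three-byte bytes object

# Unlike in Pike, the encoded result is a different data type.
print(type(euro).__name__, len(euro))         # str 1
print(type(encoded).__name__, len(encoded))   # bytes 3

# A string with a Latin-1 but non-ASCII character (U+00E9).
latin = "caf\xe9"
try:
    latin.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True          # Python refuses rather than guessing

as_latin1 = latin.encode("latin-1")   # b'caf\xe9'
as_utf8 = latin.encode("utf-8")       # b'caf\xc3\xa9'
```

The last two lines show why the 127<x<256 case is the nasty one: the
same character becomes different bytes depending on the encoding, so
the choice has to be made explicitly somewhere.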

But this is a relatively trivial point, in the scheme of things.
Python has an excellent model now for handling Unicode strings, and I
would STRONGLY recommend that everyone upgrade to 3.3.

ChrisA


