How do I display unicode value stored in a string variable using ord()

Terry Reedy tjreedy at udel.edu
Sun Aug 19 20:35:30 EDT 2012


On 8/19/2012 6:42 PM, Chris Angelico wrote:
> On Mon, Aug 20, 2012 at 3:34 AM, Terry Reedy <tjreedy at udel.edu> wrote:

>> Python has often copied or borrowed, with adjustments. This time it is the
>> first.

I should have added 'that I know of' ;-)

> Maybe it wasn't consciously borrowed, but whatever innovation is done,
> there's usually an obscure beardless language that did it earlier. :)
>
> Pike has a single string type, which can use the full Unicode range.
> If all codepoints are <256, the string width is 8 (measured in bits);
> if <65536, width is 16; otherwise 32. Using the inbuilt count_memory
> function (similar to the Python function used somewhere earlier in
> this thread, but which I can't at present put my finger to), I find
> that for strings of 16 bytes or more, there's a fixed 20-byte header
> plus the string content, stored in the correct number of bytes. (Pike
> strings, like Python ones, are immutable and do not need expansion
> room.)

It is possible that someone involved was at least vaguely aware that 
there was an antecedent. The PEP makes no such claim that I can see; it 
lays out the problem and goes right to the details of a Python 
implementation.
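
The width switching described for Pike can be observed from Python as 
well. What follows is only a rough sketch: it assumes CPython 3.3 or 
later, relies on sys.getsizeof() reporting an implementation detail, and 
bytes_per_char is a hypothetical helper name, not anything from the PEP.

```python
import sys

def bytes_per_char(ch, n=100, extra=100):
    """Estimate storage per character for a string made of repeated `ch`.

    Subtracting the sizes of two strings of different lengths cancels
    the fixed object header, leaving only the per-character cost.
    """
    small = sys.getsizeof(ch * n)
    large = sys.getsizeof(ch * (n + extra))
    return (large - small) // extra

# One sample character from each range mentioned in the quoted text.
widths = {
    "ascii (< 128)": bytes_per_char("a"),
    "latin-1 (< 256)": bytes_per_char("\xe9"),
    "bmp (< 65536)": bytes_per_char("\u20ac"),
    "astral (>= 65536)": bytes_per_char("\U0001F600"),
}
print(widths)
```

On a CPython build with the flexible representation, the four values 
should come out as 1, 1, 2 and 4 bytes per character. Other 
implementations may report something different.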

> However, Python goes a bit further by making it VERY clear that this
> is a mere optimization, and that Unicode strings and bytes strings are
> completely different beasts. In Pike, it's possible to forget to
> encode something before (say) writing it to a socket. Everything works
> fine while you have only ASCII characters in the string, and then
> breaks when you have a >255 codepoint - or perhaps worse, when you
> have a 127<x<256, and the other end misinterprets it.

Python writes strings to file objects, including open sockets, without 
your having to create a bytes object yourself -- IF the file is opened 
in text mode, which always has an associated encoding, even if it is 
just the default (locale-dependent, possibly 'ascii'). From what you 
say, this is what Pike is missing.
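
A small sketch of that text-mode behavior, using an in-memory 
io.BytesIO as a stand-in for a real file or socket (the variable names 
are illustrative only):

```python
import io

# A text-mode wrapper always carries an encoding; str goes in, bytes come out.
raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
text.write("na\xefve \u20ac")   # a str, never an explicit bytes object
text.flush()
encoded = raw.getvalue()
print(encoded)

# With an 'ascii' encoding, a non-ASCII character fails loudly at write
# time instead of being sent out as an ambiguous byte.
ascii_raw = io.BytesIO()
ascii_text = io.TextIOWrapper(ascii_raw, encoding="ascii", newline="")
try:
    ascii_text.write("na\xefve")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```

The second half is the case from the quoted paragraph: the error shows 
up on the sending side, so the other end never gets a chance to 
misinterpret a 127<x<256 byte.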

I am pretty sure that the obvious optimization has already been done. 
The internal bytes of all-ascii text can safely be sent to a file with 
ascii (or ascii-compatible) encoding without an intermediate 'encoding' 
step. I remember several patches of that sort. If a string is internally 
ucs2 and the file is declared ucs2 or utf-16 encoding, then again, pairs 
of bytes can go directly (possibly with a byte swap).
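
Why such shortcuts are safe can be checked at the Python level. This 
sketch does not look at CPython's internal buffers; it only shows that 
the encoded results coincide, and swap_pairs is a hypothetical helper 
written for the illustration:

```python
# All-ASCII text produces identical bytes under any ASCII-compatible encoding.
ascii_text = "plain ASCII text"
same_bytes = (
    ascii_text.encode("ascii")
    == ascii_text.encode("latin-1")
    == ascii_text.encode("utf-8")
)
print(same_bytes)

def swap_pairs(data):
    """Return `data` with the two bytes of every 16-bit unit exchanged."""
    swapped = bytearray(data)
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)

# Text limited to the BMP: the two UTF-16 byte orders differ only by a swap.
bmp_text = "\u20ac\u03c0"
little = bmp_text.encode("utf-16-le")
big = bmp_text.encode("utf-16-be")
print(swap_pairs(little) == big)
```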


-- 
Terry Jan Reedy



