How do I display unicode value stored in a string variable using ord()

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Aug 19 02:33:28 EDT 2012


On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:

> The change does not just benefit ASCII users.  It primarily benefits
> anybody using a wide unicode build with strings mostly containing only
> BMP characters.

Just to be clear:

If you have many strings which are *mostly* BMP, but have one or two
non-BMP characters in *each* string, you will see no benefit over a wide
build, since each string is stored at the width of its widest character.

But if you have many strings which are all BMP, and only a few strings 
containing non-BMP characters, then you will see a big benefit.
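
A rough way to see the difference, assuming CPython 3.3 or later. The 
sample strings are made up for illustration, and the exact byte counts 
from sys.getsizeof() are an implementation detail that varies between 
versions, so treat this as a sketch rather than a benchmark:

```python
import sys

# Sample strings, invented for illustration.
all_bmp = "\u0416" * 1000                    # 1000 BMP characters
mostly_bmp = "\u0416" * 999 + "\U0001F600"   # 999 BMP plus one non-BMP

# CPython 3.3+ stores each string at the width of its widest character,
# so one non-BMP character widens every character in that string.
size_all_bmp = sys.getsizeof(all_bmp)
size_mostly_bmp = sys.getsizeof(mostly_bmp)

print(len(all_bmp), size_all_bmp)
print(len(mostly_bmp), size_mostly_bmp)
```

Both strings have the same length, but the second should report roughly 
twice the memory.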


> Even for narrow build users, there is the benefit that
> with approximately the same amount of memory usage in most cases, they
> no longer have to worry about non-BMP characters sneaking in and
> breaking their code.

Yes! +1000 on that.
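
For anyone who hasn't been bitten by this, here is a small sketch of the 
behaviour, assuming Python 3.3 or later. The emoji is an arbitrary choice 
of non-BMP character:

```python
ch = "\U0001F600"   # one non-BMP character, chosen arbitrarily

# On Python 3.3+ this is a single character with a single code point.
length = len(ch)
code_point = ord(ch)

# In UTF-16 it takes two code units (a surrogate pair). Narrow builds
# stored strings that way, which is why len() reported 2 there.
utf16_units = len(ch.encode("utf-16-le")) // 2

print(length, hex(code_point), utf16_units)
```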


> There is some additional benefit for Latin-1 users, but this has nothing
> to do with Python.  If Python is going to have the option of a 1-byte
> representation (and as long as we have the flexible representation, I
> can see no reason not to), 

The PEP explicitly states that it only uses a 1-byte format for ASCII 
strings, not Latin-1:

"ASCII-only Unicode strings will again use only one byte per character"

and later:

"If the maximum character is less than 128, they use the PyASCIIObject 
structure"

and:

"The data and utf8 pointers point to the same memory if the string uses 
only ASCII characters (using only Latin-1 is not sufficient)."


> then it is going to be Latin-1 by definition,

Certainly not, either in fact or in principle. There are a large number 
of 1-byte encodings; Latin-1 is hardly the only one.
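
To illustrate with just two of them (the byte value is picked because 
these two encodings happen to disagree on it, and the variable names are 
mine):

```python
raw = b"\xa4"   # a single byte

as_latin1 = raw.decode("latin-1")       # ISO-8859-1: currency sign
as_latin9 = raw.decode("iso-8859-15")   # ISO-8859-15: euro sign

# Same byte, two different code points.
print(hex(ord(as_latin1)), hex(ord(as_latin9)))
```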


> because that's what 1-byte Unicode (UCS-1, if you will) is.  If you have
> an issue with that, take it up with the designers of Unicode.

The designers of Unicode have never created a standard "1-byte Unicode" 
or UCS-1, as far as I can determine.

The Unicode standard defines over a million code points (U+0000 through 
U+10FFFF), far too many to fit in a single byte. There is some historical 
justification for using "Unicode" to mean UCS-2, but with the standard 
being extended beyond the BMP, that is no longer valid.
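
The arithmetic is easy to check from the interpreter. This assumes 
Python 3.3 or later, where sys.maxunicode is always 0x10FFFF (older 
narrow builds reported 0xFFFF):

```python
import sys

# sys.maxunicode is the largest code point the interpreter supports.
total_code_points = sys.maxunicode + 1

# Number of distinct values that fit in one byte and in two bytes.
one_byte_values = 2 ** 8
two_byte_values = 2 ** 16

print(total_code_points, one_byte_values, two_byte_values)
```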

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more details.


I think what you are trying to say is that the Unicode designers 
deliberately matched the Latin-1 standard for Unicode's first 256 code 
points. That's not the same thing though: there is no Unicode standard 
mapping to a single-byte format.
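
The match between Latin-1 and the first 256 code points can be checked 
with a short sketch like this one (the variable names are mine, not from 
any standard):

```python
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Each Latin-1 byte value decodes to the code point with the same number.
matches_code_points = all(ord(ch) == i for i, ch in enumerate(decoded))

# Anything past U+00FF has no Latin-1 byte at all.
try:
    "\u0100".encode("latin-1")
    encodable_beyond_255 = True
except UnicodeEncodeError:
    encodable_beyond_255 = False

print(matches_code_points, encodable_beyond_255)
```

So Latin-1 covers the first 256 code points and nothing else.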


-- 
Steven


