How do I display unicode value stored in a string variable using ord()

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Aug 18 00:10:30 EDT 2012


On Fri, 17 Aug 2012 23:30:22 -0400, Dave Angel wrote:

> On 08/17/2012 08:21 PM, Ian Kelly wrote:
>> On Aug 17, 2012 2:58 PM, "Dave Angel" <d at davea.name> wrote:
>>> The internal coding described in PEP 393 has nothing to do with
>>> latin-1 encoding.
>> It certainly does. PEP 393 provides for Unicode strings to be
>> represented internally as any of Latin-1, UCS-2, or UCS-4, whichever is
>> smallest and sufficient to contain the data. 

Unicode strings are not represented as Latin-1 internally. Latin-1 is a 
byte encoding, not a Unicode internal format. Perhaps you mean to say 
that they are represented in a single-byte format?

>> I understand the complaint
>> to be that while the change is great for strings that happen to fit in
>> Latin-1, it is less efficient than previous versions for strings that
>> do not.
> 
> That's not the way I interpreted PEP 393.  It takes a pure unicode
> string, finds the largest code point in that string, and chooses 1, 2 or
> 4 bytes for every character, based on how many bits it'd take for that
> largest code point.

That's how I interpret it too.
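
A minimal sketch of that selection rule, for illustration only. The
function name pep393_width is hypothetical, and this mimics only the
width choice described above, not CPython's actual implementation. It
assumes a Python where iterating a str yields whole code points (3.3 or
later).

```python
def pep393_width(s):
    """Return the bytes per character a PEP 393-style layout would pick for s."""
    # Find the largest code point in the string (0 for an empty string).
    largest = max(map(ord, s), default=0)
    if largest < 0x100:      # fits in one byte
        return 1
    if largest < 0x10000:    # fits in two bytes (Basic Multilingual Plane)
        return 2
    return 4                 # needs the full four bytes


print(pep393_width("abc"))          # 1
print(pep393_width("caf\u00e9"))    # 1, since U+00E9 is below 256
print(pep393_width("\u20ac"))       # 2, the euro sign is U+20AC
print(pep393_width("\U0001F600"))   # 4, an astral code point
```

Note that a single wide character sets the width for the whole string.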


> Further, I read it to mean that only 00 bytes would
> be dropped in the process, no other bytes would be changed.

Just to clarify, you aren't talking about the \0 character, but only about 
extraneous "padding" 00 bytes.
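
The distinction can be sketched by starting from a 4-byte-per-character
form and dropping only the high-order zero bytes. The helper name
drop_padding is hypothetical, and the "utf-32-le" codec is used here
merely as a convenient stand-in for a 4-byte layout. It is not a claim
about how CPython stores strings.

```python
def drop_padding(s, width):
    """Narrow each 4-byte code point of s to `width` bytes, dropping only zero padding."""
    ucs4 = s.encode("utf-32-le")  # 4 bytes per code point, low byte first, no BOM
    units = [ucs4[i:i + 4] for i in range(0, len(ucs4), 4)]
    # Refuse to narrow if any byte that would be dropped is non-zero.
    if any(any(unit[width:]) for unit in units):
        raise ValueError("string does not fit in %d byte(s) per character" % width)
    return b"".join(unit[:width] for unit in units)


# A real \0 character survives: only the padding bytes are removed.
print(drop_padding("a\0b", 1))        # b'a\x00b'
print(drop_padding("caf\u00e9", 1))   # b'caf\xe9'
```

For a string whose code points are all below 256, the resulting bytes
coincide with what s.encode("latin-1") produces, and for a string
limited to the BMP the 2-byte result matches s.encode("utf-16-le").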


> I also figure this is going to be more space efficient than Python 3.2
> for any string which had a max code point of 65535 or less (in Windows),
> or 4 billion or less (in real systems).  So unless French has code points
> over 64k, I can't figure that anything is lost.

I think that on narrow builds, it won't make terribly much difference. 
The big savings are for wide builds.
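
A rough back-of-the-envelope comparison, as a sketch. The function name
payload_bytes is hypothetical. It counts only the character data and
ignores object headers, so the numbers are not what sys.getsizeof would
report. It also assumes a Python where iterating a str yields whole code
points (3.3 or later).

```python
def payload_bytes(s):
    """Estimate character-data bytes for s under three layouts (headers ignored)."""
    largest = max(map(ord, s), default=0)
    # PEP 393-style width: 1, 2 or 4 bytes, chosen by the largest code point.
    width = 1 if largest < 0x100 else 2 if largest < 0x10000 else 4
    # Narrow build: 2-byte UTF-16 code units; astral characters take two units.
    narrow_units = sum(2 if ord(c) > 0xFFFF else 1 for c in s)
    return {
        "narrow": 2 * narrow_units,   # Python 3.2 narrow build
        "wide": 4 * len(s),           # Python 3.2 wide build
        "pep393": width * len(s),     # flexible representation
    }


print(payload_bytes("hello"))        # {'narrow': 10, 'wide': 20, 'pep393': 5}
print(payload_bytes("\u20ac100"))    # {'narrow': 8, 'wide': 16, 'pep393': 8}
```

Under these assumptions, the saving relative to a wide build is up to a
factor of four, while relative to a narrow build it is at most a factor
of two.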


-- 
Steven


