How do I display unicode value stored in a string variable using ord()

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Aug 17 23:59:47 EDT 2012


On Fri, 17 Aug 2012 11:45:02 -0700, wxjmfauth wrote:

> On Friday, 17 August 2012 20:21:34 UTC+2, Jerry Hill wrote:
>> On Fri, Aug 17, 2012 at 1:49 PM,  <wxjmfauth at gmail.com> wrote:
>> 
>> > The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
>> > is one of these characters existing in the cp1252, mac-roman
>> > coding schemes and not in iso-8859-1 (latin-1) and obviously
>> > not in ascii. It causes Py3.3 to work a few 100% slower
>> > than Py<3.3 versions due to the flexible string representation
>> > (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
[...]
> Sorry, you missed the point.
> 
> My comment had nothing to do with the code source coding, the coding of
> a Python "string" in the code source or with the display of a Python3
> <str>.
> I wrote about the *internal* Python "coding", the way Python keeps
> "strings" in memory. See PEP 393.


The PEP does not support your claim that flexible string storage is 100% 
to 1000% slower. It claims 1% - 30% slowdown, with a saving of up to 60% 
of the memory used for strings.

I don't really understand what message you are trying to give here. Are 
you saying that PEP 393 is a good thing or a bad thing?

In Python 1.5 and earlier, there was no support for Unicode at all. You 
could only work with pure byte strings. Support for non-ASCII characters 
like … ∞ é ñ
£ π Ж ش was purely by accident -- if your terminal happened to be set to 
an encoding that supported a character, and you happened to use the 
appropriate byte value, you might see the character you wanted.
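
To illustrate what "the appropriate byte value" means, here is a small 
sketch in Python 3 using the standard codecs. The choice of encodings 
follows the ones named in the quoted text; the variable names are mine 
and purely illustrative.

```python
ELLIPSIS = "\u2026"  # HORIZONTAL ELLIPSIS

# Which legacy encodings can represent the character at all?
for codec in ("cp1252", "mac_roman", "latin-1", "ascii"):
    try:
        print(codec, ELLIPSIS.encode(codec))
    except UnicodeEncodeError:
        print(codec, "cannot represent this character")

# The same byte means different things under different encodings,
# so a byte string only "worked" if the terminal agreed with you.
raw = b"\x85"
print(repr(raw.decode("cp1252")))     # the ellipsis
print(repr(raw.decode("mac_roman")))  # a different letter entirely
print(repr(raw.decode("latin-1")))    # a C1 control character
```

With a byte string, nothing records which of these interpretations was 
intended; that is left to whatever displays the bytes.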

In Python 2.0, Python gained support for Unicode. You could now guarantee 
support for any Unicode character in the Basic Multilingual Plane (BMP) 
by writing your strings using the u"..." style. In Python 3, you no 
longer need the leading u: all strings are Unicode.

But there is a problem: if your Python interpreter is a "narrow build", 
it *only* supports Unicode characters in the BMP. When Python is a "wide 
build", compiled with support for the additional character planes, then 
strings take much more memory, even if they are in the BMP, or are simple 
ASCII strings.
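
A rough way to see the narrow-build limitation from a current 
interpreter is to count UTF-16 code units, since narrow builds stored 
strings as 16-bit units and needed a surrogate pair for anything outside 
the BMP. This is only a sketch: utf16_code_units is a helper name I made 
up, and it imitates what len() would have reported on a narrow build 
rather than running on one.

```python
import sys

def utf16_code_units(s):
    """Count the 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2

bmp = "\u2026"         # HORIZONTAL ELLIPSIS, inside the BMP
astral = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

# 0x10ffff on Python 3.3+; old narrow builds reported 0xffff here.
print(hex(sys.maxunicode))

# A BMP character is one unit either way.
print(len(bmp), utf16_code_units(bmp))

# An astral character is one character on 3.3+, but two code units
# (a surrogate pair) in the narrow-build style of storage.
print(len(astral), utf16_code_units(astral))
```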

PEP 393 fixes this problem and gets rid of the distinction between narrow 
and wide builds. From Python 3.3 onwards, all builds of Python will have 
the same support for Unicode, rather than most being BMP-only. Each 
individual string's internal storage will use only as many bytes per 
character as needed to store the largest character in the string.
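
You can observe the per-string storage width with sys.getsizeof. The 
sketch below assumes CPython 3.3 or later; the sizes are an 
implementation detail, not a language guarantee, and bytes_per_char is 
just an illustrative helper. It compares two strings of different 
lengths so that the fixed object header cancels out.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character for a string made only of ch.

    Subtracting the sizes of two strings cancels the fixed header,
    leaving only the cost of the n extra characters.
    """
    longer = sys.getsizeof(ch * (2 * n))
    shorter = sys.getsizeof(ch * n)
    return (longer - shorter) // n

print(bytes_per_char("a"))           # ASCII: 1 on CPython 3.3+
print(bytes_per_char("\xe9"))        # Latin-1 (é): 1
print(bytes_per_char("\u2026"))      # BMP (…): 2
print(bytes_per_char("\U0001D11E"))  # outside the BMP: 4
```

Note that the width is chosen per string: a single wide character in an 
otherwise ASCII string makes the whole string use the wider storage.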

This will save a lot of memory for those whose strings are mostly ASCII 
or Latin-1, with only a few containing wider characters. While the 
increased complexity causes a small slowdown, the increased functionality 
makes it well worthwhile.



-- 
Steven


