How do I display unicode value stored in a string variable using ord()

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Aug 18 08:27:23 EDT 2012


On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:

>>>> sys.version
> '3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
>>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
> 37.32762490493721
> timeit.timeit("('ab…' * 10).replace('…', 'œ…')") 0.8158757139801764
> 
>>>> sys.version
> '3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32
> bit (Intel)]'
>>>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
> 61.919225272152346

"imeit"?

It is hard to take your results seriously when you have so obviously 
edited your timing results, not just copied and pasted them.


Here are my results, on my laptop running Debian Linux. First, testing on 
Python 3.2:

steve at runes:~$ python3.2 -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 50.2 usec per loop
steve at runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 45.3 usec per loop
steve at runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 51.3 usec per loop
steve at runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 47.6 usec per loop
steve at runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 45.9 usec per loop
steve at runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 57.5 usec per loop
steve at runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
10000 loops, best of 3: 49.7 usec per loop


As you can see, the timing results are all consistently around 50 
microseconds per loop, regardless of which characters I use, whether they 
are in Latin-1 or not. The differences between one test and another are 
not meaningful.


Now I do them again using Python 3.3:

steve at runes:~$ python3.3 -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 64.3 usec per loop
steve at runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 67.8 usec per loop
steve at runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 66 usec per loop
steve at runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 67.6 usec per loop
steve at runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 68.3 usec per loop
steve at runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 67.9 usec per loop
steve at runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
10000 loops, best of 3: 66.9 usec per loop

The results are all consistently around 67 microseconds. So Python's 
string handling is about 30% slower in the examples shown here.

If you can consistently replicate a 100% to 1000% slowdown in string 
handling, please report it as a performance bug:


http://bugs.python.org/

Don't forget to report your operating system.
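
If you'd rather collect the numbers from a script than from the shell, 
something along these lines (just a sketch; the repeat and number values 
are arbitrary) will also capture the interpreter version and platform 
details worth including in the report:

# Collect comparable timings plus version/platform info for a bug report.
# The repeat/number values here are arbitrary; adjust them to taste.
import platform
import sys
import timeit

print(sys.version)
print(platform.platform())

for stmt in [
    "('abc' * 1000).replace('c', 'de')",
    "('ab\u2026' * 1000).replace('\u2026', '\u0153\u2026')",
]:
    best = min(timeit.repeat(stmt, repeat=5, number=10000))
    print("{0}: {1:.1f} usec per loop".format(stmt, best / 10000 * 1e6))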



> My take of the subject.
> 
> This is a typical Python disease. Do not solve a problem, but find a
> way, a workaround, which is expected to solve a problem and which
> finally solves nothing. As far as I know, to break the "BMP limit", the
> tools are here. They are called utf-8 or ucs-4/utf-32.

The problem with UCS-4 is that every character requires four bytes. 
Every. Single. One.

So under UCS-4, the pure-ASCII string "hello world" takes 44 bytes plus 
the object overhead. Under UCS-2, it takes half that space: 22 bytes, but 
of course UCS-2 can only represent characters in the BMP. A pure ASCII 
string would only take 11 bytes, but we're not going back to pure ASCII.
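
For what it's worth, you can check those figures with the fixed-width 
codecs (utf-32-le and utf-16-le skip the BOM, so the lengths come out 
exact):

# Byte counts for "hello world" under fixed-width encodings.
s = "hello world"
print(len(s.encode("utf-32-le")))   # 44 bytes: 4 bytes per character
print(len(s.encode("utf-16-le")))   # 22 bytes: 2 bytes per character
print(len(s.encode("ascii")))       # 11 bytes: 1 byte per character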

(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters 
using two 16-bit code units, a so-called surrogate pair. This is fragile 
and doesn't work very well, because string-handling methods can split the 
surrogate pair apart, leaving you with an invalid Unicode string. Not 
good.)
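
You can see that failure mode for yourself by slicing UTF-16 code units 
by hand. This is only a simulation of what a narrow build does 
internally, but it shows how cutting at the wrong boundary leaves a lone 
surrogate behind:

# Simulate the narrow-build problem by slicing UTF-16 code units directly.
s = "a\U0001D11Eb"           # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
units = s.encode("utf-16-le")
print(len(s))                # 3 characters
print(len(units) // 2)       # 4 code units: 'a', high surrogate, low surrogate, 'b'
broken = units[:4]           # cut between the two halves of the surrogate pair
print(broken.decode("utf-16-le", "replace"))   # 'a' plus U+FFFD -- the clef is lost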

The difference between 44 bytes and 22 bytes for one little string is not 
very important, but when you double the memory required for every single 
string it becomes huge. Remember that every class, function and method 
has a name, which is a string; every attribute and variable has a name, 
all strings; functions and classes have doc strings, all strings. Strings 
are used everywhere in Python, and doubling the memory needed by Python 
means that it will perform worse.

With PEP 393, each Python string will be stored in the most efficient 
format possible:

- if it only contains characters in the Latin-1 range (which covers all 
of ASCII), it will be stored using 1 byte per character;

- if it only contains characters in the BMP, it will be stored using 
UCS-2 (2 bytes per character);

- if it contains non-BMP characters, the string will be stored using 
UCS-4 (4 bytes per character).
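
If you want to see that for yourself under 3.3, sys.getsizeof makes the 
difference visible. The fixed overhead differs between builds, so compare 
the marginal cost of adding characters rather than the absolute sizes:

import sys

# Approximate per-character storage cost under PEP 393. Build-dependent
# overheads cancel out when we look at the difference between two lengths.
for ch in ("a", "\u20ac", "\U0001D11E"):     # ASCII, BMP, non-BMP
    small = sys.getsizeof(ch * 10)
    big = sys.getsizeof(ch * 1010)
    print("{0!r}: ~{1:.0f} bytes per character".format(ch, (big - small) / 1000.0))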



-- 
Steven


