How do I display unicode value stored in a string variable using ord()

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Aug 19 02:30:54 EDT 2012


On Sat, 18 Aug 2012 11:05:07 -0700, wxjmfauth wrote:

> As I understand (I think) the underlying mechanism, I can only say, it is
> not a surprise that it happens.
> 
> Imagine an editor, I type an "a", internally the text is saved as ascii,
> then I type an "é", the text can only be saved in at least latin-1. Then
> I enter an "€", the text becomes an internal ucs-4 "string". Then remove
> the "€" and so on.

Firstly, that is not what Python does. € is in the BMP (the Basic 
Multilingual Plane), and so is nearly every character you're ever going 
to use, unless you work with rare CJK ideographs or some obscure ancient 
script. NONE of the examples you have shown in your emails have included 
4-byte characters; they have all been ASCII or UCS-2.

You are suffering from a misunderstanding about what is going on and 
misinterpreting what you have seen.


In *both* Python 3.2 and 3.3, both é and € are represented by two bytes. 
That will not change. There is a tiny amount of fixed overhead for 
strings, and that overhead is slightly different between the versions, 
but you'll never notice the difference.

Secondly, how a text editor or word processor chooses to store the text 
that you type is not the same as how Python does it. A text editor is not 
going to be creating a new immutable string after every key press. That 
would be slow, slow, SLOW. The usual way is to keep a buffer for each 
paragraph, and add and remove characters from the buffer.
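
To make the buffer idea concrete, here is a deliberately simplified 
sketch. The ParagraphBuffer class and its method names are hypothetical, 
and real editors tend to use more elaborate structures (gap buffers, 
ropes and so on); the point is only that typing mutates a buffer in 
place instead of building a new string per key press.

```python
class ParagraphBuffer:
    """Hypothetical mutable buffer for one paragraph of text."""

    def __init__(self):
        self._chars = []            # one single-character str per entry

    def type_char(self, ch):
        self._chars.append(ch)      # appends in place, no copy of the text

    def backspace(self):
        if self._chars:
            self._chars.pop()       # removes the last character in place

    def text(self):
        # An immutable str is built only when somebody asks for it.
        return "".join(self._chars)


buf = ParagraphBuffer()
for ch in "aé€":
    buf.type_char(ch)
buf.backspace()                     # delete the "€" again
print(buf.text())                   # prints: aé
```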


> Intuitively I expect there is some kind of slowdown between all these
> "strings" conversion.

Your intuition is wrong. Strings are not converted from ASCII to UCS-2 to 
UCS-4 on the fly; they are converted once, when the string is created.
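
One way to see the chosen width for yourself, assuming CPython 3.3 or 
later, is to measure how much a string grows per extra character. This 
is a rough sketch: bytes_per_char is a made-up helper, and 
sys.getsizeof() reports an implementation detail that other Python 
implementations need not share.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing ch * (n + 1) with ch."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

# An ASCII character, a BMP character, and one outside the BMP.
for label, ch in [("a", "a"), ("€", "€"), ("U+10000", "\U00010000")]:
    print(label, bytes_per_char(ch), "byte(s) per character")
```

On CPython 3.3 and later I would expect 1, 2 and 4 respectively: the 
width is picked from the widest character when the string is built, and 
an existing string is never widened afterwards.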

The tests we ran earlier, e.g.:

('ab…' * 1000).replace('…', 'œ…')

show the *worst possible case* for the new string handling, because all 
we do is create new strings. First we create a string 'ab…', then we 
create another string 'ab…'*1000, then we create two new strings '…' and 
'œ…', and finally we call replace and create yet another new string.

But in real applications, once you have created a string, you don't just 
immediately create a new one and throw the old one away. You likely do 
work with that string:

steve@runes:~$ python3.2 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.41 usec per loop

steve@runes:~$ python3.3 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.29 usec per loop

Once you start doing *real work* with the strings, the overhead of 
deciding whether they should be stored using 1, 2 or 4 bytes begins to 
fade into the noise.
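
If you would rather run the same measurement from inside a script than 
from the shell, the timeit module can do it. This is just a sketch of 
the benchmark above; the loop count is an arbitrary choice, and the 
numbers you get will depend on your machine and Python version.

```python
import timeit

# Same statement as the command-line benchmark above.
stmt = "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"

number = 10000                                   # loops per run (arbitrary)
runs = timeit.repeat(stmt, number=number, repeat=3)
best_usec = min(runs) / number * 1e6             # best of 3, in microseconds
print("%d loops, best of 3: %.3g usec per loop" % (number, best_usec))
```

Run it under each interpreter you want to compare and look at the best 
time, not the average.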


> I tested this flexible representation a few months ago, at the
> first alpha release. This is precisely what I tested: string
> manipulations which force this internal change, and I concluded the
> result is not brilliant. Really, a factor 0.n up to 10.

Like I said, if you really think that there is a significant, repeatable 
slow-down on Windows, report it as a bug.


> Does anybody know a way to get the size of the internal "string" in
> bytes? 

sys.getsizeof(some_string)

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10030
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10038


As I said, there is a *tiny* overhead difference. But identifiers will 
generally be smaller:

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size(size.__name__))"
48
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size(size.__name__))"
34

You can check the object overhead by looking at the size of the empty 
string.
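
For example, something along these lines separates the fixed overhead 
from the character data. It is only a sketch: the exact numbers vary 
between Python versions and between 32-bit and 64-bit builds, and the 
per-object header can also differ between strings of different internal 
widths, so treat the subtraction as an approximation.

```python
import sys

overhead = sys.getsizeof("")             # size of a string with no characters
total = sys.getsizeof("abcœ…" * 1000)    # the 5000-character test string
print("empty string:", overhead, "bytes")
print("test string: ", total, "bytes")
print("difference:  ", total - overhead, "bytes for 5000 characters")
```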



-- 
Steven


