Py 3.3, unicode / upper()

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Dec 20 17:51:06 EST 2012


On Thu, 20 Dec 2012 11:40:21 -0800, wxjmfauth wrote:

> I do not care
> about this optimization. I'm not an ascii user. As a non ascii user,
> this optimization is just irrelevant.

WRONG.

Every Python user is an ASCII user. Every Python program has hundreds or 
thousands of ASCII strings.

py> import random


There's already one ASCII string in your code: the module name "random" 
is ASCII. Let's look inside that module:

py> dir(random)
['BPF', 'LOG4', 'NV_MAGICCONST', 'RECIP_BPF', 'Random', 'SG_MAGICCONST', 
'SystemRandom', 'TWOPI', '_BuiltinMethodType', '_MethodType', 
'_Sequence', '_Set', '__all__', '__builtins__', '__cached__', '__doc__', 
'__file__', '__initializing__', '__loader__', '__name__', '__package__', 
'_acos', '_ceil', '_cos', '_e', '_exp', '_inst', '_log', '_pi', 
'_random', '_sha512', '_sin', '_sqrt', '_test', '_test_generator', 
'_urandom', '_warn', 'betavariate', 'choice', 'expovariate', 
'gammavariate', 'gauss', 'getrandbits', 'getstate', 'lognormvariate', 
'normalvariate', 'paretovariate', 'randint', 'random', 'randrange', 
'sample', 'seed', 'setstate', 'shuffle', 'triangular', 'uniform', 
'vonmisesvariate', 'weibullvariate']


That's another 58 ASCII strings. Let's pick one of those:

py> dir(random.Random)
['VERSION', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', 
'__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', 
'__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', 
'__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', 
'__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', 
'__subclasshook__', '__weakref__', '_randbelow', 'betavariate', 'choice', 
'expovariate', 'gammavariate', 'gauss', 'getrandbits', 'getstate', 
'lognormvariate', 'normalvariate', 'paretovariate', 'randint', 'random', 
'randrange', 'sample', 'seed', 'setstate', 'shuffle', 'triangular', 
'uniform', 'vonmisesvariate', 'weibullvariate']

That's another 51 ASCII strings. Let's pick one of them:

py> dir(random.Random.shuffle)
['__annotations__', '__call__', '__class__', '__closure__', '__code__', 
'__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', 
'__eq__', '__format__', '__ge__', '__get__', '__getattribute__', 
'__globals__', '__gt__', '__hash__', '__init__', '__kwdefaults__', 
'__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__', 
'__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', 
'__sizeof__', '__str__', '__subclasshook__']

And another 34 ASCII strings.

So to get access to just *one* method of *one* class of *one* module, we 
have already seen up to 144 ASCII strings. (Some of them will be 
duplicated.)
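
If you would rather not count by hand, here is a rough sketch that does the same tally. The helper name names_seen is made up for this illustration, and the exact totals will differ between Python versions, so treat the numbers as indicative only.

```python
import random

def names_seen(*objects):
    """Collect every attribute name that dir() reports for the given objects."""
    names = []
    for obj in objects:
        names.extend(dir(obj))
    return names

# The module name itself, plus everything dir() shows on the way to shuffle.
names = ["random"] + names_seen(random, random.Random, random.Random.shuffle)

# A name is ASCII if every code point in it is below 128.
ascii_names = [n for n in names if all(ord(c) < 128 for c in n)]

print(len(names), "names seen,", len(set(names)), "distinct")
print("all ASCII:", len(ascii_names) == len(names))
```

The gap between the two counts is the duplication mentioned above: names such as "shuffle" and "__doc__" turn up at more than one level.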

Even if every one of *your* classes, methods, functions, modules and 
variables uses a non-ASCII name, you will still use ASCII strings for 
built-in functions and standard library modules.


> What should a Python user think, if he sees his strings are consuming
> more memory just because he uses non ascii characters

WRONG!

His strings are consuming just as much memory as they need to. You cannot 
fit ten thousand different characters into a single byte. A single byte 
can represent only 2**8 = 256 characters. Two bytes can represent 
2**16 = 65536 characters at most. Four bytes can represent every 
character ever used in human history, and more, but that is terribly 
wasteful: most strings need nothing like the 2**32 (about four billion) 
values that four bytes provide, and so a fixed four-byte character 
encoding uses up to four times as much memory as necessary.
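
You can measure the per-character cost yourself. The sketch below assumes CPython 3.3 or later, where the string storage width is an implementation detail rather than a language guarantee; the helper bytes_per_char is invented for this illustration. It compares two strings of different lengths so that the fixed object overhead cancels out.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by differencing two string sizes.

    Subtracting the size of an n-character string from the size of a
    2n-character string cancels the fixed per-object overhead.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),                      # e with acute accent
    ("BMP", "\u20ac"),                        # euro sign
    ("supplementary plane", "\U0001F600"),
]

for label, ch in samples:
    print(label, bytes_per_char(ch), "byte(s) per character")

# How many distinct values fit into one, two and four bytes.
print(2**8, 2**16, 2**32)
```

On such an interpreter, a string pays only for the widest character it actually contains.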


You are imagining that non-ASCII users are being discriminated against, 
with their strings being unfairly bloated. But that is not the case. 
Their strings would be equally large in a Python wide-build, give or take 
whatever per-string object overhead changes from version to 
version. If you are not comparing a wide-build of Python to Python 3.3, 
then your comparison is faulty. You are comparing "buggy Unicode, cannot 
handle the supplementary planes" with "fixed Unicode, can handle the 
supplementary planes". Python 3.2 narrow builds save memory by 
introducing bugs into Unicode strings. Python 3.3 fixes those bugs and 
still saves memory.
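
To make the narrow-build problem concrete, here is a small sketch for Python 3.3 or later. The helper utf16_code_units is made up for this illustration; it counts 16-bit code units, which is what len() reported on a narrow build. The test character is a Deseret letter, chosen only because it lies outside the Basic Multilingual Plane and has an uppercase form.

```python
def utf16_code_units(s):
    """Count the 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2

small = "\U00010428"    # DESERET SMALL LETTER LONG I
capital = "\U00010400"  # DESERET CAPITAL LETTER LONG I

print(len(small))                # counted in code points on 3.3+
print(utf16_code_units(small))   # counted in code units, as on a narrow build
print(small.upper() == capital)  # case mapping sees the whole character
```

Where the two counts disagree, a narrow build is working with surrogate halves rather than with the character itself, and that is where indexing and slicing go wrong.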


-- 
Steven


