RE Module Performance

Wed Jul 31 04:32:13 EDT 2013

FSR
===

The 'a' in 'a€' and in 'a\U0001d11e':

>>> ['{:#010b}'.format(c) for c in 'a€'.encode('utf-16-be')]
['0b00000000', '0b01100001', '0b00100000', '0b10101100']
>>> ['{:#010b}'.format(c) for c in 'a\U0001d11e'.encode('utf-32-be')]
['0b00000000', '0b00000000', '0b00000000', '0b01100001',
'0b00000000', '0b00000001', '0b11010001', '0b00011110']

This re-encoding of the 'a' has to be done.

>>> import sys
>>> sys.getsizeof('a€')
42
>>> sys.getsizeof('a\U0001d11e')
48
>>> sys.getsizeof('aa')
27
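
A rough way to see the per-character width behind those numbers, without depending on the exact header sizes of a given build. This is only a sketch: bytes_per_char is a made-up helper name, and it assumes CPython 3.3+, where sys.getsizeof reflects the internal string storage. Other implementations may report something else.

```python
import sys

def bytes_per_char(s):
    """Estimate how many bytes CPython uses per code point of s.

    Assumption: CPython 3.3+ (PEP 393). Doubling the string keeps the
    same widest code point, so the fixed header is unchanged and only
    the character payload grows.
    """
    if not s:
        raise ValueError("need a non-empty string")
    return (sys.getsizeof(s + s) - sys.getsizeof(s)) // len(s)

if __name__ == "__main__":
    for text in ('aa', 'a€', 'a\U0001d11e'):
        print(ascii(text), bytes_per_char(text))
```

The same 'a' costs a different number of bytes depending on its neighbour, which is the point of the sizes shown above.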

Unicode/utf*
============

i) ("primary key") Create and use a unique set of encoded
code points.
ii) ("secondary key") Depending on the goal,
memory or performance: utf-8/16/32

Two advantages in light of the above example:
iii) The "a" never has to be re-encoded.
iv) The size of an "a" never exceeds 4 bytes.
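
One possible reading of iii) and iv), as a small check. The names a_widths and a_is_stable are invented for this illustration; the "-be" codecs are chosen only so that no BOM gets in the way.

```python
# Encodings from point ii); the "-be" variants emit no byte order mark.
ENCODINGS = ('utf-8', 'utf-16-be', 'utf-32-be')

def a_widths():
    """Return the encoded size in bytes of a lone 'a' for each encoding."""
    return {enc: len('a'.encode(enc)) for enc in ENCODINGS}

def a_is_stable(text):
    """Return True if the leading 'a' of text has the same bytes as a
    lone 'a' in every encoding, whatever characters follow it."""
    return all(
        text.encode(enc).startswith('a'.encode(enc))
        for enc in ENCODINGS
    )

if __name__ == "__main__":
    print(a_widths())
    print(a_is_stable('a€'), a_is_stable('a\U0001d11e'))
```

Once the encoding is fixed, the bytes of the "a" do not depend on the rest of the string, and the widest case is utf-32 with 4 bytes.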

It is a hard job to satisfy i), ii), iii) and iv) at the same time.
Is it possible? ;-) The solution is in the problem.

jmf