Flexible string representation, unicode, typography, ...

wxjmfauth at gmail.com
Thu Aug 23 14:33:52 EDT 2012


On Thursday, 23 August 2012 15:57:50 UTC+2, Neil Hodgson wrote:
> wxjmfauth at gmail.com:
>
> > Small illustration. Take an A4 page containing 50 lines of 80 ASCII
> > characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
> > and you will see all the optimization efforts destroyed.
> >
> > >>> sys.getsizeof('a' * 80 * 50)
> > 4025
> > >>> sys.getsizeof('a' * 80 * 50 + '•')
> > 8040
>
>     This example is still benefiting from shrinking the number of bytes
> in half over using 32 bits per character, as was the case with Python 3.2:
>
> >>> sys.getsizeof('a' * 80 * 50)
> 16032
> >>> sys.getsizeof('a' * 80 * 50 + '•')
> 16036
>
Correct, but how often does that happen?
Practically never.

In all this Unicode work, I'm fascinated by the obsession
with solving a problem which is, by the very nature of
Unicode, unsolvable.

For every optimization algorithm, for every code point
range you optimize for, it is always possible to find
a case that breaks the optimization.
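A minimal sketch of this, assuming CPython 3.3+ with PEP 393 (exact
byte counts vary by build and platform): whatever width a string is
currently stored at, a single character above that range forces the
whole string to be re-encoded at the next width up.

import sys

# One character outside the current code point range re-encodes the
# entire string at a wider per-character size (PEP 393 behaviour).
ascii_s  = 'a' * 4000                # all < 0x100   -> 1 byte per char
bmp_s    = ascii_s + '\u2022'        # BULLET, < 0x10000 -> 2 bytes per char
astral_s = ascii_s + '\U0001F40D'    # astral plane  -> 4 bytes per char

for s in (ascii_s, bmp_s, astral_s):
    print(len(s), sys.getsizeof(s))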

This is almost mathematical logic: to prove a law valid,
you have to prove it holds in every case; to prove it
invalid, you only need to find one counterexample.

Sure, it is possible to optimize Unicode usage by avoiding
French characters, punctuation, mathematical symbols,
currency symbols, CJK characters...
(select your undesired characters here: http://www.unicode.org/charts/);
the sketch below shows what each such choice actually costs.
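A rough illustration, under the same CPython 3.3+ assumption, of what
one character from a few of those charts costs when appended to 1000
ASCII characters. The sample characters are my own picks; note that
French accented letters stay in the 1-byte Latin-1 range, while EM
DASH, currency and CJK characters need 2 bytes per character, and
astral characters need 4.

import sys

base = 'a' * 1000
samples = [
    ('ASCII only',            base),
    ('e with acute (U+00E9)', base + '\u00e9'),      # < 0x100
    ('EM DASH (U+2014)',      base + '\u2014'),      # < 0x10000
    ('EURO SIGN (U+20AC)',    base + '\u20ac'),      # < 0x10000
    ('CJK (U+4E2D)',          base + '\u4e2d'),      # < 0x10000
    ('Gothic (U+10348)',      base + '\U00010348'),  # >= 0x10000
]
for name, s in samples:
    print('%-24s %6d bytes' % (name, sys.getsizeof(s)))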

In that case, why use Unicode at all?
(A problem not specific to Python.)

jmf


