Flexible string representation, Unicode, typography, ...

Mark Lawrence breamoreboy at yahoo.co.uk
Thu Aug 23 15:34:29 EDT 2012


On 23/08/2012 19:33, wxjmfauth at gmail.com wrote:
> On Thursday, 23 August 2012 at 15:57:50 UTC+2, Neil Hodgson wrote:
>> wxjmfauth at gmail.com:
>>
>>> Small illustration. Take an A4 page containing 50 lines of 80 ASCII
>>> characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
>>> and you will see all the optimization efforts destroyed.
>>>
>>> >>> sys.getsizeof('a' * 80 * 50)
>>> 4025
>>> >>> sys.getsizeof('a' * 80 * 50 + '•')
>>> 8040
>>
>>     This example is still benefiting from shrinking the number of bytes
>> in half over using 32 bits per character, as was the case with Python 3.2:
>>
>> >>> sys.getsizeof('a' * 80 * 50)
>> 16032
>> >>> sys.getsizeof('a' * 80 * 50 + '•')
>> 16036
>>
> Correct, but how often does that happen?
> Practically never.
>
> In this Unicode business, I'm fascinated by the obsession
> with solving a problem which is, due to the nature of
> Unicode, unsolvable.
>
> For every optimization algorithm, for every code
> point range you can optimize, it is always possible
> to find a case that breaks the optimization.
>
> This follows quasi-mathematical logic: to prove a
> law is valid, you have to prove it holds in every case;
> to prove a law is invalid, you only need to find one
> counterexample.
>
> Sure, it is possible to optimize Unicode usage
> by avoiding French characters, punctuation, mathematical
> symbols, currency symbols, CJK characters...
> (select the undesired characters here: http://www.unicode.org/charts/).
>
> But in that case, why use Unicode at all?
> (A problem not specific to Python.)
>
> jmf
>
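
For anyone who wants to watch the width switching for themselves, here
is a minimal sketch, assuming CPython 3.3+ with the PEP 393 flexible
representation (the astral-character case is my addition to show the
third width; exact byte counts vary by build):

import sys

page = 'a' * 80 * 50  # 4000 ASCII characters -> stored 1 byte each

samples = [
    ('ASCII only', page),
    ('plus BULLET (U+2022)', page + '\u2022'),               # widens to 2 bytes/char
    ('plus an astral char (U+1F600)', page + '\U0001F600'),  # widens to 4 bytes/char
]
for label, s in samples:
    print('{0:32} {1:>6} bytes'.format(label, sys.getsizeof(s)))

On a 3.2 wide build you would instead see 4 bytes per character
regardless of content, which is the comparison Neil's figures above show.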

What do you propose should be used instead, as you appear to be the 
resident expert in the field?

-- 
Cheers.

Mark Lawrence.



