Flexible string representation, unicode, typography, ...

Ian Kelly ian.g.kelly at gmail.com
Thu Aug 23 15:22:16 EDT 2012


On Thu, Aug 23, 2012 at 12:33 PM,  <wxjmfauth at gmail.com> wrote:
>>> >>> sys.getsizeof('a' * 80 * 50)
>>> 4025
>>> >>> sys.getsizeof('a' * 80 * 50 + '•')
>>> 8040
>>
>> This example is still benefiting from shrinking the number of bytes
>> in half over using 32 bits per character as was the case with Python 3.2:
>>
>> >>> sys.getsizeof('a' * 80 * 50)
>> 16032
>> >>> sys.getsizeof('a' * 80 * 50 + '•')
>> 16036
>>
> Correct, but how many times does it happen?
> Practically never.

What are you talking about?  Surely it happens the same number of
times that your example happens, since it's the same example.  By
dismissing this example as being too infrequent to be of any
importance, you dismiss the validity of your own example as well.

> In this unicode stuff, I'm fascinated by the obsession
> to solve a problem which is, due to the nature of
> Unicode, unsolvable.
>
> For every optimization algorithm, for every code
> point range you can optimize, it is always possible
> to find a case breaking that optimization.

So what?  Similarly, for any generalized data compression algorithm,
it is possible to engineer inputs for which the "compressed" output is
as large as or larger than the original input (this is easy to prove).
Does this mean that compression algorithms are useless?  I hardly
think so, as evidenced by the widespread popularity of tools like gzip
and WinZip.
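The pigeonhole argument is easy to see empirically.  Here's a quick
sketch using the standard zlib module (the 100 KB size and compression
level are arbitrary choices for illustration):

```python
import os
import zlib

# Highly repetitive input: a general-purpose compressor shrinks it easily.
repetitive = b'a' * 100_000

# Random bytes are statistically incompressible: by the pigeonhole
# principle no lossless compressor can shrink every input, and DEFLATE
# falls back to "stored" blocks whose framing adds a few bytes of overhead.
random_data = os.urandom(100_000)

print(len(zlib.compress(repetitive, 9)))   # far smaller than 100000
print(len(zlib.compress(random_data, 9)))  # not smaller than 100000
```

The compressor is still obviously worthwhile for the inputs people
actually feed it, which is exactly the point being made about strings.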

You seem to be saying that because we cannot pack all Unicode strings
into 1-byte or 2-byte per character representations, we should just
give up and force everybody to use maximum-width representations for
all strings.  That is absurd.

> Sure, it is possible to optimize the unicode usage
> by not using French characters, punctuation, mathematical
> symbols, currency symbols, CJK characters...
> (select undesired characters here: http://www.unicode.org/charts/).
>
> In that case, why use unicode?
> (A problem not specific to Python)

Obviously, it is because I want to have the *ability* to represent all
those characters in my strings, even if I am not necessarily going to
take advantage of that ability in every single string that I produce.
Not all of the strings I use are going to fit into the 1-byte or
2-byte per character representation.  Fine, whatever -- that's part of
the cost of internationalization.  However, *most* of the strings that
I work with (this entire email message, for instance) -- and, I think,
most of the strings that any developer works with (identifiers in the
standard library, for instance) -- will fit into at least the 2-byte
per character representation.  Why shackle every string everywhere to
4 bytes per character when for a majority of them we can do much
better than that?
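That is exactly what CPython 3.3's flexible representation (PEP 393)
does: each str object uses 1, 2, or 4 bytes per character depending on
the widest code point it contains.  A small sketch (exact byte counts
vary by build, so only the relative sizes matter):

```python
import sys

# Under PEP 393, CPython picks the narrowest of three fixed-width
# layouts that can hold the string's widest code point.
ascii_s  = 'a' * 1000            # Latin-1 range: 1 byte per character
bmp_s    = '•' * 1000            # U+2022, in the BMP: 2 bytes per character
astral_s = '\U0001F600' * 1000   # outside the BMP: 4 bytes per character

for s in (ascii_s, bmp_s, astral_s):
    print(len(s), sys.getsizeof(s))
```

All three strings have the same length, but each steps up to a wider
layout only when its content actually requires it.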


