Flexible string representation, unicode, typography, ...

rusi rustompmody at gmail.com
Fri Aug 24 12:06:25 EDT 2012


On Aug 24, 12:22 am, Ian Kelly <ian.g.ke... at gmail.com> wrote:
> On Thu, Aug 23, 2012 at 12:33 PM,  <wxjmfa... at gmail.com> wrote:
> >> > >>> sys.getsizeof('a' * 80 * 50)
> >> > 4025
> >> > >>> sys.getsizeof('a' * 80 * 50 + '•')
> >> > 8040
> >>
> >>     This example is still benefiting from shrinking the number of bytes
> >> in half over using 32 bits per character as was the case with Python 3.2:
> >>
> >>  >>> sys.getsizeof('a' * 80 * 50)
> >> 16032
> >>  >>> sys.getsizeof('a' * 80 * 50 + '•')
> >> 16036
>
> > Correct, but how many times does it happen?
> > Practically never.
>
> What are you talking about?  Surely it happens the same number of
> times that your example happens, since it's the same example.  By
> dismissing this example as being too infrequent to be of any
> importance, you dismiss the validity of your own example as well.
>
> > In this unicode stuff, I'm fascinated by the obsession
> > to solve a problem which is, due to the nature of
> > Unicode, unsolvable.
>
> > For every optimization algorithm, for every code
> > point range you can optimize, it is always possible
> > to find a case breaking that optimization.
>
> So what?  Similarly, for any generalized data compression algorithm,
> it is possible to engineer inputs for which the "compressed" output is
> as large as or larger than the original input (this is easy to prove).
>  Does this mean that compression algorithms are useless?  I hardly
> think so, as evidenced by the widespread popularity of tools like gzip
> and WinZip.
>
> You seem to be saying that because we cannot pack all Unicode strings
> into 1-byte or 2-byte per character representations, we should just
> give up and force everybody to use maximum-width representations for
> all strings.  That is absurd.
>
> > Sure, it is possible to optimize the unicode usage
> > by not using French characters, punctuation, mathematical
> > symbols, currency symbols, CJK characters...
> > (select undesired characters here: http://www.unicode.org/charts/).
>
> > In that case, why use unicode?
> > (A problem not specific to Python)
>
> Obviously, it is because I want to have the *ability* to represent all
> those characters in my strings, even if I am not necessarily going to
> take advantage of that ability in every single string that I produce.
> Not all of the strings I use are going to fit into the 1-byte or
> 2-byte per character representation.  Fine, whatever -- that's part of
> the cost of internationalization.  However, *most* of the strings that
> I work with (this entire email message, for instance) -- and, I think,
> most of the strings that any developer works with (identifiers in the
> standard library, for instance) -- will fit into at least the 2-byte
> per character representation.  Why shackle every string everywhere to
> 4 bytes per character when for a majority of them we can do much
> better than that?
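
An aside on the compression analogy above, since it is easy to see
concretely.  A rough sketch (plain zlib from the stdlib; the exact byte
counts vary from run to run):

    import os, zlib

    data = os.urandom(4096)            # high-entropy input: effectively incompressible
    packed = zlib.compress(data, 9)    # ask for the best compression level
    print(len(data), len(packed))      # packed usually comes out a few bytes *larger*

Nobody concludes from that that zlib is useless; what matters is the common
case it handles well, and the same reasoning applies to the flexible string
representation.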

Actually, what exactly are you (jmf) asking for?
It's not clear to anybody, as best we can tell...
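
For anyone puzzled by the numbers quoted at the top: under PEP 393 (the
flexible string representation in 3.3) a str is stored with 1, 2 or 4 bytes
per code point, whichever is the narrowest width that fits its widest
character.  A minimal sketch of how that shows up in sys.getsizeof, assuming
CPython 3.3+ (exact sizes include per-object overhead and vary by build):

    import sys

    ascii_s  = 'a' * 4000               # every code point < 256 -> stored 1 byte per char
    bmp_s    = ascii_s + '\u2022'       # one BULLET (U+2022)    -> whole string 2 bytes per char
    astral_s = ascii_s + '\U0001F600'   # one non-BMP code point -> whole string 4 bytes per char

    for label, s in [('latin-1', ascii_s), ('BMP', bmp_s), ('astral', astral_s)]:
        print(label, len(s), sys.getsizeof(s))

One wide character re-widens the whole string, which is exactly what jmf's
example shows; the open question is how often that is a real cost in
practice.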


