Flexible string representation, unicode, typography, ...

Tue Aug 28 22:42:25 EDT 2012

On Aug 28, 4:57 am, Neil Hodgson <nhodg... at iinet.net.au> wrote:
> wxjmfa... at gmail.com:
>
> > Go "has" the integers int32 and int64. A rune ensures
> > the usage of int32. "Text libs" use runes. Go has only
> > bytes and runes.
>
>      Go's text libraries use UTF-8 encoded byte strings. Not arrays of
> runes. See, for example, http://golang.org/pkg/regexp/
>
>     Are you claiming that UTF-8 is the optimum string representation and
> therefore should be used by Python?
>
>     Neil

This whole rune/Go business is a red herring.
In the other thread Peter Otten wrote:

> wxjmfa... at gmail.com wrote:
> > By chance and luckily, first attempt.
> > c:\python32\python -m timeit "('€'*100+'€'*100).replace('€'
> > , 'œ')"
> > 1000000 loops, best of 3: 1.48 usec per loop
> > c:\python33\python -m timeit "('€'*100+'€'*100).replace('€'
> > , 'œ')"
> > 100000 loops, best of 3: 7.62 usec per loop
>
> OK, that is roughly factor 5. Let's see what I get:
>
> $ python3.2 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
> 100000 loops, best of 3: 1.8 usec per loop
> $ python3.3 -m timeit '("€"*100+"€"*100).replace("€", "œ")'
> 10000 loops, best of 3: 9.11 usec per loop
>
> That is factor 5, too. So I can replicate your measurement on an AMD64 Linux
> system with self-built 3.3 versus system 3.2.
>
> > Note
> > The used characters are not members of the latin-1 coding
> > scheme (btw an *unusable* coding).
> > They are however characters in cp1252 and mac-roman.
>
> You seem to imply that the slowdown is connected to the inability of latin-1
> to encode "œ" and "€" (to take the examples relevant to the above
> microbench). So let's repeat with latin-1 characters:
>
> $ python3.2 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
> 100000 loops, best of 3: 1.76 usec per loop
> $ python3.3 -m timeit '("ä"*100+"ä"*100).replace("ä", "ß")'
> 10000 loops, best of 3: 10.3 usec per loop
>
> Hm, the slowdown is even a tad bigger. So we can safely dismiss your theory
> that an unfortunate choice of the 8 bit encoding is causing it. Do you

In summary:
1. The problem is not specific to jmf's computer
2. It is not Windows-only
3. It is not directly related to whether the characters are latin-1
   encodable

The only question that is not yet clear is this:
A typical string operation with complexity O(n) has, in more detail,
a cost of the form a + bn, where a is the fixed per-call overhead and
b is the per-character cost.
If only a is worse going from 3.2 to 3.3, it may be a small issue.
If b is worse by even a tiny amount, it is likely to be a significant
regression for some use-cases.
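
One way to tell the two cases apart is to time the same operation at
several sizes and fit a straight line through the results. The following
is only a rough sketch, not a careful benchmark: the helper names
(time_replace, fit_line) and the chosen sizes are my own assumptions, and
the statement being timed is the replace() microbenchmark quoted above.

```python
# -*- coding: utf-8 -*-
import timeit


def time_replace(n, old="€", new="œ", repeat=3, number=200):
    """Best per-call time in seconds for the quoted microbenchmark at size n.

    Hypothetical helper: builds a 2n-character string and replaces
    every character, as in the timeit one-liners quoted above.
    """
    stmt = "(old * n + old * n).replace(old, new)"
    setup = "old = %r; new = %r; n = %d" % (old, new, n)
    runs = timeit.repeat(stmt, setup, repeat=repeat, number=number)
    return min(runs) / number


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    count = len(xs)
    mean_x = sum(xs) / float(count)
    mean_y = sum(ys) / float(count)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


if __name__ == "__main__":
    sizes = [100, 1000, 10000]  # arbitrary sizes chosen for illustration
    times = [time_replace(n) for n in sizes]
    a, b = fit_line(sizes, times)
    print("fixed overhead a  ~ %.3g s" % a)
    print("per-unit cost  b  ~ %.3g s" % b)
```

Running the same script under 3.2 and 3.3 and comparing the two (a, b)
pairs would show whether the slowdown sits in the constant term or in the
slope. Microbenchmarks are noisy, so the fitted values should be taken as
indicative only.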

So, doing some armchair thinking (I don't know the code or the
difficulty involved):

Clearly there are three string engines in the Python 3 world:
- 3.2 narrow
- 3.2 wide
- 3.3 (flexible)
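
For reference, a running interpreter can report which of these three it
is. The sketch below assumes CPython 3 and uses hypothetical helper names
of my own (string_engine, bytes_per_char). sys.maxunicode is 0xFFFF on a
narrow build and 0x10FFFF on a wide build; from 3.3 on it is always
0x10FFFF. The per-character storage figure comes from sys.getsizeof and
is a CPython implementation detail, not a language guarantee.

```python
import sys


def string_engine():
    """Name the string engine of the running interpreter (Python 3 assumed)."""
    if sys.version_info >= (3, 3):
        return "flexible"
    # Before 3.3 the width was fixed when the interpreter was built.
    return "wide" if sys.maxunicode > 0xFFFF else "narrow"


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by growing a string of ch by n copies.

    Subtracting the size of the one-character string cancels the
    fixed object header, leaving only the per-character cost.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


if __name__ == "__main__":
    print(string_engine())
    for ch in ("a", "\xe4", "\u20ac", "\U0001F600"):
        print(ascii(ch), bytes_per_char(ch))
```

On the flexible engine the per-character figure varies with the widest
character in the string, whereas on a narrow or wide build it is the same
for every string.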

How difficult would it be to offer the choice of string engine as a
command-line flag?
This would avoid the nuisance of having two binaries -- narrow and
wide.
And it would give the Python programmer a choice of efficiency
profiles.