Article on the future of Python

Wed Sep 26 13:04:50 EDT 2012

On Thu, Sep 27, 2012 at 2:52 AM, Paul Rubin <no.email at nospam.invalid> wrote:
> Chris Angelico <rosuav at gmail.com> writes:
>> When you compare against a wide build, semantics of 3.2 and 3.3 are
>> identical, and then - and ONLY then - can you sanely compare
>> performance. And 3.3 stacks up much better.
>
> I'd like to have seen real world benchmarks against a pure UTF-8
> implementation.  That means O(n) access to the n'th character of a
> string which could theoretically slow some programs down terribly, but I
> wonder how often that actually matters in ways that can't easily be
> worked around.

That's pretty much what we have with the PHP parts of our web site.
We've decreed that everything should be UTF-8 byte streams (actually,
it took some major campaigning from me to get rid of the underlying
thinking that "binary-safe" and "UTF-8" and "characters" and so on
were all equivalent), but there are very few places where we actually
index strings in PHP. There's a small amount of parsing, but it's all
done by splitting on particular strings - if you search for 0x0A in a
UTF-8 byte stream and split at that index, it's the same as searching
for U+000A in a Unicode string and splitting there, since bytes below
0x80 never occur inside a multi-byte UTF-8 sequence - and all of our
structural elements fit inside ASCII. The few times we actually care
about character length (e.g. limiting user-specified rule names to N
characters), we don't much care about performance, because those
checks are rare.
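
To make the splitting point concrete, here's a rough sketch in Python
rather than PHP. It is not our actual code: the sample text and the
rule_name_ok helper are made up for illustration. It splits the same
data once as UTF-8 bytes on 0x0A and once as a Unicode string on
U+000A, and then does the occasional "N characters" check by decoding
first.

```python
# Hypothetical sample text; the only structural delimiter (newline) is ASCII.
text = "naïve café\n日本語のテキスト\nplain ascii"
data = text.encode("utf-8")

# Split the UTF-8 byte stream on 0x0A, then decode each piece.
from_bytes = [part.decode("utf-8") for part in data.split(b"\x0a")]

# Split the Unicode string on U+000A.
from_text = text.split("\u000a")

# Both approaches cut the text at the same places.
same_split = from_bytes == from_text


def rule_name_ok(raw: bytes, limit: int) -> bool:
    """Check a UTF-8 encoded name against a limit in code points, not bytes.

    Hypothetical helper: decoding costs O(n), which is fine for a rare check.
    """
    return len(raw.decode("utf-8")) <= limit
```

Note that "café" is four code points but five bytes in UTF-8, so the
length check has to count decoded characters rather than call len() on
the raw bytes.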

So, I don't actually have any stats for you, because it's really easy
to just not index strings at all.
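
For reference, this is roughly what the O(n) lookup Paul mentions
would involve if you did have to index into raw UTF-8. It's only a
sketch under my own assumptions (utf8_char_at is a made-up name, the
input is assumed to be valid UTF-8, and negative indices aren't
handled), not a description of how any real implementation does it.

```python
def utf8_char_at(data: bytes, n: int) -> str:
    """Return the n'th code point of valid UTF-8 data by scanning from the start."""
    if n < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1    # index of the code point most recently started
    start = None  # byte offset where the n'th code point begins
    for i, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts
        # a new code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                # The next code point starts here, so the n'th one ends.
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    # The n'th code point was the last one in the data.
    return data[start:].decode("utf-8")
```

The scan has to walk every byte before the target, which is where the
O(n) comes from.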

ChrisA