RE Module Performance

Fri Jul 12 19:13:47 EDT 2013

On 07/12/2013 09:59 AM, Joshua Landau wrote:
> If you're interested, the basic of it is that strings now use a
> variable number of bytes to encode their values depending on whether
> values outside of the ASCII range and some other range are used, as an
> optimisation.

"Variable number of bytes" is a problematic way of saying it.  UTF-8 is
a variable-width encoding scheme where each character takes 1, 2, 3, or
4 bytes, depending on the Unicode code point.  As you can imagine, this
sort of encoding scheme is slow for indexing and slicing, because
finding the character at a given position means scanning from the start
of the string.  Python instead uses a fixed width within each string,
which preserves O(1) lookup: every character in a given string takes 1,
2, or 4 bytes, depending on the widest character that string contains.
Just in case the OP might have misunderstood what you were saying.
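
To make the difference concrete, here is a rough sketch that compares
the UTF-8 length of a few characters with the per-character storage
CPython appears to use internally.  The helper names utf8_len and
bytes_per_char are made up for illustration, and the sys.getsizeof
trick assumes CPython 3.3 or later (the flexible string representation
from PEP 393); other implementations may report different numbers.

```python
import sys

def utf8_len(ch):
    """Number of bytes needed to encode one character as UTF-8."""
    return len(ch.encode("utf-8"))

def bytes_per_char(ch):
    """Estimate per-character storage for a string made only of ch.

    Assumption: CPython 3.3+.  Subtracting the sizes of two strings
    of different lengths cancels out the fixed object header.
    """
    short = ch * 1000
    long_ = ch * 2000
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // 1000

# ASCII, Latin-1, BMP, and astral (beyond U+FFFF) examples.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print("U+%04X  utf-8: %d byte(s)  in-memory: %d byte(s)"
          % (ord(ch), utf8_len(ch), bytes_per_char(ch)))
```

The UTF-8 column varies from character to character, while the
in-memory width is uniform across a whole string, which is what keeps
indexing a constant-time operation.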

jmf sees the case where a string is promoted from one width to another,
and thinks that the brief slowdown in string operations to accomplish
this is a problem.  In reality I have never seen anyone use the types of
string operations his pseudo-benchmarks use, and in general Python 3's
string behavior is pretty fast.  It is apparently also much more correct
than it would be if jmf's ideas of Unicode were implemented.
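
For anyone curious what that promotion looks like, here is a small
sketch.  The function name storage_growth is hypothetical, and the
sizes assume CPython 3.3 or later; exact byte counts depend on the
platform and interpreter build, so treat the output as indicative only.

```python
import sys

def storage_growth(base, suffix):
    """Extra bytes used by base + suffix compared with base alone.

    Assumption: CPython 3.3+, where a string is stored at the width
    of its widest character.
    """
    return sys.getsizeof(base + suffix) - sys.getsizeof(base)

ascii_text = "a" * 1000

# Appending another ASCII character keeps the 1-byte width.
same_width = storage_growth(ascii_text, "b")

# Appending U+20AC forces the result to 2 bytes per character, so the
# whole string is copied at the wider width.
promoted = storage_growth(ascii_text, "\u20ac")

print("append ASCII: +%d bytes" % same_width)
print("append U+20AC: +%d bytes" % promoted)
```

The one-off copy on promotion is the cost being complained about; every
later index or slice on the result is still a fixed-width lookup.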