flaming vs accuracy [was Re: Performance of int/long in Python 3]

Thu Mar 28 09:01:57 EDT 2013

On Thu, 28 Mar 2013 23:11:55 +1100, Neil Hodgson wrote:

> Ian Foote:
> 
>> Specifically, indexing a variable-length encoding like utf-8 is not as
>> efficient as indexing a fixed-length encoding.
> 
>     Many common string operations do not require indexing by character
> which reduces the impact of this inefficiency. 

Which common string operations do you have in mind?

Specifically in Python's case, the most obvious counter-example is the 
length of a string. But that's only because Python strings are immutable 
objects, and include a field that records the length. So once the string 
is created, checking its length takes constant time.

Some string operations need to inspect every character, e.g. str.upper(). 
Even for them, the increased complexity of a variable-width encoding 
costs. It's not sufficient to walk the string inspecting a fixed 1, 2 or 
4 bytes per character. You have to walk the string grabbing 1 byte at a 
time, and then decide whether you need another 1, 2 or 3 bytes. Even 
though it's still O(N), the added bit-masking and overhead of variable-
width encoding adds to the overall cost. 

Any string method that takes a starting offset requires the method to 
walk the string byte-by-byte. I've even seen languages put responsibility 
for dealing with that onto the programmer: the "start offset" is given in 
*bytes*, not characters. I don't remember what language this was... it 
might have been Haskell? Whatever it was, it horrified me.

> UTF-8 seems like a
> reasonable choice for an internal representation to me.

It's not. Haskell, for example, uses UTF-8 internally, and it warns that 
this makes string operations O(N) instead of O(1) precisely because of 
the need to walk the string inspecting every byte.

Remember, when your string primitives are O(N), it is very easy to write 
code that becomes O(N**2). Using UTF-8 internally is just begging for 
user-written code to be O(N**2).

> One benefit of
> UTF-8 over Python's flexible representation is that it is, on average,
> more compact over a wide set of samples.

Sure. And over a different set of samples, it is less compact. If you 
write a lot of Latin-1, Python will use one byte per character, while 
UTF-8 will use two bytes per character.

-- 
Steven