Flexible string representation, unicode, typography, ...

Michael Torrie torriem at gmail.com
Sun Sep 2 15:45:03 EDT 2012


On 09/02/2012 12:58 PM, wxjmfauth at gmail.com wrote:
> My rationale: very simple.
> 
> 1) I never heard about something better than sticking with one
> of the Unicode coding schemes. (general theory)
> 2) I am not at all convinced by the "new" Py 3.3 algorithm. I'm not the
> only one guy, who noticed problems. Arguing, "it is fast enough", is not
> a correct answer.

If this is true, why were you holding up Google Go as an example of
doing it right?  Certainly Google Go doesn't line up with your rationale.
 Go has both strings and runes.  But strings are UTF-8-encoded byte
strings and runes are 32-bit integers.  They are not interchangeable
without a costly encoding and decoding process.  Even worse, indexing a
Go string to get a rune involves some very costly decoding that has to
be done starting at the beginning of the string each time.
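The cost described above can be sketched in Python terms: a Go string
behaves much like a Python bytes object holding UTF-8, and pulling out
code point i means decoding from the front of the buffer.  This is only
an illustrative sketch (the rune_at helper is hypothetical, not a real
API in either language):

```python
def rune_at(utf8_bytes: bytes, i: int) -> str:
    """Return code point i of a UTF-8 byte string -- O(n), not O(1),
    because the bytes must be decoded first (here the whole buffer,
    for simplicity)."""
    for pos, ch in enumerate(utf8_bytes.decode("utf-8")):
        if pos == i:
            return ch
    raise IndexError(i)

s = "héllo".encode("utf-8")   # 6 bytes for 5 code points
print(len(s))                 # 6 -- byte length, like len() on a Go string
print(rune_at(s, 1))          # 'é' -- found only by decoding from the front
```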

In the worst case, Python's strings are as slow as Go's, because Python
does the exact same thing as Go but chooses between three encodings
instead of just one.  In the best case, Python's strings can be much
faster than Go's, because indexing into 2 of the 3 encodings is O(1):
they are constant-width encodings.  If, as you say, the latin-1
subset of UTF-8 is used, then UTF-8 indexing is O(1) too; otherwise it's
probably O(n).
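The width selection is observable from Python itself.  A minimal sketch,
on CPython 3.3+ with PEP 393 (the "flexible string representation" under
discussion): the interpreter picks the narrowest fixed width -- 1, 2, or
4 bytes per character -- that fits the widest code point in the string.
Exact byte counts from sys.getsizeof are implementation details, but the
relative sizes of equal-length strings show the three widths:

```python
import sys

ascii_s  = "a" * 1000           # latin-1 range: 1 byte per character
bmp_s    = "\u0394" * 1000      # Greek Delta, BMP: 2 bytes per character
astral_s = "\U0001F600" * 1000  # emoji, astral plane: 4 bytes per character

# Same length, three different internal widths:
print(sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s))

# Indexing stays O(1) in every case, because each string is
# constant-width internally -- no decoding pass needed.
print(bmp_s[500])
```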
