[pypy-dev] PyPy 2 unicode class

Elefterios Stamatogiannakis estama at gmail.com
Thu Jan 23 21:45:25 CET 2014


On 23/1/2014 10:54 PM, Steven D'Aprano wrote:
> On Thu, Jan 23, 2014 at 01:27:50PM +0000, Oscar Benjamin wrote:
>
>> Steven wrote:
>>> With a UTF-8 implementation, won't that mean that string indexing
>>> operations are O(N) rather than O(1)? E.g. how do you know which UTF-8
>>> byte(s) to look at to get the character at index 42 without having to
>>> walk the string from the start?
>>
>> With an indexing cache you do this by looking it up in the cache. [...]
>> So there's an O(N) per-string cost if the string is indexed at least
>> once, and an O(k) cost per attempt to index into the string.
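
(To make the technique concrete, here is a minimal sketch of such an
indexing cache in Python 3. It is only an illustration of the idea, not
Oscar's actual design: it records the byte offset of every K-th code
point, so after a one-off O(N) cache build, indexing walks at most K-1
code points.)

    K = 32   # checkpoint spacing; a real implementation would tune this

    class Utf8String:
        def __init__(self, buf):
            buf.decode('utf-8')            # validate once up front
            self._buf = buf                # keep the UTF-8 bytes as-is
            self._offsets = None           # index cache, built lazily

        def _build_cache(self):            # O(N), runs once per string
            self._offsets = []
            count = 0
            for pos, byte in enumerate(self._buf):
                if byte & 0xC0 != 0x80:    # byte starts a code point
                    if count % K == 0:
                        self._offsets.append(pos)
                    count += 1
            self._length = count

        def __len__(self):
            if self._offsets is None:
                self._build_cache()
            return self._length

        def __getitem__(self, i):
            if self._offsets is None:
                self._build_cache()
            if not 0 <= i < self._length:
                raise IndexError(i)
            pos = self._offsets[i // K]    # jump to nearest checkpoint
            left = i % K                   # then walk at most K-1 points
            while left:
                pos += 1
                if self._buf[pos] & 0xC0 != 0x80:
                    left -= 1
            end = pos + 1                  # find where this point ends
            while end < len(self._buf) and self._buf[end] & 0xC0 == 0x80:
                end += 1
            return self._buf[pos:end].decode('utf-8')

For example, Utf8String(u'naïve café'.encode('utf-8'))[3] returns 'v'
by jumping to the nearest checkpoint and stepping over a handful of
bytes, without ever materializing a wide copy of the string.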
>
> Plus a memory cost for the cache, and increased complexity for indexing
> operations.
>
>
>> OTOH using utf-8 internally means that encoding and decoding as utf-8
>> become O(1) instead of O(N), and use less memory, since if you do
>> s = b.decode('utf-8') then s and b can share the same memory buffer.
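
(Again in terms of the toy Utf8String sketch above: because the original
bytes are kept, both directions of the UTF-8 round trip become cheap.
CPython's real str and bytes objects do not share buffers like this; the
snippet only illustrates the saving Oscar describes.)

    def decode_utf8(buf):
        # no copy of the data is made; for trusted input the up-front
        # validation could be skipped or deferred, making this O(1)
        return Utf8String(buf)

    def encode_utf8(s):
        return s._buf                  # O(1): the data is already UTF-8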
>
> CPython caches the UTF-8 encoding of (some? all?) strings too. It's not
> clear to me if they are cached lazily or eagerly, but either way, I'm
> not convinced by this argument that the caching is useful.
>
> In most applications, you'll decode from UTF-8 exactly once, when you
> read the string in (from a file, or the network, say), and you'll encode
> it at most once, when you write it out again. Many strings won't ever be
> encoded to UTF-8. I have difficulty imagining any application which will
> need to repeatedly encode the same string to UTF-8, where the cost
> of encoding is more significant than the cost of I/O. One possible
> exception might be printing to the terminal, say at the REPL, but even
> then, I don't imagine the time savings are worthwhile. I haven't
> noticed the REPL in Python 3.3 being particularly more responsive than
> that in 3.2. If there's a speed-up, it's invisible to the user, but at
> the cost of using extra memory for the UTF-8 cache.
>
>
>> The FSR does this for latin-1 strings, but utf-8 strings are more
>> useful. So if you're more likely to decode/encode from/to utf-8 than
>> you are to index into a long string, that's a big saving.
>
> But I don't believe that is the case for any realistic use-case. Just
> about every string operation involves walking the string, and it needs
> to walk it in characters (code points), not UTF-8 bytes.
> At the Python layer, string indexing might be relatively uncommon, but
> at the implementation layer, I think every string operation relies on
> it.
>
> But having said all this, I know that using UTF-8 internally for strings
> is quite common (e.g. Haskell does it, without even an index cache, and
> documents that indexing operations can be slow). CPython's FSR has
> received much (in my opinion, entirely uninformed) criticism from one
> vocal person in particular for not using UTF-8 internally. If PyPy goes
> ahead with using UTF-8 internally, I look forward to comparing memory
> and time benchmarks of string operations between CPython and PyPy.
>

I have to admit that, due to my work (databases and data processing),
I'm biased towards I/O (where UTF-8 is better due to its size) rather
than CPU.

At least in my use cases, the most frequent operations that I do on
strings are read, write, store, use them as keys in dicts, concatenate
and split.

For most of the above (with the possible exception of split), an index
cache would not be needed, and UTF-8, due to its smaller size, would be
faster than the wide unicode encodings.
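
(A quick way to see the size trade-off on CPython 3.3+, whose FSR stores
1, 2 or 4 bytes per code point depending on the widest character in the
string. The exact sys.getsizeof numbers vary between versions and
platforms, but the trend should be clear.)

    import sys

    texts = [
        u'hello world',          # pure ASCII: 1 byte/char in the FSR
        u'καλημέρα κόσμε',       # Greek: 2 bytes/char in the FSR
        u'mostly ASCII + \U0001F600',  # one astral code point forces
                                       # 4 bytes/char for the whole string
    ]
    for text in texts:
        utf8 = text.encode('utf-8')
        print(len(text), 'code points |',
              len(utf8), 'bytes as UTF-8 |',
              sys.getsizeof(text), 'bytes as a CPython str')

For the last string, UTF-8 needs 19 bytes of payload while the FSR needs
4 bytes for each of its 16 code points, and that is exactly the kind of
gap that matters when shuffling a lot of mostly-ASCII text around.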

l.
