Grapheme clusters, a.k.a.real characters

Chris Angelico rosuav at gmail.com
Thu Jul 20 14:02:09 EDT 2017


On Fri, Jul 21, 2017 at 2:10 AM, Random832 <random832 at fastmail.com> wrote:
> On Thu, Jul 20, 2017, at 01:15, Steven D'Aprano wrote:
>> I haven't really been paying attention to Marko's suggestion in detail,
>> but if we're talking about a whole new data type, how about a list of
>> nodes, where each node's data is a decomposed string object guaranteed to
>> be either:
>
> How about each node but the last has a fixed "length" (say, 16
> characters), and random access below that size is done by indexing to
> the node level and then walking forward.
>
> I've thought about this in the past for encoding strings in UTF-8 with
> O(1) random code point access.

You would have to benchmark it thoroughly. Don't forget that
allocating a large string would now require a number of memory
allocations (which are slow), and that cache locality is a huge source
of hidden performance variation.

Big O is far from the whole story. "But it's happening in constant
time!" is meaningless if the constant is too high.

ChrisA



More information about the Python-list mailing list