[Python-Dev] Internal representation of strings and Micropython

Glenn Linderman v+python at g.nevcal.com
Wed Jun 4 23:57:36 CEST 2014


On 6/4/2014 2:28 PM, Chris Angelico wrote:
> On Thu, Jun 5, 2014 at 6:50 AM, Glenn Linderman <v+python at g.nevcal.com> wrote:
>> 8) (Content specific variable size caches)  Index each codepoint that is a
>> different byte size than the previous codepoint, allowing indexing to be
>> used in the intervals. Worst case size is like 2, best case size is a single
>> entry for the end, when all code points are represented by the same number
>> of bytes.
> Conceptually interesting, and I'd love to know how well that'd perform
> in real-world usage.

So would I :)

> Would do very nicely on blocks of text that are
> all from the same range of codepoints, but if you intersperse high and
> low codepoints it'll be like 2 but with significantly more complicated
> lookups (imagine a "name=value\nname=value\n" stream where the names
> and values are all in the same language - you'll have a lot of
> transitions).

Lookup would be a binary search on the code point index, or a search
for the same in some tree structure, I would think.
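
A minimal sketch of that lookup for #8, assuming UTF-8 storage and an
index kept as three parallel lists with one entry per run of
equal-width code points. The names build_index, byte_offset and
utf8_width are made up for illustration; none of this is taken from
CPython or MicroPython.

```python
from bisect import bisect_right


def utf8_width(ch):
    """Number of bytes the code point occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def build_index(text):
    """One entry per run of equal-width code points (scheme 8).

    Returns three parallel lists: the code point index where each run
    starts, the byte offset where it starts, and the byte width shared
    by every code point in the run.
    """
    starts, offsets, widths = [], [], []
    byte_pos = 0
    for i, ch in enumerate(text):
        w = utf8_width(ch)
        # A new run begins whenever the width differs from the previous one.
        if not widths or w != widths[-1]:
            starts.append(i)
            offsets.append(byte_pos)
            widths.append(w)
        byte_pos += w
    return starts, offsets, widths


def byte_offset(index, cp_index):
    """Byte offset of code point cp_index (assumes 0 <= cp_index < len(text))."""
    starts, offsets, widths = index
    # Binary search for the last run that starts at or before cp_index.
    run = bisect_right(starts, cp_index) - 1
    return offsets[run] + (cp_index - starts[run]) * widths[run]


text = "ab\u00e9\u00e9c"           # UTF-8 widths: 1, 1, 2, 2, 1
index = build_index(text)
print(index)                       # ([0, 2, 4], [0, 2, 6], [1, 2, 1])
print(byte_offset(index, 3))       # 4
```

The lookup costs O(log r) for r runs, and the index shrinks to a
single entry when every code point has the same width.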

"like 2 but ..." well, the data structure would be bigger than for 2, 
but your example shows 4-5 high codepoints per low codepoint (for some 
languages).

I did just think of another refinement to this technique (my list was 
not intended to be all-inclusive... just a bunch of variations I thought 
of then).

10) (Content specific variable size caches) Like 8, but the last 
character in a run is allowed (but not required) to be a different 
number of bytes than prior characters, because the offset calculation 
will still work for the first character of a different size.

So #10 would halve the index size for your imagined stream, which
intersperses one low-byte character with each sequence of high-byte
characters.
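
A rough sketch of #10 under the same assumptions (UTF-8 storage, an
index of parallel lists, hypothetical names build_index_10,
count_runs_8, byte_offset). Each run is a body of equal-width code
points plus at most one trailing code point of a different width; the
greedy way the runs are formed here is my own choice, and other
splits would work too.

```python
from bisect import bisect_right


def utf8_width(ch):
    """Number of bytes the code point occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def build_index_10(text):
    """Scheme 10: equal-width body plus an optional odd-width last code point."""
    starts, offsets, widths = [], [], []
    byte_pos = 0
    i = 0
    while i < len(text):
        w = utf8_width(text[i])
        starts.append(i)
        offsets.append(byte_pos)
        widths.append(w)
        # Consume the body: consecutive code points of width w.
        while i < len(text) and utf8_width(text[i]) == w:
            byte_pos += w
            i += 1
        # Absorb one trailing code point of a different width, if any.
        if i < len(text):
            byte_pos += utf8_width(text[i])
            i += 1
    return starts, offsets, widths


def count_runs_8(text):
    """Entries scheme 8 would need: one per change in byte width."""
    ws = [utf8_width(ch) for ch in text]
    return sum(1 for i, w in enumerate(ws) if i == 0 or w != ws[i - 1])


def byte_offset(index, cp_index):
    """Byte offset of code point cp_index (assumes 0 <= cp_index < len(text))."""
    starts, offsets, widths = index
    run = bisect_right(starts, cp_index) - 1
    # Still correct for the trailing code point: everything before it
    # in the run has width widths[run].
    return offsets[run] + (cp_index - starts[run]) * widths[run]


# "name=value\n" lines with Cyrillic (2-byte) names and values.
stream = "\u0438\u043c\u044f=\u0437\u043d\u0430\u0447\n" * 3
index = build_index_10(stream)
print(count_runs_8(stream), len(index[0]))   # 12 6
```

The offset formula is unchanged from #8; only the rule for where a
run ends differs. Finding the byte length of a trailing code point
needs the next entry's offset or a look at its lead byte.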