[New-bugs-announce] [issue31484] Cache single-character strings outside of the Latin1 range

Serhiy Storchaka report at bugs.python.org
Fri Sep 15 09:29:39 EDT 2017


New submission from Serhiy Storchaka:

Single-character strings in the Latin1 range (U+0000 - U+00FF) are shared in CPython. This saves memory and CPU time of per-character processing of strings containing ASCII characters and characters from Latin based alphabets. But the users of languages that use non-Latin based alphabets are not so lucky. Proposed PR adds a cache for characters in BMP (U+0100 - U+FFFF) which covers most alphabetic scripts.

Most alphabets contain characters from a single 128- or 256-character block, therefore only lowest bits are used for addressing in the cache. But together with the characters from a particular alphabet it is common to use ASCII characters (spaces, newline, punctuations, digits, etc) and few Unicode punctuation (long dash, Unicode quotes, etc). Their low bits  can match low bits of letters. Therefore every index addresses not a single character, but a mini-LRU-cache of size 2. This keeps letters in a cache even if non-letters with conflicting low bits are occurred.

Microbenchmarking results.

Iterating sample non-Latin-based alphabetic text (Iliad by Homer [1]) is over 70% faster:

$ ./python -m timeit -s 's = open("36248-0.txt").read()' -- 'for c in s: pass'
Unpatched:  20 loops, best of 5: 14.5 msec per loop
Patched:    50 loops, best of 5: 8.32 msec per loop

Iterating sample hieroglyphic text (Shui Hu Zhuan by Nai an Shi [2]) is about 4% slower:

$ ./python -m timeit -s 's = open("23863-0.txt").read()' -- 'for c in s: pass'
Unpatched:  20 loops, best of 5: 11.7 msec per loop
Patched:    20 loops, best of 5: 12.1 msec per loop

Iterating a string containing non-repeated characters from the all BMP range is 20% slower:

$ ./python -m timeit -s 's = "".join(map(chr, range(0x10000)))' -- 'for c in s: pass'
Unpatched:  200 loops, best of 5: 1.39 msec per loop
Patched:    200 loops, best of 5: 1.7 msec per loop


[1] https://www.gutenberg.org/files/36248/36248-0.txt
[2] https://www.gutenberg.org/files/23863/23863-0.txt

----------
components: Interpreter Core, Unicode
messages: 302253
nosy: benjamin.peterson, ezio.melotti, haypo, lemburg, serhiy.storchaka
priority: normal
severity: normal
stage: patch review
status: open
title: Cache single-character strings outside of the Latin1 range
type: performance
versions: Python 3.7

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue31484>
_______________________________________


More information about the New-bugs-announce mailing list