[Python-Dev] PEP 393 close to pronouncement

Victor Stinner victor.stinner at haypocalc.com
Thu Sep 29 02:27:48 CEST 2011


> Resizing
> --------
> 
> Codecs use resizing a lot. Given that PyCompactUnicodeObject
> does not support resizing, most decoders will have to use
> PyUnicodeObject and thus not benefit from the memory footprint
> advantages of e.g. PyASCIIObject.

Wrong. Even if you create a string using the legacy API (e.g. 
PyUnicode_FromUnicode), the string will quickly be compacted to use the most 
efficient memory storage (depending on its maximum character). "Quickly" means 
at the first call to PyUnicode_READY; Python tries to make all strings ready 
as early as possible.
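The pattern looks roughly like this (a minimal sketch written as 3.3 C 
extension code, with error handling reduced to the minimum):

#include <Python.h>
#include <assert.h>

/* Create a string with the legacy Py_UNICODE API, then make it "ready".
   After PyUnicode_READY(), the characters are stored in the most efficient
   form: 1, 2 or 4 bytes per character, depending on the maximum character. */
static PyObject *
legacy_then_ready(void)
{
    PyObject *str = PyUnicode_FromUnicode(NULL, 3);  /* not "ready" yet */
    Py_UNICODE *buf;
    if (str == NULL)
        return NULL;

    buf = PyUnicode_AS_UNICODE(str);
    buf[0] = 'a'; buf[1] = 'b'; buf[2] = 'c';

    if (PyUnicode_READY(str) < 0) {      /* compact the storage */
        Py_DECREF(str);
        return NULL;
    }
    /* pure ASCII content: now stored as 1 byte per character */
    assert(PyUnicode_KIND(str) == PyUnicode_1BYTE_KIND);
    return str;
}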

> PyASCIIObject has a wchar_t *wstr pointer - I guess this should
> be a char *str pointer, otherwise, where's the memory footprint
> advantage (esp. on Linux where sizeof(wchar_t) == 4) ?

For pure ASCII strings, you don't have to store a pointer to the UTF-8 string, 
nor the length of the UTF-8 string (in bytes), nor the length of the wchar_t 
string (in wide characters): the length is always the length of the "ASCII" 
string, and the UTF-8 string is shared with the ASCII string. The structure is 
much smaller thanks to these optimizations, and so Python 3.3 uses less memory 
than 2.7 for ASCII strings, even for short strings.
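For reference, here is roughly the layout described in the PEP (the real 
definitions live in Include/unicodeobject.h; the structs are renamed here only 
so that the snippet stands alone), which shows which fields disappear in the 
ASCII case:

#include <Python.h>   /* PyObject_HEAD, Py_ssize_t, Py_hash_t */
#include <wchar.h>

/* Illustrative copy of the structures from the PEP. */
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* number of code points */
    Py_hash_t hash;
    struct {
        unsigned int interned:2;
        unsigned int kind:2;    /* 1, 2 or 4 bytes per character */
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
    } state;
    wchar_t *wstr;              /* lazily created Py_UNICODE representation */
} ExamplePyASCIIObject;
/* For pure ASCII, the character data follows the structure directly and is
   already valid UTF-8, so nothing else is needed. */

typedef struct {
    ExamplePyASCIIObject _base;
    Py_ssize_t utf8_length;     /* needed only for non-ASCII strings */
    char *utf8;                 /* cached UTF-8 representation, or NULL */
    Py_ssize_t wstr_length;
} ExamplePyCompactUnicodeObject;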

> I also don't see a reason to limit the UCS1 storage version
> to ASCII. Accordingly, the object should be called PyLatin1Object
> or PyUCS1Object.

Latin1 is less interesting: you cannot share the length/data fields with utf8 
or wstr. We didn't add a special case for Latin1 strings (other than using 
Py_UCS1* buffers to store their characters).

> Furthermore, determining len(obj) will require a loop over
> the data, checking for surrogate code points. A simple memcpy()
> is no longer enough.

Wrong. len(obj) gives the "right" result (see the long discussion in a previous 
thread about what the length of a string is...) in O(1), since it is computed 
when the string is created.

> ... in practice you only
> very rarely see any non-BMP code points in your data. Making
> all Python users pay for the needs of a tiny fraction is
> not really fair. Remember: practicality beats purity.

The creation of the string is maybe a little bit slower (especially when you 
have to scan the string twice, first to get the maximum character), but I think 
that this slowdown is smaller than the speedup allowed by the PEP.
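A rough sketch of the two passes (a hypothetical helper taking a UCS4 buffer; 
the real code paths in unicodeobject.c are of course more general):

#include <Python.h>

/* Hypothetical helper showing the two passes: first scan the input to find
   the maximum character, then allocate the narrowest storage (1, 2 or
   4 bytes per character) and copy the characters into it. */
static PyObject *
from_ucs4_buffer(const Py_UCS4 *u, Py_ssize_t size)
{
    Py_UCS4 maxchar = 0;
    Py_ssize_t i;
    PyObject *str;
    int kind;
    void *data;

    /* pass 1: find the maximum character */
    for (i = 0; i < size; i++) {
        if (u[i] > maxchar)
            maxchar = u[i];
    }

    /* pass 2: allocate the right representation and fill it */
    str = PyUnicode_New(size, maxchar);
    if (str == NULL)
        return NULL;
    kind = PyUnicode_KIND(str);
    data = PyUnicode_DATA(str);
    for (i = 0; i < size; i++)
        PyUnicode_WRITE(kind, data, i, u[i]);
    return str;
}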

Because ASCII strings are now stored as char*, I think that processing ASCII 
strings is faster: the CPU can keep more of the data close to it in its caches.

We can optimize ASCII and Latin1 strings further (it's faster to manipulate 
char* than uint16_t* or uint32_t*). For example, str.center(), str.ljust(), 
str.rjust() and str.zfill() now use the very fast memset() function to pad 
Latin1 strings.
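The idea looks like this (a simplified sketch, not the actual str.center() 
code; it assumes the input string is ready, stored as 1 byte per character, 
and that fillchar fits in Latin1):

#include <Python.h>
#include <string.h>

/* Simplified sketch of the Latin1 fast path. With 1 byte per character,
   padding is one memcpy() plus two memset() calls. */
static PyObject *
center_latin1(PyObject *str, Py_ssize_t width, Py_UCS1 fillchar)
{
    Py_ssize_t len = PyUnicode_GET_LENGTH(str);
    Py_ssize_t left, right;
    Py_UCS4 maxchar;
    PyObject *result;
    Py_UCS1 *out;

    if (width <= len) {
        Py_INCREF(str);
        return str;                      /* nothing to pad */
    }
    left = (width - len) / 2;
    right = width - len - left;

    maxchar = PyUnicode_MAX_CHAR_VALUE(str);
    if (fillchar > maxchar)
        maxchar = fillchar;

    result = PyUnicode_New(width, maxchar);
    if (result == NULL)
        return NULL;

    out = PyUnicode_1BYTE_DATA(result);
    memset(out, fillchar, left);
    memcpy(out + left, PyUnicode_1BYTE_DATA(str), len);
    memset(out + left + len, fillchar, right);
    return result;
}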

Another example: duplicating a string (or creating a substring) should be 
faster simply because there is less data to copy (e.g. 10 bytes for a string 
of 10 Latin1 characters, vs 20 or 40 bytes with Python 3.2).

The two most common encodings in the world are ASCII and UTF-8. With the PEP 
393, encoding to ASCII or UTF-8 is free: you don't have to encode anything, 
you directly have the encoded char* buffer (whereas in Python 3.2 you have to 
convert 16/32-bit wchar_t to char*, even for pure ASCII). (It's also free to 
encode a "Latin1" Unicode string to Latin1.)

With the PEP 393, we never have to decode UTF-16 anymore when iterating over 
code points to support non-BMP characters correctly (which was required before 
in narrow builds, e.g. on Windows). Iterating over code points is now just a 
plain loop; there is no need to check whether each character is in the 
surrogate range U+D800-U+DFFF.
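For example, counting non-BMP code points is a plain loop over PyUnicode_READ 
(a sketch):

#include <Python.h>

/* Sketch: count non-BMP code points by iterating directly over code points.
   PyUnicode_READ() always yields a full Py_UCS4 code point, whatever the
   internal storage (1, 2 or 4 bytes), so no surrogate handling is needed. */
static Py_ssize_t
count_nonbmp(PyObject *str)
{
    int kind;
    void *data;
    Py_ssize_t i, len, count = 0;

    if (PyUnicode_READY(str) < 0)
        return -1;
    kind = PyUnicode_KIND(str);
    data = PyUnicode_DATA(str);
    len = PyUnicode_GET_LENGTH(str);

    for (i = 0; i < len; i++) {
        if (PyUnicode_READ(kind, data, i) > 0xFFFF)
            count++;
    }
    return count;
}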

There are other funny tricks (optimizations). For example, text.replace(a, b) 
knows that there is nothing to do if maxchar(a) > maxchar(text), where 
maxchar(obj) only requires reading an attribute of the string. Think about 
ASCII and non-ASCII strings: pure_ascii.replace('\xe9', '') now just creates a 
new reference...
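The check itself is as cheap as this (a hypothetical helper, not the actual 
replace() code; it assumes both strings are ready):

#include <Python.h>

/* If the substring contains a character larger than any character of the
   text, it cannot occur in the text, so there is nothing to replace. */
static int
replace_is_noop(PyObject *text, PyObject *substring)
{
    return PyUnicode_MAX_CHAR_VALUE(substring) > PyUnicode_MAX_CHAR_VALUE(text);
}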

I don't think that Martin wrote his PEP to be able to implement all these 
optimizations, but they are an interesting side effect of his PEP :-)

> The table only lists string sizes up 8 code points. The memory
> savings for these are really only significant for ASCII
> strings on 64-bit platforms, if you use the default UCS2
> Python build as basis.

Of the 32 different cases, the PEP 393 is better in 29 and "just" as good as 
Python 3.2 in 3 corner cases:

- a 1-character ASCII string, 16-bit wchar, 32-bit build
- a 1-character Latin1 string, 32-bit wchar, 32-bit build
- a 2-character Latin1 string, 32-bit wchar, 32-bit build

Do you really care about these corner cases? See the more realistic benchmark 
in Martin's previous email ("PEP 393 memory savings update"): the PEP 393 not 
only uses 3x less memory than 3.2, it also uses *less* memory than Python 2.7, 
even though Python 3 uses Unicode for everything!

> For larger strings, I expect the savings to be more significant.

Sure.

> OTOH, a single non-BMP code point in such a string would cause
> the savings to drop significantly again.

In this case, it's just as good as Python 3.2 in wide mode, but worse than 3.2 
in narrow mode. But is it a real use case?

If you want really efficient storage for heterogeneous strings (mixing ASCII, 
Latin1, BMP and non-BMP), you can split the text into chunks. For example, I 
hope that a text processor like LibreOffice doesn't store all paragraphs in the 
same string, but creates at least one string per paragraph. If you use short 
chunks, you will not notice the difference in memory footprint when you insert 
a non-BMP character. The trick doesn't work on Python < 3.3.

> For best performance, each algorithm will have to be implemented
> for all three storage types. ...

Good performance can be achieved using PyUnicode macros like PyUnicode_READ 
and PyUnicode_WRITE. But yes, if you want a super-fast Unicode processor, you 
can special-case some kinds (UCS1, UCS2, UCS4), like the examples I described 
before (using memset for Latin1).

> ... Not doing so, will result in a slow-down, if I read the PEP
> correctly.

I don't think so. Browse the new unicodeobject.c: there are few switch/case 
statements on the kind (if you ignore low-level functions like 
_PyUnicode_Ready). For example, unicode_isalpha() has only one implementation, 
using PyUnicode_READ. PyUnicode_READ doesn't use a switch, just classic (fast) 
pointer arithmetic.

> It's difficult to say, of what scale, since that
> information is not given in the PEP, but the added loop over
> the complete data array in order to determine the maximum
> code point value suggests that it is significant.

Feel free to run Antoine's benchmarks, like stringbench and iobench, yourself; 
they are micro-benchmarks. But you have to know that very few codecs use the 
new Unicode API so far (I think that only the UTF-8 encoder and decoder use 
the new API, maybe also the ASCII codec).

> I am not convinced that the memory savings are big enough
> to warrant the performance penalty and added complexity
> suggested by the PEP.

I didn't run any benchmark, but I don't think that the PEP 393 makes Python 
slower; I even expect a minor speedup in some corner cases :-) I prefer to 
wait until all modules are converted to the new API before running benchmarks. 
TODO: unicodedata, _csv, all codecs (especially error handlers), ...

> In practice, using a UCS2 build of Python usually is a good
> compromise between memory savings, performance and standards
> compatibility

About "standards compatibility", the work to support non-BMP characters 
everywhere was not finished in Python 3.2, 11 years after the introduction of 
Unicode in Python (2.0). Using the new API, non-BMP characters will be 
supported for free, everywhere (especially in *Python*, "\U0010FFFF"[0] and 
len("\U0010FFFF") doesn't give surprising results anymore).

With the addition of emoticons in a non-BMP range in Unicode 6, non-BMP 
characters will become more and more common. Who doesn't like emoticons? :-) 
o;-) >< (no, I will not add non-BMP characters to this email, I don't want to 
crash your SMTP server and mail client)

> IMHO, Python should be optimized for UCS2 usage

With the PEP 393, it's better: Python is optimized for any usage! (though I 
expect it to be faster in the Latin1 range, U+0000-U+00FF)

> I do see the advantage for large strings, though.

A friend reads Martin's last benchmark differently: Python 3.2 uses 3x more 
memory than Python 2! Can I say that the PEP 393 fixed a huge regression in 
Python 3?

> Given that I've been working on and maintaining the Python Unicode
> implementation actively or by providing assistance for almost
> 12 years now, I've also thought about whether it's still worth
> the effort.

Thanks for your huge work on Unicode, Marc-Andre!

> My interests have shifted somewhat into other directions and
> I feel that helping Python reach world domination in other ways
> makes me happier than fighting over Unicode standards, implementations,
> special cases that aren't special enough, and all those other
> nitty-gritty details that cause long discussions :-)

Someone said that we still need to define what a character is! By the way, what 
is a code point?

> So I feel that the PEP 393 change is a good time to draw a line
> and leave Unicode maintenance to Ezio, Victor, Martin, and
> all the others that have helped over the years. I know it's
> in good hands.

I don't understand why you would like to stop contributing to Unicode, but 
well, as you wish. We will try to continue your work.

Victor

