[Python-3000] PyUnicodeObject implementation

Tue Sep 9 11:32:37 CEST 2008

Before jumping to conclusions, please read the discussion on the
patch ticket:

    http://bugs.python.org/issue1943

It turned out that the patch only provides a marginal performance
improvement, so the main perceived argument for the PyVarObject
implementation turns out not to be a real advantage.

The reasons for choosing a PyObject approach for Unicode rather than
a PyVarObject one like for strings were the following:

 * a pointer to the actual data makes it possible to implement
   optimizations that share data, e.g. slice objects that a
   parser generates when parsing a larger input string or
   view objects that turn a memory mapped file into a live
   Unicode object without any copying overhead

 * a fixed object size makes good use of the Python
   allocator, since all objects live in the same pool; as a result
   you get better cache locality, which helps in situations
   where you have to deal with lots of objects

 * objects should be small in order to have lots of them in
   the free lists

 * resizing the object should not change the object's address,
   since resizing is a common operation when creating
   Unicode objects

 * a fixed size PyObject makes extending the object at C level
   very easy

(probably a few more that I've forgotten - it's been a while
since the days of Python 1.6)
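
To illustrate the data sharing point, here is a minimal standalone
sketch. It does not use the real PyUnicodeObject; FixedStr and the
fixedstr_* helpers are hypothetical names made up for this example,
and it assumes the caller keeps the base object alive for as long as
the slice exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a fixed-size object: the header has a
   constant size and only points at the character data. */
typedef struct {
    size_t length;      /* number of characters */
    const char *str;    /* character data; may belong to another object */
    int owns_data;      /* 1 if str has to be freed with the object */
} FixedStr;

/* Create an object that owns a private copy of the data. */
static FixedStr *fixedstr_new(const char *data, size_t length)
{
    FixedStr *s = malloc(sizeof(FixedStr));
    char *copy;

    if (s == NULL)
        return NULL;
    copy = malloc(length + 1);
    if (copy == NULL) {
        free(s);
        return NULL;
    }
    memcpy(copy, data, length);
    copy[length] = '\0';
    s->length = length;
    s->str = copy;
    s->owns_data = 1;
    return s;
}

/* Create a slice that shares the buffer of base: only a new header is
   allocated, no characters are copied. */
static FixedStr *fixedstr_slice(const FixedStr *base, size_t start, size_t end)
{
    FixedStr *s;

    assert(start <= end && end <= base->length);
    s = malloc(sizeof(FixedStr));
    if (s == NULL)
        return NULL;
    s->length = end - start;
    s->str = base->str + start;
    s->owns_data = 0;
    return s;
}

/* Free the header, and the data only if this object owns it. */
static void fixedstr_free(FixedStr *s)
{
    if (s->owns_data)
        free((void *)s->str);
    free(s);
}
```

With an inline data layout the slice would have to copy its characters
into its own allocation, since the data cannot live anywhere else.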

The disadvantages of PyVarObjects with respect to extending them in C
were made rather clear in this thread:

 * finding the extensions requires pointer arithmetic

 * the alignment of the extended parts has to be dealt with
   in the object implementation (rather than having the compiler
   take care of this)

 * when resizing the object's data, the extension parts have to
   be copied and realigned as well

 * when resizing the object's data, the addresses of the extension
   parts change, so code has to be aware of this, e.g. caching of
   the offsets is not easily possible
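
A rough sketch of what that pointer arithmetic looks like, under the
assumption that the extension part is placed directly after the
inline character data. VarStr, VarStrExt and the helper functions are
hypothetical names for this example, not part of the patch or of the
C API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical variable-sized object: the data is stored inline. */
typedef struct {
    size_t length;   /* number of characters stored inline */
    char data[];     /* length + 1 bytes, allocated with the header */
} VarStr;

/* Hypothetical fields a subtype wants to add after the data. */
typedef struct {
    void *cache;
    long flags;
} VarStrExt;

/* Alignment required by VarStrExt, derived with a probe struct. */
struct ext_align_probe { char c; VarStrExt ext; };
#define EXT_ALIGN offsetof(struct ext_align_probe, ext)

/* Offset of the extension part from the start of the object. It
   depends on the length, so it cannot be cached across a resize. */
static size_t ext_offset(size_t length)
{
    size_t end = offsetof(VarStr, data) + length + 1;  /* +1: terminator */

    /* Round up by hand; the compiler does not do this for us. */
    return (end + EXT_ALIGN - 1) / EXT_ALIGN * EXT_ALIGN;
}

/* Locate the extension part of an existing object. */
static VarStrExt *varstr_ext(VarStr *s)
{
    assert(s != NULL);
    return (VarStrExt *)((char *)s + ext_offset(s->length));
}

/* Allocate header, inline data and extension part in one block. */
static VarStr *varstr_alloc(size_t length)
{
    VarStr *s = malloc(ext_offset(length) + sizeof(VarStrExt));

    if (s == NULL)
        return NULL;
    s->length = length;
    s->data[length] = '\0';
    return s;
}
```

With a fixed-size object, the same extension is just a few extra
struct members after the base struct, and the compiler takes care of
both the offset and the alignment.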

There are also more general disadvantages:

 * resizing the object can cause a change in the object's address,
   so code has to be aware of this

 * objects are spread over many different pools in the memory
   allocator, reducing cache locality

 * keeping PyVarObjects in the free lists requires more memory
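
The address issue can be sketched as follows. PtrStr and InlineStr
are hypothetical simplified layouts (not the real structs), and
realloc() stands in for whatever the object allocator would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Pointer layout: fixed-size header, data in a separate buffer. */
typedef struct {
    size_t length;
    char *str;
} PtrStr;

/* Inline layout: header and data share a single allocation. */
typedef struct {
    size_t length;
    char data[];
} InlineStr;

/* Resize in place: only the data buffer may move, so every pointer
   to the object itself stays valid. Returns 0 on success, -1 on
   failure. */
static int ptrstr_resize(PtrStr *s, size_t new_length)
{
    char *p;

    assert(s != NULL);
    p = realloc(s->str, new_length + 1);
    if (p == NULL)
        return -1;
    if (new_length > s->length)
        memset(p + s->length, 0, new_length - s->length);
    p[new_length] = '\0';
    s->str = p;
    s->length = new_length;
    return 0;
}

/* Resize the whole object: the object may move, so the caller has to
   replace every reference it holds with the returned pointer. Passing
   NULL allocates a fresh object. */
static InlineStr *inlinestr_resize(InlineStr *s, size_t new_length)
{
    InlineStr *p = realloc(s, offsetof(InlineStr, data) + new_length + 1);

    if (p == NULL)
        return NULL;
    p->length = new_length;
    p->data[new_length] = '\0';
    return p;
}
```

The second function has the same shape as the resize functions that
take a pointer to the object pointer: they are only safe when the
caller holds the only reference to the object.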

IMHO, it's a lot better to tweak the parameters that we have
in the Unicode implementation (e.g. raise the KEEPALIVE_SIZE_LIMIT
to 32, see the ticket for details) and to improve
the memory allocator for storage of small memory chunks or
improve the free list management (which Antoine did with his
free list patch).
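
For readers who don't know the keep-alive scheme, here is a much
simplified sketch of the idea, not the actual code in
unicodeobject.c. FLStr, flstr_new, flstr_free and MAX_FREE are
hypothetical names; only KEEPALIVE_SIZE_LIMIT is taken from the
implementation, set here to the value suggested above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define KEEPALIVE_SIZE_LIMIT 32   /* value suggested in this post */
#define MAX_FREE 1024             /* hypothetical cap on the free list */

typedef struct FLStr {
    size_t length;
    char *str;            /* data buffer, or NULL if it was released */
    struct FLStr *next;   /* link, used only while on the free list */
} FLStr;

static FLStr *free_list = NULL;
static int num_free = 0;

/* Allocate an object, reusing a header (and, if it is large enough,
   its kept buffer) from the free list when possible. */
static FLStr *flstr_new(size_t length)
{
    FLStr *s = free_list;

    if (s != NULL) {
        free_list = s->next;
        num_free--;
        if (s->str != NULL && s->length < length) {
            free(s->str);          /* kept buffer is too small */
            s->str = NULL;
        }
    }
    else {
        s = malloc(sizeof(FLStr));
        if (s == NULL)
            return NULL;
        s->str = NULL;
    }
    if (s->str == NULL) {
        s->str = malloc(length + 1);
        if (s->str == NULL) {
            free(s);
            return NULL;
        }
    }
    s->length = length;
    s->str[length] = '\0';
    s->next = NULL;
    return s;
}

/* Release an object: the fixed-size header goes onto the free list;
   the buffer is kept only if it is below the keep-alive limit. */
static void flstr_free(FLStr *s)
{
    assert(s != NULL);
    if (num_free >= MAX_FREE) {
        free(s->str);
        free(s);
        return;
    }
    if (s->length >= KEEPALIVE_SIZE_LIMIT) {
        free(s->str);
        s->str = NULL;
        s->length = 0;
    }
    s->next = free_list;
    free_list = s;
    num_free++;
}
```

Raising the limit means that more short strings get both their header
and their buffer back without going through the allocator at all, at
the cost of some memory held by the free list.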

The only valid advantage I see with the PyVarObject patch
is the slightly simplified implementation for the standard
case. Given the number of disadvantages, that did not convince
me to change my -1 on the patch.

Regarding making a PyObject -> PyVarObject change in 3.0.1: that's
not a good idea, since it's not a bug fix, but rather a new feature
that also changes the C API significantly.

Regards,
-- 
Marc-Andre Lemburg
eGenix.com

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611

On 2008-09-08 00:55, Guido van Rossum wrote:
> On Sun, Sep 7, 2008 at 2:23 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> Guido van Rossum wrote:
>>> All in all, given the advantage (half the number of allocations) of
>>> the proposal I think there would have to be *very* good arguments
>>> against before we reject this outright. I'd like to understand
>>> Marc-Andre's reasons too.
>> As Stefan notes, because of the frequency with which strings are
>> manipulated in C code via PyString_* / PyUnicode_* calls, it is a data
>> type where "accept no substitutes" prevails.
>>
>> MAL's primary concern appears to be that having Unicode as a plain
>> PyObject leaves the type more open to subclass-based optimisations that
>> have been rejected for the builtin types themselves.
> 
> Hm. I don't have any particularly insightful imagination as to what
> those optimizations might be. Have any been implemented (in 3rd party
> code) in the 8 years that the Unicode object has existed?
> 
>> Having
>> PyString/PyBytes as PyVarObjects means that subclasses are more limited
>> in what they can do.
> 
> True.
> 
>> One possibility that occurs to me is to use a PyVarObject variant that
>> allocates space for an additional void pointer before the variable sized
>> section of the object. The builtin type would leave that pointer NULL,
>> but subtypes could perform the second allocation needed to populate it.
>>
>> The question is whether the 4-8 bytes wasted per object would be worth
>> the fact that only one memory allocation would be needed.
> 
> I believe that 4-8 bytes is more than the overhead of an extra memory
> allocation from the obmalloc heap. It is probably about the same as
> the overhead for a memory allocation from the regular malloc heap. So
> for short strings (of which there are often a lot) it would be more
> expensive; for longer objects it would probably work out just about
> the same.
> 
> There could be a different approach though, whereby the offset from
> the start of the object to the start of the character array wasn't a
> constant but a value stored in the class object. (In fact,
> tp_basicsize could probably be used for this.) It would slow down
> access to the characters a bit though -- a classic time-space
> trade-off that would require careful measurement in order to decide
> which is better.
>