question on string object handling in Python 2.7.8

Wed Dec 24 08:12:23 EST 2014

On 12/23/2014 08:28 PM, Dave Tian wrote:
> Hi,
>

Hi, please do some things when you post new questions:

1) identify your Python version.  In this case it makes a big 
difference, as in Python 2.x, range() builds the entire list up front, 
and that is the only thing that takes any noticeable time in this code.

2) when posting code, use cut 'n paste.  You retyped the code, which 
could have caused typos, and in fact did, since your email editor (or 
newsgroup editor, or whatever) decided to use 'smart quotes' instead of 
single quotes.  The Unicode characters shown in "Testing code" below 
include

    LEFT SINGLE QUOTATION MARK
and
    RIGHT SINGLE QUOTATION MARK

which are not valid Python syntax.
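
A rough way to check this for yourself: the sketch below writes the two quote characters as escapes (so the example itself stays plain ASCII) and asks the compiler whether it accepts each version of the statement.  The helper name compiles() is made up for illustration.

```python
import unicodedata

def compiles(source):
    """Return True if the compiler accepts source, False on SyntaxError."""
    try:
        compile(source, "<pasted>", "exec")
        return True
    except SyntaxError:
        return False

smart = u"a = \u2018h\u2019\n"   # what the mail editor produced
plain = u"a = 'h'\n"             # what was presumably meant

# Official Unicode names of the two offending characters
quote_names = [unicodedata.name(ch) for ch in u"\u2018\u2019"]

smart_ok = compiles(smart)   # the smart-quote version is rejected
plain_ok = compiles(plain)   # the plain-quote version is accepted
```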

> There are 2 statements:
> A: a = ‘h’
> B: b = ‘hh’
>
> According to my understanding, A should be faster as characters would shortcut this 1-byte string ‘h’ without malloc;

Nope, there's no such promise in Python.  If there were such an 
optimization, it might vary between one implementation of Python and 
another, and between one version and the next.

But it'd be very hard to implement such an optimization, since the C 
interface would then see it, and third party native libraries would have 
to have special coding for this one kind of object.

You're probably thinking of Java and C#, which have native data and 
boxed data (I don't recall just what each one calls it).  Python, at 
least for the last 15 years or so, makes everything an object, which 
means there are no special cases for us to deal with.
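
As a small sketch of what that means here (the function names are made up, and the co_consts detail is specific to CPython): both literals are ordinary str objects of the same type, and each one is stored once as a constant in the compiled code, so the assignment itself only binds a name to an object that already exists.

```python
import sys

def assign_short():
    a = 'h'
    return a

def assign_long():
    b = 'hh'
    return b

# Both values are plain str objects; there is no separate "char" type.
same_type = type(assign_short()) is type(assign_long())

# CPython detail: each literal sits in the function's constant table,
# created when the code was compiled, not each time the line runs.
short_in_consts = 'h' in assign_short.__code__.co_consts
long_in_consts = 'hh' in assign_long.__code__.co_consts

# Object sizes in bytes; the exact numbers depend on version and build.
sizes = (sys.getsizeof('h'), sys.getsizeof('hh'))
```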

> B should be slower than A as characters does not work for 2-byte string
> ‘hh’, which triggers the malloc. However, when I put A/B into a big loop
> and try to measure the performance using cProfile, B seems always faster
> than A.
> Testing code:
> for i in range(0, 100000000):
> 	a = ‘h’ #or b = ‘hh’
> Testing cmd: python -m cProfile test.py
>
> So what is wrong here? B has one more malloc than A but is faster than A?
>

In my testing, sometimes A is quicker, and sometimes B is quicker.  But 
of course there are many ways of testing it, and many versions to test 
it on.  I put those statements (after fixing the quotes) into two 
functions, and called the two functions, letting profile tell me which 
was faster.
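
A minimal sketch of that kind of test, not the exact script I ran: it uses timeit rather than a profiler, the function names are made up, and the loop count is much smaller than the original 100000000 so it finishes quickly.  Expect the numbers to jump around from run to run.

```python
import timeit

def assign_a(n):
    for i in range(n):
        a = 'h'

def assign_b(n):
    for i in range(n):
        b = 'hh'

def best_time(func, n, repeats=3):
    """Best wall-clock time, in seconds, of `repeats` calls to func(n)."""
    return min(timeit.repeat(lambda: func(n), number=1, repeat=repeats))

if __name__ == "__main__":
    n = 1000000  # assumption: small enough to run in well under a second
    print("A: %.4f s" % best_time(assign_a, n))
    print("B: %.4f s" % best_time(assign_b, n))
```

Taking the minimum of several repeats is a common way to reduce noise from whatever else the machine is doing.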

Incidentally, just putting them in functions cut the time by 
approximately 50%, probably because local variable lookup in a function 
is much faster in CPython than access to variables in globals().
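
One way to peek at the difference, assuming CPython: compile the same statement inside a function and at module level, then look at where the compiler filed the name.  The names inside_function and module_level are made up for this sketch.

```python
def inside_function():
    a = 'h'

# The same statement compiled as top-level (module) code
module_level = compile("a = 'h'\n", "<module-level>", "exec")

# Inside a function, 'a' gets a slot in the array of fast locals ...
local_names = inside_function.__code__.co_varnames

# ... while at module level it is a name stored through a dictionary.
global_names = module_level.co_names
module_locals = module_level.co_varnames
```

Here local_names contains 'a' and module_locals is empty, which matches the usual explanation: function locals are reached by index, module-level names by dictionary lookup.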

There are other things going on.  In any recent CPython implementation, 
certain strings will be interned, which can both save memory and avoid 
the constant thrashing of malloc and free.  So we might get different 
results by choosing a string which won't happen to get interned.
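
A small sketch of what interning does, using strings built at run time so the compiler cannot share them as constants.  The build() helper is made up; whether the two uninterned strings happen to be the same object is an implementation detail, so the code records that without relying on it.

```python
try:
    from sys import intern  # Python 3
except ImportError:
    pass  # Python 2: intern() is already a builtin

def build(parts):
    """Assemble a string at run time from a list of pieces."""
    return "".join(parts)

x = build(["hello", " ", "world"])
y = build(["hello", " ", "world"])

equal = (x == y)            # same contents
same_before = (x is y)      # implementation detail, do not rely on it

# After interning, equal strings are guaranteed to be one shared object.
xi = intern(x)
yi = intern(y)
same_after = (xi is yi)
```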

It's hard to get excited over any of these differences, but it is fun to 
think about it.

-- 
DaveA