[ python-Feature Requests-1087418 ] long int bitwise ops speedup (patch included)

SourceForge.net noreply at sourceforge.net
Fri Feb 11 04:45:59 CET 2005


Feature Requests item #1087418, was opened at 2004-12-18 00:22
Message generated for change (Comment added) made by gregsmith
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1087418&group_id=5470

Category: None
Group: None
Status: Open
Resolution: None
Priority: 4
Submitted By: Gregory Smith (gregsmith)
Assigned to: Nobody/Anonymous (nobody)
Summary: long int bitwise ops speedup (patch included)

Initial Comment:

The 'inner loop' for applying bitwise ops to longs is quite
inefficient.

The improvements in the attached diff are:
 - 'a' is never shorter than 'b' (result: only one loop
   index condition to test instead of three)
 - each operation ( & | ^ ) has its own loop, instead
   of a switch inside the loop
 - I found that, once this is done, a lot of things can
   be simplified, resulting in further speedup, and the
   resulting code is not much longer than before (my
   libpython2.4.dll .text got 140 bytes longer).

Operations on longs of a few thousand bits appear
to be 2 to 2.5 times faster with this patch.
I'm not 100% sure the code is right, but it passes
test_long.py, anyway.
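
For illustration only (this is not the attached patch), here is a rough
Python sketch of the restructuring described above. The real change is to
the C inner loop in Objects/longobject.c; the digits_of/from_digits helper
names are invented for the sketch, and only nonnegative values are handled.

SHIFT = 15                    # CPython 2.4 stores longs as 15-bit digits
MASK = (1 << SHIFT) - 1

def digits_of(n):
    """Split a nonnegative long into a little-endian list of 15-bit digits."""
    digs = []
    while True:
        digs.append(n & MASK)
        n >>= SHIFT
        if not n:
            return digs

def from_digits(digs):
    """Rebuild a long from a little-endian list of 15-bit digits."""
    n = 0
    for d in reversed(digs):
        n = (n << SHIFT) | d
    return n

def bitwise_or(x, y):
    a, b = digits_of(x), digits_of(y)
    if len(a) < len(b):
        a, b = b, a           # 'a' is never shorter than 'b': one loop bound
    z = [da | db for da, db in zip(a, b)]    # dedicated loop for '|'
    z.extend(a[len(b):])      # past the end of 'b', '|' is just a copy of 'a'
    return from_digits(z)

assert bitwise_or(2**40 + 5, 3) == (2**40 + 5) | 3

The same pattern gives separate loops for '&' and '^': for nonnegative
operands the '&' result stops at len(b), and the '^' tail is again a plain
copy of 'a', which is where the further simplifications come from.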



----------------------------------------------------------------------

>Comment By: Gregory Smith (gregsmith)
Date: 2005-02-10 22:45

Message:
Logged In: YES 
user_id=292741

I started by just factoring the switch out of the inner loop. But then
it becomes evident that when op == '^', you always have
maska == maskb, so there's no point in doing the ^mask at all.
And when op == '|', then maska == maskb == 0, so likewise.
And if you put in a check so that len(a) >= len(b), then the
calculation of len_z can be simplified. It also becomes easy
to break the end off the loops, so that, say, or'ing a small
number with a really long one becomes mostly a copy, etc.
It was just a series of small, simple changes following
from the refactoring of the loop/switch.
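
As a plain-Python illustration of the "mostly a copy" point (the values
here are arbitrary nonnegative examples, not taken from the patch): if b
fits below bit k, or'ing it in cannot touch any bit of a at or above k.

a = 3**3000          # a long of a few thousand bits
b = 12345            # a small long
k = 16               # any k with b < 2**k
assert b < (1 << k)
assert (a | b) >> k == a >> k                                  # high part: a plain copy of a
assert (a | b) & ((1 << k) - 1) == (a & ((1 << k) - 1)) | b    # only the low bits change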

I see a repeatable 1.5x speedup at 300 bits, which
I think is significant (I wasn't using negative numbers, which
of course have their own extra overhead). The difference
should be even higher on CPUs that don't have several
hundred milliwatts of branch-prediction circuitry.

One use case is that you can simulate an array
of hundreds or thousands of simple 1-bit processors
in pure Python using long operations, and get very
good performance, even better with this fix. That application
uses almost entirely logical ops, with the occasional shift.
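
A toy sketch of that use case (the sizes and names here are invented, not
taken from the application described): each bit position of a long holds
the 1-bit state of one simulated processor, so a single long operation
updates all of them at once.

N = 4096                           # number of simulated 1-bit processors
ALL = (1 << N) - 1                 # mask covering every processor lane

def step(state, enable, toggle):
    """One cycle: flip the enabled lanes, leave the others alone."""
    return (state ^ (toggle & enable)) & ALL

state = 0
toggle = int('10' * (N // 2), 2)   # alternating pattern, purely for the demo
state = step(state, ALL, toggle)   # every lane enabled
assert state == toggle             # enabled lanes took the toggle bit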


IMHO, the changed code is not more complex; it's a
little longer, but it's more explicit about what is really
being done, and it doesn't roll together three cases, which
don't really have that much in common, for the sake of
brevity. It wasn't obvious to me that the masks were
redundant until after I did the factoring, and that is my
point: rolling the cases together hides it. The original
author may not have noticed the redundancy.

I see a lot of effort being expended on very complex
multiply operations; why should the logical ops be left
behind for the sake of a few lines?

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2005-01-07 01:54

Message:
Logged In: YES 
user_id=80475

Patch Review
------------

On Windows using MSC 6.0, I could only reproduce a
small speedup at around 300 bits.

While the patch is short, it adds quite a bit of complexity
to the routine.  Its correctness is not self-evident or
certain.  Even if correct, it is likely to encumber future
maintenance.

Unless you have important use cases and feel strongly about
it, I think this one should probably not go in.

An alternative is to submit a patch that limits its scope to
factoring out the innermost switch/case. I tried that and
found that the speedup is microscopic. I suspect that the
one unpredictable branch is not much of a bottleneck; more
time is likely spent on creating z.

----------------------------------------------------------------------

Comment By: Gregory Smith (gregsmith)
Date: 2005-01-03 14:54

Message:
Logged In: YES 
user_id=292741

I originally timed this on a Cygwin system; I've since found
that Cygwin timings tend to be strange and possibly
misleading. On a RH8 system, I'm seeing a 3.5x speedup with
longs of ~1500 bits and larger, and a 1.5x speedup with only
about 300 bits. Times were measured with timeit.Timer(
'a|b', 'a=...; b=...').
The increase in .text size is likewise about 120 bytes.
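
For reference, a timing harness in that style might look like the
following (Python 2 syntax to match the era; the operand values are
illustrative stand-ins, since the ones actually used above are elided).

import timeit

setup = "a = 3**950; b = 7**550"               # roughly 1500-bit operands
t = timeit.Timer('a | b', setup)
print min(t.repeat(repeat=3, number=100000))   # best of three runs, in seconds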



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1087418&group_id=5470


More information about the Python-bugs-list mailing list