speeding up string.split()

Fri May 25 09:48:08 EDT 2001

Carlos Ribeiro <cribeiro at mail.inet.com.br> wrote in
news:mailman.990792040.31938.python-list at python.org: 

>>Speedups to the above code:
>>1. The variable array is not used after it is assigned, and the
>>assignment is constant. Factor the assignment out of the loop.
>>2. After 1, the loop is empty; remove the loop.
>>1 and 2 together provide a massive speed improvement with no loss of
>>functionality to the code as given.
> 
> It seems to me that this is an "artificial" loop, done for the sole
> reason to allow timing measurement. This is not slowing things down
> because what matters is the time per iteration.
> 
I probably didn't make my point clearly enough. For a loop like the one
given to appear in real code, the lines would have to come from somewhere,
and you would have to do something with the data after splitting them
(otherwise what is the point?). Given that, there is little point in
micro-optimising the example loop if a major change to the algorithm could
have a much more significant effect. Without further information, though,
we cannot tell which optimisations are worthwhile.

If the real code doesn't loop at all, then there is little point doing any 
optimisations. Even as written the code takes only 10 seconds on my laptop. 
The original poster said he would use map in the real program, so I assume 
that the code is in fact being repeated many times.

>>Alternatively:
>>3. Put the code inside a function.
> 
> This one deserves some explanation. There are some little optimizations
> that Python does inside functions to access local variables. However,
> it seems that this is not the issue here.

Simply putting the code into a function, with no other changes, gives an
11% speed improvement on my machine. Your mileage may vary.
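
A minimal sketch of the comparison in (3), assuming a modern Python 3
interpreter rather than the 2001 one used for the figures quoted here. The
names LINE, N, time_module_level and time_in_function are made up for the
illustration, and LINE is a stand-in, not the original data. The first timing
runs the loop the way module-level code runs (names looked up in a
dictionary); the second runs the same loop with local variables.

```python
import time

# Hypothetical stand-in for one input line; not the poster's real data.
LINE = "aaa bbb 6 1064 80 54 54 1 1 14:00:00.8094 14:00:00.8908 1 2"
N = 20000  # kept small so the sketch finishes quickly

# Compiled in "exec" mode, so `i` and `fields` behave like module globals.
MODULE_CODE = compile(
    "for i in range(N):\n"
    "    fields = LINE.split()\n",
    "<module-level loop>", "exec")


def time_module_level():
    """Time the loop as module-level code (dictionary name lookups)."""
    namespace = {"N": N, "LINE": LINE}
    start = time.perf_counter()
    exec(MODULE_CODE, namespace)
    return time.perf_counter() - start


def time_in_function(n=N, line=LINE):
    """Time the same loop with `i`, `fields`, `n`, `line` as locals."""
    start = time.perf_counter()
    for i in range(n):
        fields = line.split()
    return time.perf_counter() - start


t_module = time_module_level()
t_function = time_in_function()
print("module level: %.4fs" % t_module)
print("in function : %.4fs" % t_function)
```

The size of the gap depends on the interpreter version and the machine, so
treat any number you get as specific to your own setup.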

> 
>>4. Use the split method on the string instead of the split function
> 
> Hummm. Again the same; it does not seem to make much difference.

Another 11% on my machine. 3 and 4 together make a 22% difference.

> 
>>5. Use string literal concatenation (adjacent literals) instead of '+'
>>3, 4 and 5 together knock about 25% off the running time.
> 
> I think that (5) alone causes most of the difference, because + is
> evaluated at runtime.

Did you try timing it? (5) makes a much smaller difference than either (3)
or (4).
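
To make (5) concrete, here is a small sketch, assuming Python 3, of what
"string concatenation instead of '+'" means: two adjacent string literals
are joined by the compiler into one constant. The variable names are made up
for the illustration. Note that current CPython releases typically fold a
'+' between two constant strings at compile time as well, so the small
difference measured in 2001 may not show up at all today.

```python
# Adjacent literals: joined at compile time into a single string constant.
adjacent = ('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 6'
            '1064 80')

# Explicit '+': the same resulting string.
added = 'xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 6' + '1064 80'

print(adjacent == added)

# No separator is inserted between adjacent literals, so '6' and '1064'
# run together here unless one of the pieces carries a space.
print(adjacent.split()[-2])

# The compiled code object holds the joined text as one constant.
code = compile("s = 'ab' 'cd'", "<adjacent literals>", "exec")
print('abcd' in code.co_consts)
```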

>>6. If whatever you intend to do with the data involves filtering it on
>>the first field or two, then using "xxx...".split(' ', 1) is very much
>>faster than splitting up all the fields. This can reduce the time by
>>two thirds easily. 
> 
> Good hint. This also makes a difference - don't do all the work if you
> really need only part of it.

Which brings us back to my original point: we need far more information
about the real use for the data, rather than spending time on a loop that
has no actual effect.
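
A sketch of the idea in (6), assuming Python 3 and assuming the real program
filters on the first field. The sample line, the addresses in it and the
helper name fields_if_wanted are all hypothetical. Only the first field is
split off; the rest of the line is split only for records that pass the
filter.

```python
# Hypothetical log line, for illustration only.
line = ("10.0.0.1 10.0.0.2 6 1064 80 54 54 1 1 "
        "14:00:00.8094 14:00:00.8908 1 2")

# maxsplit=1: one cut at the first space, the remainder stays unsplit.
first, rest = line.split(' ', 1)
print(first)


def fields_if_wanted(line, wanted_source):
    """Return all fields if the first one matches, otherwise None.

    Assumes the line contains at least one space.
    """
    first, rest = line.split(' ', 1)
    if first != wanted_source:
        return None               # rejected without splitting the rest
    return [first] + rest.split()


print(fields_if_wanted(line, "10.0.0.9"))
print(len(fields_if_wanted(line, "10.0.0.1")))
```

How much this saves depends on how many lines the filter rejects, which is
exactly the information missing from the original question.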

> 
>>7. Use Perl, or C, or whatever else takes your fancy if speed is that
>>critical. 
> 
> You could also try to use the re module. This has *several* advantages.
> The code is highly optimized and is Unicode aware. You can in a single
> step *both* break the string and check if the parts are valid, so
> anything invalid is automatically detected.

Balanced against these advantages: the code is *much* less readable (always
do the simplest, most readable thing first, then optimise iff you have to);
the code is also substantially slower.
The point about validation is good, although the original string doesn't
match your pattern.
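
A sketch of the validation point, assuming Python 3. The pattern is the one
used in resplit() in the code in this message; the helper name parse and the
variable names are made up for the illustration. A line with the 'xxx'
placeholders fails the match, while a line with digits in those positions
yields all the fields in one step.

```python
import re

PATTERN = re.compile(
    r'(\d+\.\d+\.\d+\.\d+)\s+(\d+\.\d+\.\d+\.\d+)\s+'
    r'(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+'
    r'(\d\d:\d\d:\d\d\.\d\d\d\d)\s+'
    r'(\d\d:\d\d:\d\d\.\d\d\d\d)\s+'
    r'(\d+)\s+(\d+)')


def parse(line):
    """Return the tuple of fields, or None if the line is not valid."""
    m = PATTERN.match(line)
    return m.groups() if m else None


placeholder = ('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 6 '
               '1064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2')
numeric = ('111.111.111.111 222.222.222.222 6 '
           '1064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2')

print(parse(placeholder))   # 'xxx' is not \d+, so the line is rejected
print(parse(numeric))
```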

FWIW, the code I used (subject to the vagaries of line wrapping) is:
---------------------------
# Python 2 code: uses the print statement, string.split and time.clock.
import string
import time
from string import split

# Baseline: string.split function and '+', at module level.
start = time.clock()
for i in range(300000):
    array = split('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 6' +
                  '1064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2')
stend = time.clock()
splittime = stend-start
print "split  :", stend-start

# Points 3, 4 and 5: inside a function, split method, adjacent literals.
def split1():
    start = time.clock()
    for i in range(300000):
        array = ('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 6'
                 '1064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2').split()
    stend = time.clock()

    print "split1 :", stend-start, "%2.0f%%" % (
        (1-(stend-start)/splittime)*100)

split1()

# Point 6: split off only the first field.
def split2():
    start = time.clock()
    for i in range(300000):
        array = ('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 6'
                 '1064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2'
                 ).split(' ', 1)
    stend = time.clock()

    print "split2 :", stend-start, "%2.0f%%" % (
        (1-(stend-start)/splittime)*100)

split2()

import re

# Regular expression version: splits and validates in one step.
def resplit():
    start = time.clock()
    r = re.compile(r'(\d+\.\d+\.\d+\.\d+)\s+(\d+\.\d+\.\d+\.\d+)\s+'
            r'(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+'
            r'(\d\d:\d\d:\d\d\.\d\d\d\d)\s+'
            r'(\d\d:\d\d:\d\d\.\d\d\d\d)\s+'
            r'(\d+)\s+(\d+)')

    for i in range(300000):
        array = r.match('111.111.111.111 222.222.222.222 6 '
                 '1064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2'
                 ).groups()
    stend = time.clock()

    print "resplit:", stend-start, "%2.0f%%" % (
        (1-(stend-start)/splittime)*100)

resplit()
---------------------------

-- 
Duncan Booth                                             duncan at rcp.co.uk
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?