speeding up string.split()

Chris Green cmg at uab.edu
Fri May 25 09:49:47 EDT 2001


duncan at NOSPAMrcp.co.uk (Duncan Booth) writes:

> Chris Green <cmg at uab.edu> wrote in 
> news:m2n182cs9c.fsf at phosphorus.tucc.uab.edu:
> 
> > Is there any way to speed up the following code?

> You haven't given much to go on here. Any real speedups are likely to 
> depend very much on what you want to do with the data after you have split 
> it.

Sorry about that.  I had factored out the portion that I thought was
slowing me down.  So that all possibilities can be considered, let me
describe the nature of the data.

Using ipaudit (http://ipaudit.sourceforge.net), I get gzip'd files
containing details on internet traffic.  Each hour on my connection
yields roughly 330,000 lines of the format below.  I have done the
same set of things reading straight from a gzip'd file, and the speed
optimization there was to use readlines() rather than reading line by
line.
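
For reference, a rough sketch of that read step (the file name is just
an example, not my real naming scheme):

import gzip

def read_records(path):
    # Reading everything at once with readlines() was noticeably
    # faster for me than looping over the gzip stream line by line.
    f = gzip.open(path)
    lines = f.readlines()
    f.close()
    return lines

lines = read_records('ipaudit-sample.txt.gz')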

I wish to be able to define searches on each of the 13 data fields;
breaking each line up into a dict keyed by field seemed like the
cleanest way (not the fastest).  I'm willing to rely on ``good'' input
data from the backend.
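
Roughly, the dict-per-record step looks like the sketch below.  The
field names are placeholders I made up for illustration, not the real
ipaudit column names:

FIELDS = ('src_ip', 'dst_ip', 'proto', 'src_port', 'dst_port',
          'src_bytes', 'dst_bytes', 'src_pkts', 'dst_pkts',
          'first_time', 'last_time', 'src_first', 'dst_first')

def parse(line):
    # One dict per line, keyed by field name -- clean but not fast.
    parts = line.split()
    record = {}
    for i in range(len(FIELDS)):
        record[FIELDS[i]] = parts[i]
    return record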

Disappointed by the speed of splitting and then creating a dict, I
got rid of the dict step, since I knew the data was positional in
nature, and then discovered that the split itself was the bottleneck.
I might be able to use split and map to do my bidding and will try
that, but I'm not very hopeful.
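
The positional version is just index lookups on the split result,
something like this (positions picked for illustration):

SRC_IP, DST_IP = 0, 1   # assumed positions of the two address fields

def matches(line, wanted_ip):
    # Filter on a couple of fields without building a dict at all.
    parts = line.split()
    return parts[SRC_IP] == wanted_ip or parts[DST_IP] == wanted_ip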


The string concatenation operation is purely artificial to make
posting on usenet fit in 72 cols.


> > #!/usr/bin/python
> > from string import split
> > 
> > for i in range(300000):
> >     array = split('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 6' +  
> >                   '1064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2')
> > 

[ only applies to my 3 line example and not the real problem ]

> 
> Alternatively:
> 3. Put the code inside a function.

I will try.  My first example called split('xxx ....') at module
level, but you are suggesting that the operation is bottlenecked there
because code at module level can't be optimized the way it can inside
a def.

I moved the entire loop into a def speed_test(): and then called
speed_test() from main.  Looking at just the user times, it seems like
a small speedup in my unscientific testing.

inside function:
./speed.py  9.04s user 0.07s system 99% cpu 9.155 total
./speed.py  9.22s user 0.03s system 98% cpu 9.414 total

outside function:
./speed.py  9.30s user 0.02s system 100% cpu 9.312 total
./speed.py  9.29s user 0.03s system 100% cpu 9.312 total
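
For what it's worth, the "inside function" version is just the
original loop wrapped in a def, roughly:

from string import split

def speed_test():
    # Same loop as before; the only change is that it now runs in
    # local scope instead of at module level.
    for i in range(300000):
        array = split('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 6' +
                      '1064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2')

speed_test()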

> 4. Use the split method on the string instead of the split function

This did give a noticeable improvement but limits me to Python 2.0+,
AFAICT.
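
That is, calling the method on the string itself instead of going
through the string module:

line = 'xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 6 1064 80 54 54 1 1 14:00:00.8094 14:00:00.8908 1 2'

# string module function (works on older Pythons):
from string import split
fields = split(line)

# string method, added in Python 2.0 -- measurably faster here:
fields = line.split()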

> 5. Use string concatenation instead of '+'
> 3, 4 and 5 together knock about 25% off the running time.
> 
> 6. If whatever you intend to do with the data involves filtering it on the 
> first field or two, then using "xxx...".split(' ', 1) is very much faster 
> than splitting up all the fields. This can reduce the time by two thirds 
> easily.

Yes, this will work too, and is what the author of the program that
produces the data does (using zgrep to filter based on IP).  It's
proven to be very useful data to mine for traffic patterns (esp. on
more than just src/dest IP), and I've got a Perl implementation
already that is at the limits of acceptable for interactive use
(CGI).
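
For the cases where the filter really is on the first field or two,
the limited split looks like this (the address is just the placeholder
from the example above):

line = 'xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 6 1064 80 54 54 1 1 14:00:00.8094 14:00:00.8908 1 2'

# Split off only the first field; the rest stays in one piece.
src_ip, rest = line.split(' ', 1)
if src_ip == 'xxx.xxx.xxx.xxx':
    # only now pay for splitting the remaining fields
    fields = rest.split()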
> 
> 7. Use Perl, or C, or whatever else takes your fancy if speed is that 
> critical.

I probably will have to use C to do the speed-critical filtering and
Python to build an abstraction on top of that, taking the original
approach.

Thanks for your speedup tips.
-- 
Chris Green <cmg at uab.edu>
"Yeah, but you're taking the universe out of context."


