speeding up string.split()

Fredrik Aronsson d98aron at dtek.chalmers.se
Fri May 25 16:40:34 EDT 2001


In article <m2n182cs9c.fsf at phosphorus.tucc.uab.edu>,
	Chris Green <cmg at uab.edu> writes:
> Is there any way to speed up the following code?  Speed doesn't matter
> terribly much to me but this seems to be a fair example of what I need
> to do.
> 
> In the real data, I will be able to have a large array and use
> map rather than do it line by line but, I doubt this will change
> things much for the better.

In my case, it actually made it worse... and the reason is probably
that map and list comprehensions build a new list.
So, if you are only going to extract data and not build a new list,
it's probably faster with normal indexing.
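
To illustrate the point about building a new list, here is a minimal sketch in modern Python 3 (not the code that was timed). The sample data and the function names count_fields_loop and count_fields_listcomp are made up for illustration. Note that in Python 1.5.2 and 2.x map returns a complete list, while in Python 3 it returns a lazy iterator, so there the comparison applies mainly to list comprehensions.

```python
# Hypothetical sample data, just for illustration.
lines = ['a b c', 'd e f', 'g h i']

def count_fields_loop(lines):
    """Split each line in turn; only one split result is alive at a time."""
    total = 0
    for line in lines:
        total += len(line.split())
    return total

def count_fields_listcomp(lines):
    """Build the complete list of split results first, then walk over it."""
    total = 0
    for fields in [line.split() for line in lines]:
        total += len(fields)
    return total

# Both give the same answer; the second one allocates an extra list
# holding every split result before the loop body runs.
print(count_fields_loop(lines), count_fields_listcomp(lines))
```

Whether the extra list matters in practice depends on the data size and the Python version, so it is worth measuring on your own data.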

> I've tried w/ python 2 and 1.5.2 and the differences between perl and
> python remain huge ( about 5s:1s python:perl ).

Yep, perl is optimized for text processing. 
From "man perl":
     Perl is a language optimized  for  scanning  arbitrary  text
     files,  extracting  information  from  those text files, and
     printing reports based on that information....

> 
> The string is +'d together for usenet purposes
> 
> #!/usr/bin/python
> from string import split
> 
> for i in range(300000):
>     array = split('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 6' +  
>                   '1064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2')
[perl snipped]

The best speedup I saw came from using string methods, which cut the
time to about half. YMMV.
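
A small sketch of the string-method form, written for modern Python 3 where the method is the only spelling left (the string.split function is gone). It also shows one thing to watch for when passing an explicit ' ' separator: the sample line contains a double space, so the two calls do not return the same fields. The variable names are only for illustration.

```python
# The sample line from the original post (note the two spaces after "80").
line = ('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 61064 80  54 54 1 1 '
        '14:00:00.8094 14:00:00.8908 1 2')

# No argument: split on runs of whitespace, so the double space is one gap.
fields = line.split()

# Explicit single space: every space is a separator, so the double space
# produces an empty string between '80' and '54'.
fields_space = line.split(' ')

print(len(fields), len(fields_space))
print(fields_space[3:6])
```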

/Fredrik

-- results --

running on a constant string 100000 times...

original 4.63118143082 [4.543, 4.645, 4.638, 4.670, 4.659]
in_func 4.38402023315 [4.317, 4.408, 4.414, 4.390, 4.391]
splitting_on_space 4.64088983536 [4.643, 4.647, 4.640, 4.642, 4.633]

running on a real 10000 item array...

normal_for 15.3752954006 [15.440, 15.350, 15.388, 15.346, 15.353]
index_for 16.7716764212 [16.876, 16.843, 16.685, 16.771, 16.683]
index_for_using_xrange 16.8590580225 [16.830, 16.781, 16.769, 16.780, 17.135]
index_for_local_var 15.8590993881 [15.728, 15.895, 15.892, 15.883, 15.896]
map_split 22.4902464151 [22.262, 22.339, 22.727, 23.051, 22.073]
map_split_local_var 22.2637700081 [22.089, 22.436, 22.830, 21.799, 22.166]
string_method 7.49720318317 [7.569, 7.486, 7.481, 7.481, 7.469]
list_comp_split 19.7473443985 [19.551, 19.909, 20.321, 19.301, 19.653]

-- code (not usenet friendly... long lines) --

from time import time
# probably better to use the profile module... but this is simple.

from string import split

# create long list... (wrap the line in a list so that multiplying gives
# 10000 items rather than one long string)
large_list = ['xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 61064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2'] * 10000

# Functions

def in_func():
    for i in range(100000):
        array = split('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 61064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2')

def splitting_on_space():
    for i in range(100000):
        array = split('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 61064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2',' ')

def normal_for():
    for i in large_list:
        array = split(i)

def index_for():
    for i in range(len(large_list)):
        array = split(large_list[i])

def index_for_using_xrange():
    for i in xrange(len(large_list)):
        array = split(large_list[i])

def index_for_local_var():
    mylist = large_list
    mysplit = split
    for i in range(len(mylist)):
        array = mysplit(mylist[i])

def map_split():
    for array in map(split,large_list):
        pass

def map_split_local_var():
    mysplit = split
    mylist = large_list    
    for array in map(mysplit,mylist):
        pass

def string_method():
    for i in large_list:
        array = i.split()

def list_comp_split():
    for array in [l.split() for l in large_list]:
        pass

funcs = [in_func,splitting_on_space,
         normal_for,index_for,index_for_using_xrange,index_for_local_var,
         map_split,map_split_local_var,string_method,list_comp_split]

#Timings...

def avg(values):
    # arithmetic mean; parameter renamed so it does not shadow the built-in list
    from operator import add
    return reduce(add,values)/len(values)

times = {}
def process_time(name,diff):
    # use the diff argument instead of relying on the globals end and start
    times.setdefault(name,[]).append(diff)
    print name, avg(times[name]),
    print "[" + ", ".join(["%.3f" % t for t in times[name]]) + "]"    

for l in range(5):
    print "\n\nRun no.",l+1

    start = time()
    for i in range(100000):
        array = split('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 61064 80  54 54 1 1 14:00:00.8094 14:00:00.8908 1 2')
    end = time()
    process_time("original",end-start)

    for func in funcs:
        start = time()
        func()
        end = time()
        process_time(func.func_name,end-start)
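
For readers on a newer Python: the hand-rolled timing loop above could also be written with the standard timeit module, which was added in Python 2.3, after this post. The following is only a sketch in Python 3 and is not the code that produced the results above. LINES and best_of are names made up for illustration, and only two of the variants are shown.

```python
import timeit

LINE = ('xxx.xxx.xxx.xxx yyy.yyy.yyy.yyy 61064 80  54 54 1 1 '
        '14:00:00.8094 14:00:00.8908 1 2')
LINES = [LINE] * 10000  # hypothetical stand-in for the real data

def string_method():
    # Split each line with the str method, discarding the result.
    for line in LINES:
        fields = line.split()

def list_comp_split():
    # Build the full list of split results, then iterate over it.
    for fields in [line.split() for line in LINES]:
        pass

def best_of(func, repeat=5, number=10):
    """Return the fastest of `repeat` timings, each covering `number` calls."""
    return min(timeit.repeat(func, repeat=repeat, number=number))

if __name__ == '__main__':
    for func in (string_method, list_comp_split):
        print(func.__name__, '%.3f' % best_of(func))
```

Taking the minimum rather than the average is a common choice with timeit, since slower runs mostly reflect other activity on the machine.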


