improved search algorithm

MRAB google at mrabarnett.plus.com
Mon May 4 14:45:23 EDT 2009


grahamdick77 at gmail.com wrote:
> Hi
> 
> I have an Excel file that is read into Python (8000 rows).
> 
> import random
> from csv import reader, writer
> incsv = reader(open(MY_FILE), dialect='excel')
> keys = incsv.next()
> 
> There are mixed data types.
> 
> The last column contains a cumulative frequency, sorted in ascending
> order from 0.0000 to 1.0000 over the 8000 rows.
> 
> In a loop of 100,000 iterations I want to draw a new random number
> each time and find the first row whose cumulative frequency value
> exceeds it.
> 
> Here's my current (pseudo)code:
> 
> for _ in xrange(100000):
>     myRand = random.random()
>     # note: a csv reader is a one-shot iterator, so this rescan
>     # only works if the rows have been stored in a list first
>     for line in incsv:
>         if float(line[-1]) > myRand:
>             resline = []
>             for item in line:
>                 # convert each field to an int or float where
>                 # possible, otherwise keep it as a string
>                 try:
>                     i = int(item)
>                 except ValueError:
>                     try:
>                         i = float(item)
>                     except ValueError:
>                         i = item
>                 resline.append(i)
>             # here we construct a dict of pair values:
>             # {'ID': 18, ...}
>             res = dict(zip(keys, resline))
>             break
> 
>     # do some stuff with res
> 
> 
> 
> 
> I'm scanning over every line of the CSV to decide which row to
> select, 100,000 times over, and this is just not very efficient.
> 
> How can I improve this code?
> for line in incsv:
>     if float(line[-1]) > random.random():
> 
> I can use numpy etc.; whatever works.
> 
Here's a suggestion:

Construct the dicts for all the rows, stored in a list.

Construct a list of just the cumulative frequencies.

For each random value, use the bisect module to search for the value in
the cumulative frequencies list. This will return an index which you can
then use on your list of dicts.
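
For example, here's a minimal, untested sketch of that approach. It
reuses the Python 2 idioms from your post (reader.next(), xrange) and
assumes the 8000 rows fit comfortably in memory; convert, rows and
cumfreqs are just illustrative names, and MY_FILE is your path from
before.

import random
from bisect import bisect_right
from csv import reader

incsv = reader(open(MY_FILE), dialect='excel')
keys = incsv.next()

def convert(item):
    # Turn a CSV field into an int or float where possible,
    # otherwise leave it as a string.
    for cast in (int, float):
        try:
            return cast(item)
        except ValueError:
            pass
    return item

# Read the file once: a list of row dicts, plus a parallel list of
# just the cumulative frequencies (already sorted ascending).
rows = [dict(zip(keys, [convert(item) for item in line]))
        for line in incsv]
cumfreqs = [row[keys[-1]] for row in rows]

for _ in xrange(100000):
    myRand = random.random()
    # bisect_right returns the index of the first entry > myRand,
    # which is exactly the row your linear scan picks, but in
    # O(log n) comparisons instead of O(n).
    index = bisect_right(cumfreqs, myRand)
    res = rows[index]
    # do some stuff with res

bisect_right's semantics match your float(line[-1]) > myRand test, and
because random.random() returns values in [0.0, 1.0) while the last
cumulative frequency is 1.0000, the index is always within range.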


