Is there a faster way to do this?

Gary Herron gherron at islandtraining.com
Tue Aug 5 12:45:42 EDT 2008


ronald.johnson at gmail.com wrote:
> I have a csv file containing product information that is 700+ MB in
> size. I'm trying to go through and pull out unique product ID's only
> as there are a lot of multiples. My problem is that I am appending the
> ProductID to an array and then searching through that array each time
> to see if I've seen the product ID before. So each search takes longer
> and longer. I let the script run for 2 hours before killing it and had
> only run through less than 1/10 of the file.
>   

Store your IDs in a dictionary or a set.  Then test for the existence 
of a new ID in that set.  That test will be *much* more efficient than 
searching a list.  (It uses a hashing scheme, so a lookup does not have 
to scan every element the way a list search does.)
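
To get a rough feel for the difference between the two membership tests, here is a small timing sketch.  The data is made up (20,000 invented IDs, not the real file), the names are only for illustration, and the numbers will vary from machine to machine:

```python
import timeit

# Hypothetical data: 20,000 made-up product IDs stored as strings.
ids_list = [str(n) for n in range(20000)]
ids_set = set(ids_list)

# Worst case for the list: the ID we look for sits at the very end.
needle = "19999"

def in_list():
    return needle in ids_list   # compares against elements one by one

def in_set():
    return needle in ids_set    # a single hash lookup

# Run each test 200 times and report the total time for each.
list_seconds = timeit.timeit(in_list, number=200)
set_seconds = timeit.timeit(in_set, number=200)

print("list: %.6f s   set: %.6f s" % (list_seconds, set_seconds))
```

The list test gets slower as the list grows, which matches the "each search takes longer and longer" behaviour described in the question; the set test stays roughly flat.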


IDs = set()
for row in ...:
   ID = extractIdFromRow(row)
   if ID not in IDs:
      IDs.add(ID)
      ... whatever ...
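
Filled in for the script quoted below, the loop above might look something like this.  It is only a sketch: it is written for Python 3, it assumes (as the quoted code does) that the ID is the third comma-separated field, and the function name, the `id_column` parameter and the sample lines are made up for illustration.  A plain `split(",")` will also misread fields that contain quoted commas.

```python
import io

def write_first_occurrences(lines, out, id_column=2):
    """Write each line whose ID has not been seen yet.

    Returns (lines read, lines written).
    """
    seen = set()
    read = written = 0
    for line in lines:
        read += 1
        product_id = line.split(",")[id_column]
        if product_id not in seen:      # hash lookup instead of a list scan
            seen.add(product_id)
            out.write(line)
            written += 1
    return read, written

# Made-up sample standing in for the real input file.
sample = ["a,b,100,x\n", "c,d,200,y\n", "e,f,100,z\n"]
out = io.StringIO()
counts = write_first_occurrences(sample, out)
print(counts)            # (3, 2)
print(out.getvalue())    # the first two sample lines
```

With real files, `lines` and `out` would be the opened input and output file objects in place of the sample list and the `StringIO`.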


In fact, if *all* you are doing is trying to identify all product IDs 
that occur in the file (no matter how many times they occur), you can 
skip the test:

IDs = set()
for row in ...:
   ID = extractIdFromRow(row)
   IDs.add(ID)

and your set will contain *one* copy of each ID added, no matter how 
many times it was added.


Better yet, if you can write your ID extraction as a generator or list 
comprehension...

IDs = set(extractIdFromRow(row) for row in rowsOfTable)

or some such would be most efficient.
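
As one possible way to do that, here is a sketch using the standard csv module to produce the rows.  It is written for Python 3; `row[id_column]` stands in for extractIdFromRow and assumes the ID is the third field, as in the quoted script, and the function name and sample data are invented for the example:

```python
import csv
import io

def unique_ids(rows, id_column=2):
    """Collect the distinct values found in one column of the rows."""
    return set(row[id_column] for row in rows)

# Made-up sample standing in for the real 700+ MB file.
sample = io.StringIO("a,b,100,x\nc,d,200,y\ne,f,100,z\n")
ids = unique_ids(csv.reader(sample))
print(sorted(ids))    # ['100', '200']
```

Because the generator expression feeds the set one row at a time, only the set of distinct IDs is held in memory, never the whole file.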


Gary Herron



> Here's the code:
> import string
>
> def checkForProduct(product_id, product_list):
>     for product in product_list:
>         if product == product_id:
>             return 1
>     return 0
>
>
> input_file="c:\\input.txt"
> output_file="c:\\output.txt"
> product_info = []
> input_count = 0
>
> input = open(input_file,"r")
> output = open(output_file, "w")
>
> for line in input:
>     break_down = line.split(",")
>     product_number = break_down[2]
>     input_count+=1
>     if input_count == 1:
>         product_info.append(product_number)
>         output.write(line)
>         output_count = 1
>     if not checkForProduct(product_number,product_info):
>         product_info.append(product_number)
>         output.write(line)
>         output_count+=1
>
> output.close()
> input.close()
> print input_count
> print output_count
> --
> http://mail.python.org/mailman/listinfo/python-list
>   



