Is there a faster way to do this?

Gary Herron gherron at islandtraining.com
Tue Aug 5 13:02:34 EDT 2008


Avinash Vora wrote:
> On Aug 5, 2008, at 10:00 PM, ronald.johnson at gmail.com wrote:
>
>> I have a csv file containing product information that is 700+ MB in
>> size. I'm trying to go through and pull out unique product IDs only
>> as there are a lot of multiples. My problem is that I am appending the
>> ProductID to an array and then searching through that array each time
>> to see if I've seen the product ID before. So each search takes longer
>> and longer. I let the script run for 2 hours before killing it and had
>> only run through less than 1/10 of the file.
>
> Why not split the file into more manageable chunks, especially as it 
> seems to be just plain text?
>
>> Heres the code:
>> import string
>>
>> def checkForProduct(product_id, product_list):
>>    for product in product_list:
>>        if product == product_id:
>>            return 1
>>    return 0
>>
>>
>> input_file="c:\\input.txt"
>> output_file="c:\\output.txt"
>> product_info = []
>> input_count = 0
>>
>> input = open(input_file,"r")
>> output = open(output_file, "w")
>>
>> for line in input:
>>    break_down = line.split(",")
>>    product_number = break_down[2]
>>    input_count+=1
>>    if input_count == 1:
>>        product_info.append(product_number)
>>        output.write(line)
>>        output_count = 1
>
> This seems redundant.
>
>>    if not checkForProduct(product_number,product_info):
>>        product_info.append(product_number)
>>        output.write(line)
>>        output_count+=1
>
> File writing is extremely expensive.  In fact, so is reading.  Think 
> about reading the file in whole chunks.  Put those chunks into Python 
> data structures, and build your output in Python data structures as 
> well.

Don't bother yourself with this suggestion about reading in chunks -- 
Python already does this for you, and does so more efficiently than you 
could.  The code
  for line in open(input_file,"r"):
reads in large chunks (efficiently) and then serves up the contents 
line-by-line.
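
In fact, if you combine that with the set/dictionary suggestion quoted 
below, the whole job becomes a single pass over the file.  Something 
like this untested sketch (file names taken from the original post; the 
variable names are my own):

  seen = set()                          # set membership is a hash lookup
  infile = open("c:\\input.txt", "r")
  outfile = open("c:\\output.txt", "w")
  for line in infile:                   # buffered and served line-by-line
      product_number = line.split(",")[2]
      if product_number not in seen:    # O(1), no matter how many IDs
          seen.add(product_number)
          outfile.write(line)
  infile.close()
  outfile.close()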

Gary Herron


> If you use a dictionary and look the IDs up there, you'll see a real 
> speed improvement, since Python performs a dictionary lookup far more 
> quickly than it searches a list.  Then, output your data all at once at 
> the end.
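
A plain set gives the same constant-time behavior when all you need is 
membership testing -- both are backed by a hash table.  A tiny, untested 
illustration (the sizes are made up):

  ids = [str(n) for n in range(100000)]
  id_set = set(ids)

  "99999" in ids      # list: scans element by element, O(n)
  "99999" in id_set   # set: one hash lookup, O(1) on average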
>
> -- 
> Avi
>



