Generator Expressions and CSV

MRAB python at mrabarnett.plus.com
Fri Jul 17 17:31:30 EDT 2009


Zaki wrote:
> On Jul 17, 2:49 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
>> Zaki wrote:
>>> Hey all,
>>> I'm really new to Python and this may seem like a really dumb
>>> question, but basically, I wrote a script to do the following, however
>>> the processing time/memory usage is not what I'd like it to be. Any
>>> suggestions?
>>> Outline:
>>> 1. Read tab delim files from a directory, files are of 3 types:
>>> install, update, and q. All 3 types contain ID values that are the
>>> only part of interest.
>>> 2. Using set() and set.add(), generate a list of unique IDs from
>>> install and update files.
>>> 3. Using the set created in (2), check the q files to see if there are
>>> matches for IDs. Keep all matches, and add any non-matches (which only
>>> occur once in the q file) to a queue of lines to be removed from the q
>>> files.
>>> 4. Remove the lines in the q for each file. (I haven't quite written
>>> the code for this, but I was going to implement this using csv.writer
>>> and rewriting all the lines in the file except for the ones in the
>>> removal queue).
>>> Now, I've tried running this and it takes much longer than I'd like. I
>>> was wondering if there might be a better way to do things (I thought
>>> generator expressions might be a good way to attack this problem, as
>>> you could generate the set, and then check to see if there's a match,
>>> and write each line that way).
>> Why are you checking and removing lines in 2 steps? Why not copy the
>> matching lines to a new q file and then replace the old file with the
>> new one (or, maybe, delete the new q file if no lines were removed)?
> 
> That's what I've done now.
> 
> Here is the final code that I have running. It's very much 'hack' type
> code and not at all efficient or optimized and any help in optimizing
> it would be greatly appreciated.
> 
> import csv
> import sys
> import os
> import time
> 
> begin = time.time()
> 
> #Check minutes elapsed
> def timeElapsed():
>     current = time.time()
>     elapsed = current-begin
>     return round(elapsed/60)
> 
> 
> #USAGE: python logcleaner.py <input_dir> <output_dir>
> 
> inputdir = sys.argv[1]
> outputdir = sys.argv[2]
> 
> logfilenames = os.listdir(inputdir)
> 
> 
> 
> IDs = set() #IDs from update and install logs
> foundOnceInQuery = set()
> #foundTwiceInQuery = set()
> #IDremovalQ = set()
> # Note: unnecessary, duplicate of foundOnceInQuery; queue of IDs to
> # remove from query logs (IDs found only once in query logs)
> 
> #Generate Filename Queues For Install/Update Logs, Query Logs
> iNuQ = []
> queryQ = []
> 
> for filename in logfilenames:
>     if filename.startswith("par1.install") or filename.startswith("par1.update"):

str.startswith() also accepts a tuple of prefixes (since Python 2.5), so
that can be shortened to:

      if filename.startswith(("par1.install", "par1.update")):

>         iNuQ.append(filename)
>     elif filename.startswith("par1.query"):
>         queryQ.append(filename)
> 
> totalfiles = len(iNuQ) + len(queryQ)
> print "Total # of Files to be Processed:" , totalfiles
> print "Install/Update Logs to be processed:" , len(iNuQ)
> print "Query logs to be processed:" , len(queryQ)
> 
> #Process install/update queue to generate list of valid IDs
> currentfile = 1
> for file in iNuQ:
 >     print "Processing", currentfile, "install/update log out of", len
 > (iNuQ)
 >     print timeElapsed()
 >     reader = csv.reader(open(inputdir+file),delimiter = '\t')
 >     for row in reader:
 >         IDs.add(row[2])
 >     currentfile+=1

Best not to call it 'file'; that's a built-in name.

Also you could use 'enumerate', and joining filepaths is safer with
os.path.join().

for currentfile, filename in enumerate(iNuQ, start=1):
    print "Processing", currentfile, "install/update log out of", len(iNuQ)
    print timeElapsed()
    current_path = os.path.join(inputdir, filename)
    reader = csv.reader(open(current_path), delimiter = '\t')
    for row in reader:
        IDs.add(row[2])

> 
> print "Finished processing install/update logs"
> print "Unique IDs found:" , len(IDs)
> print "Total Time Elapsed:", timeElapsed()
> 
> currentfile = 1
> for file in queryQ:

Similar remarks to above ...

>     print "Processing", currentfile, "query log out of", len(queryQ)
>     print timeElapsed()
>     reader = csv.reader(open(inputdir+file), delimiter = '\t')
>     outputfile = csv.writer(open(outputdir+file), 'w')

... and also here. In addition, the 'w' is being passed to csv.writer() as
a dialect name rather than to open() as a file mode, so this line will fail
(and the output file is opened for reading); the tab delimiter is missing
from the writer too.
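Something like this (untested) is probably closer to what was intended:

     outputfile = csv.writer(open(os.path.join(outputdir, filename), 'wb'),
                             delimiter = '\t')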

>     for row in reader:
>         if row[2] in IDs:
>             ouputfile.writerow(row)

Should be 'outputfile'.

>         else:
>             if row[2] in foundOnceInQuery:
>                 foundOnceInQuery.remove(row[2])

You're removing the ID here ...

>                 outputfile.writerow(row)
>                 #IDremovalQ.remove(row[2])
>                 #foundTwiceInQuery.add(row[2])
> 
>             else:
>                 foundOnceInQuery.add(row[2])

... and adding it again here!

>                 #IDremovalQ.add(row[2])
> 
> 
>     currentfile+=1
> 
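The upshot is that the first row of a duplicated ID is never written out,
even when a second one does turn up. If the aim is to keep every row whose
ID occurs more than once in the query log, one way is to buffer the
first-seen row until a duplicate confirms it. A rough, untested sketch,
per file ('pending' and 'seen_twice' are just names I've made up):

pending = {}        # first-seen row per ID, waiting for a duplicate
seen_twice = set()  # IDs already confirmed as duplicates
for row in reader:
    key = row[2]
    if key in IDs or key in seen_twice:
        outputfile.writerow(row)
    elif key in pending:
        # second sighting: flush the buffered first row, then this one
        outputfile.writerow(pending.pop(key))
        outputfile.writerow(row)
        seen_twice.add(key)
    else:
        pending[key] = row
# anything still in 'pending' occurred only once and is simply not written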
For safety you should close the files after use.
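On Python 2.6 a 'with' statement can do the closing for you (on 2.5 it
needs 'from __future__ import with_statement'); the csv docs also recommend
opening the files in binary mode on 2.x. A rough sketch:

with open(os.path.join(inputdir, filename), 'rb') as infile:
    with open(os.path.join(outputdir, filename), 'wb') as outfile:
        reader = csv.reader(infile, delimiter = '\t')
        writer = csv.writer(outfile, delimiter = '\t')
        for row in reader:
            if row[2] in IDs:
                writer.writerow(row)
            # ... duplicate handling as above ...

Both files are closed when the blocks exit, even if an exception is raised.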

> print "Finished processing query logs and writing new files"
> print "# of Query log entries removed:" , len(foundOnceInQuery)
> print "Total Time Elapsed:", timeElapsed()
> 
Apart from that, it looks OK.

How big are the q files? If they're not too big and most of the time
you're not removing rows, you could put the output rows into a list and
then create the output file only if rows have been removed, otherwise
just copy the input file, which might be faster.
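Something along these lines (untested; keep() stands for whatever test
decides that a row stays):

import shutil

in_path = os.path.join(inputdir, filename)
out_path = os.path.join(outputdir, filename)

rows = []
removed = False
for row in csv.reader(open(in_path, 'rb'), delimiter = '\t'):
    if keep(row):
        rows.append(row)
    else:
        removed = True

if removed:
    writer = csv.writer(open(out_path, 'wb'), delimiter = '\t')
    writer.writerows(rows)
else:
    # nothing was removed: a straight copy avoids re-parsing and re-writing
    shutil.copyfile(in_path, out_path)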


