How to remove subset from a file efficiently?

AJL unixfd0.n0spam at yahoo.com
Fri Jan 13 10:44:09 EST 2006


On 12 Jan 2006 22:29:22 -0800
"Raymond Hettinger" <python at rcn.com> wrote:

> AJL wrote:
> > How fast does this run?
> >
> > a = set(file('PSP0000320.dat'))
> > b = set(file('CBR0000319.dat'))
> > file('PSP-CBR.dat', 'w').writelines(a.difference(b))
> 
> Turning PSP into a set takes extra time, consumes unnecessary memory,
> eliminates duplicates (possibly a bad thing), and loses the original
> input ordering (probably a bad thing).
> 
> To jam the action into a couple lines, try this:
> 
> import itertools
> b = set(file('CBR0000319.dat'))
> file('PSP-CBR.dat','w').writelines(itertools.ifilterfalse(b.__contains__,file('PSP0000320.dat')))
> 
> Raymond
> 

The OP said "assume machine has plenty memory". ;)

I saw some solutions that used sets and wondered why they built a set
from only one of the files when the problem is really a set problem,
but I can see the reasoning behind it now.
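
To make that reasoning concrete, here is a minimal sketch in current
Python 3. That is an assumption on my part: file() and
itertools.ifilterfalse in the quoted code are Python 2 spellings, and
Python 3 uses open() and itertools.filterfalse instead. The helper name
subtract_lines and the sample lines are made up for illustration; with
real data you would pass open file objects rather than lists.

```python
import itertools


def subtract_lines(keep_from, exclude_from):
    """Return an iterator over lines of keep_from that are absent from exclude_from.

    Only exclude_from is turned into a set, so the order and any
    duplicates in keep_from survive. (Hypothetical helper name.)
    """
    exclude = set(exclude_from)
    return itertools.filterfalse(exclude.__contains__, keep_from)


# Made-up sample data standing in for the PSP and CBR files.
psp = ["a\n", "b\n", "a\n", "c\n"]
cbr = ["b\n"]

# Filtering approach: order and duplicates are preserved.
filtered = list(subtract_lines(psp, cbr))

# Set-difference approach: duplicates collapse and order is not guaranteed.
as_set = set(psp).difference(cbr)
```

Here filtered keeps both copies of "a\n" in their original positions,
while as_set holds each surviving line only once, in no particular
order. Which behaviour is right depends on whether the output file has
to mirror the input.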

AJL


