list comprehension help

Alex Martelli aleax at mac.com
Sun Mar 18 23:54:15 EDT 2007


George Sakkis <george.sakkis at gmail.com> wrote:

> On Mar 18, 12:11 pm, "rkmr... at gmail.com" <rkmr... at gmail.com> wrote:
> 
> > Hi
> > I need to process a really huge text file (4GB) and this is what I
> > need to do. It takes forever to complete. I read somewhere that
> > a "list comprehension" can speed things up. Can you point out how to
> > do it in this case?
> > thanks a lot!
> >
> > f = open('file.txt','r')
> > for line in f:
> >         db[line.split(' ')[0]] = line.split(' ')[-1]
> >         db.sync()
> 
> You got several good suggestions; one that has not been mentioned but
> makes a big (or even the biggest) difference for large/huge files is
> the buffering parameter of open(). Set it to the largest value you can
> afford, to keep I/O as low as possible. I'm processing 15-25 GB
> files (you see, "huge" is really relative ;-)) on 2-4GB RAM boxes, and
> setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
> compared to the default value. BerkeleyDB should have a buffering
> option too; make sure you use it, and don't synchronize on every line.
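
Just to make the above concrete, here's a minimal sketch of how the
OP's loop might apply those two suggestions (assuming db is an
already-open bsddb-style mapping, and that syncing every 10000 lines,
an arbitrary figure, is an acceptable durability tradeoff):

    f = open('file.txt', 'r', 16*1024*1024)   # 16 MB buffer; tune to taste
    try:
        for i, line in enumerate(f):
            parts = line.split(' ')            # split once, not twice
            db[parts[0]] = parts[-1]
            if i % 10000 == 9999:              # sync in batches, not per line
                db.sync()
    finally:
        f.close()
    db.sync()                                  # one final sync at the end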

Out of curiosity, what OS and FS are you using?  On a well-tuned FS and
OS combo that does "read-ahead" properly, I would not expect such
improvements for moving from large to huge buffering (unless some other
pesky process is perking up once in a while and sending the disk heads
on a quest to never-never land).  IOW, if I observed this performance
behavior on a server machine I'm responsible for, I'd look for
system-level optimizations (unless I know I'm being forced by myopic
beancounters to run inappropriate OSs/FSs, in which case I'd spend the
time polishing my resume instead) - maybe tuning the OS (or mount?)
parameters, maybe finding a way to satisfy the "other pesky process"
without flapping disk heads all over the prairie, etc., etc.

The delay of filling a "1 GB or more" buffer before actual processing
can begin _should_ defeat any gains over, say, a 1 MB buffer -- unless,
that is, something bad is seriously interfering with the normal
system-level read-ahead optimization... and in that case I'd normally be
more interested in finding and squashing the "something bad" than in
trying to work around it by overprovisioning application buffer space!-)
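
If you want to check how much the buffer size matters on your own OS/FS
combo, a quick-and-dirty timing loop along these lines would tell you
(the file name and buffer sizes are just placeholders, of course):

    import time

    for bufsize in (1024*1024, 16*1024*1024, 1024*1024*1024):
        start = time.time()
        f = open('file.txt', 'r', bufsize)
        n = 0
        for line in f:
            n += 1                  # minimal per-line work: measure I/O only
        f.close()
        print '%10d-byte buffer: %d lines in %.1f s' % (
            bufsize, n, time.time() - start)

Just remember to defeat the OS cache between runs (e.g. by reading some
other multi-GB file in between), or the later runs will look unfairly
fast.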


Alex


