Joining Big Files

mcl mcl.office at googlemail.com
Sun Aug 26 15:43:13 EDT 2007


On 26 Aug, 15:45, vasudevram <vasudev... at gmail.com> wrote:
> On Aug 26, 6:48 am, Paul McGuire <pt... at austin.rr.com> wrote:
>
>
>
> > On Aug 25, 8:15 pm, Paul McGuire <pt... at austin.rr.com> wrote:
>
> > > > On Aug 25, 4:57 am, mosscliffe <mcl.off... at googlemail.com> wrote:
>
> > > > I have 4 text files each approx 50mb.
>
> > > <yawn> 50mb? Really?  Did you actually try this and find out it was a
> > > problem?
>
> > > Try this:
> > > import time
>
> > > start = time.clock()
> > > outname = "temp.dat"
> > > outfile = file(outname,"w")
> > > for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
> > >     infile = file(inname)
> > >     outfile.write( infile.read() )
> > >     infile.close()
> > > outfile.close()
> > > end = time.clock()
>
> > > print end-start,"seconds"
>
> > > For 4 30Mb files, this takes just over 1.3 seconds on my system.  (You
> > > may need to open files in binary mode, depending on the contents, but
> > > I was in a hurry.)
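>
> > > (For the binary-mode variant, only the mode strings would change,
> > > e.g. outfile = file(outname,"wb") and infile = file(inname,"rb");
> > > the rest of the loop stays the same.)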
>
> > > -- Paul
>
> > My bad, my test file was not a text file, but a binary file.
> > Retesting with a 50Mb text file took 24.6 seconds on my machine.
>
> > Still in your working range?  If not, then you will need to pursue
> > more exotic approaches.  But 25 seconds on an infrequent basis does
> > not sound too bad, especially since I don't think you will really get
> > any substantial boost from them (to benchmark this, I timed a raw
> > "copy" command at the OS level of the resulting 200Mb file, and this
> > took about 20 seconds).
>
> > Keep it simple.
>
> > -- Paul
>
> There are (at least) another couple of approaches possible, each with
> some possible tradeoffs or requirements:
>
> Approach 1. (Least amount of code to write - not that the others are
> large :)
>
> Just use os.system() and the UNIX cat command - the requirements here
> are:
> a) your web site is hosted on *nix (you can do it on Windows too -
> use copy instead of cat, you may have to add a "cmd /c " prefix in
> front of the copy command, and you have to use the right copy syntax
> for concatenating multiple input files into one output file; see the
> sketch after the code below).
>
> b) your hosting plan allows you to execute OS level commands like cat,
> and cat is in your OS PATH (not PYTHONPATH). (Similar comments apply
> for Windows hosts).
>
> import os
> os.system("cat file1.txt file2.txt file3.txt file4.txt >
> file_out.txt")
>
> cat will take care of buffering, etc. transparently to you.
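>
> As a rough, untested sketch of the same idea, the subprocess module
> (Python 2.4+) can be used instead of os.system, and on Windows the
> rough equivalent of cat is the copy command in binary mode:
>
> import subprocess
>
> # *nix: let cat do the concatenation and its own buffering
> infiles = ["file1.txt", "file2.txt", "file3.txt", "file4.txt"]
> subprocess.call("cat %s > file_out.txt" % " ".join(infiles), shell=True)
>
> # Windows rough equivalent, run through cmd:
> # copy /b file1.txt + file2.txt + file3.txt + file4.txt file_out.txt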
>
> Approach 2: Read (in a loop, as you originally thought of doing) each
> line of each of the 4 input files and write it to the output file:
>
> ("Reusing" Paul McGuire's code above:)
>
> outname = "temp.dat"
> outfile = file(outname,"w")
> for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
>     infile = file(inname)
>     for lin in infile:
>         outfile.write(lin)
>     infile.close()
> outfile.close()
> end = time.clock()
>
> print end-start,"seconds"
>
> # You may need to check that newlines are not removed in the above
> code, in the output file.  Can't remember right now. If they are, just
> add one back with:
>
> outfile.write(lin + "\n") instead of  outfile.write(lin) .
>
> ( Code not tested, test it locally first, though looks ok to me. )
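>
> As a quick, untested local check for that, the repr of one line shows
> whether the newline is kept:
>
> print repr(file("file1.dat").readline())   # ends with \n if kept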
>
> The reasons why this _may_ not be much slower than manually coded
> buffering approaches are:
>
> a) Python's standard library is written in C (which is fast),
> including use of stdio (the C Standard IO library, which already does
> intelligent buffering)
> b) OS's do I/O buffering anyway, so do hard disk controllers
> c) since a relatively recent Python version (2.2, I think), the idiom
> "for lin in infile" has been (based on something I read in the Python
> Cookbook) stated to be pretty efficient anyway, and it is (slightly)
> more readable than the approaches used earlier for reading a text
> file.
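>
> If a manually buffered copy is wanted anyway, an untested sketch using
> shutil.copyfileobj (which copies in fixed-size chunks) would be:
>
> import shutil
>
> outfile = file("temp.dat", "wb")
> for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
>     infile = file(inname, "rb")
>     shutil.copyfileobj(infile, outfile, 64 * 1024)   # 64 KB chunks
>     infile.close()
> outfile.close()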
>
> Given all the above facts, it probably isn't worth your while to try
> and optimize the code unless and until you find (by measurements) that
> it's too slow - which is a good practice anyway:
>
> http://en.wikipedia.org/wiki/Optimization_(computer_science)
>
> Excerpt from the above page (it's long but worth reading, IMO):
>
> "Donald Knuth said, paraphrasing Hoare[1],
>
> "We should forget about small efficiencies, say about 97% of the time:
> premature optimization is the root of all evil." (Knuth, Donald.
> Structured Programming with go to Statements, ACM Journal Computing
> Surveys, Vol 6, No. 4, Dec. 1974. p.268.)
>
> Charles Cook commented,
>
> "I agree with this. It's usually not worth spending a lot of time
> micro-optimizing code before it's obvious where the performance
> bottlenecks are. But, conversely, when designing software at a system
> level, performance issues should always be considered from the
> beginning. A good software developer will do this automatically,
> having developed a feel for where performance issues will cause
> problems. An inexperienced developer will not bother, misguidedly
> believing that a bit of fine tuning at a later stage will fix any
> problems." [2]
> "
>
> HTH
> Vasudev
> -----------------------------------------
> Vasudev Ram
> http://www.dancingbison.com
> http://jugad.livejournal.com
> http://sourceforge.net/projects/xtopdf
> -----------------------------------------

All,

Thank you very much.

As my background is in much smaller memory machines than today's
giants - 64k being a big machine and 640k being gigantic - I get very
worried about crashing machines when copying or editing big files,
especially in a multi-user environment.

Mr Knuth - that brings back memories.  I remember implementing some
of his sort routines on a mainframe with 24 tape units and an 8k drum
and almost eliminating one shift per day of computer operator time.

Thanks again

Richard





