Joining Big Files

Sun Aug 26 10:45:15 EDT 2007

On Aug 26, 6:48 am, Paul McGuire <pt... at austin.rr.com> wrote:
> On Aug 25, 8:15 pm, Paul McGuire <pt... at austin.rr.com> wrote:
>
>
>
> > On Aug 25, 4:57 am, mosscliffe <mcl.off... at googlemail.com> wrote:
>
> > > I have 4 text files each approx 50mb.
>
> > <yawn> 50mb? Really?  Did you actually try this and find out it was a
> > problem?
>
> > Try this:
> > import time
>
> > start = time.clock()
> > outname = "temp.dat"
> > outfile = file(outname,"w")
> > for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
> >     infile = file(inname)
> >     outfile.write( infile.read() )
> >     infile.close()
> > outfile.close()
> > end = time.clock()
>
> > print end-start,"seconds"
>
> > For 4 30Mb files, this takes just over 1.3 seconds on my system.  (You
> > may need to open files in binary mode, depending on the contents, but
> > I was in a hurry.)
>
> > -- Paul
>
> My bad, my test file was not a text file, but a binary file.
> Retesting with a 50Mb text file took 24.6 seconds on my machine.
>
> Still in your working range?  If not, then you will need to pursue
> more exotic approaches.  But 25 seconds on an infrequent basis does
> not sound too bad, especially since I don't think you will really get
> any substantial boost from them (to benchmark this, I timed a raw
> "copy" command at the OS level of the resulting 200Mb file, and this
> took about 20 seconds).
>
> Keep it simple.
>
> -- Paul

There are (at least) another couple of approaches possible, each with
some possible tradeoffs or requirements:

Approach 1: (Least amount of code to write - not that the others are
large :)

Just use os.system() and the UNIX cat command. The requirements here
are that:
a) your web site is hosted on *nix. (OK, you can do it on Windows
too: use copy instead of cat, with the right copy syntax for
concatenating multiple input files into one output file, which joins
the input file names with "+". You might also have to add a
"cmd /c " prefix in front of the copy command.)

b) your hosting plan allows you to execute OS level commands like cat,
and cat is in your OS PATH (not PYTHONPATH). (Similar comments apply
for Windows hosts).

import os
os.system("cat file1.txt file2.txt file3.txt file4.txt > file_out.txt")

cat will take care of buffering, etc. transparently to you.
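
If you would rather not build a shell command string, a rough equivalent is to run cat through the subprocess module and hand it the output file directly. This is only a sketch: the function name concat_with_cat is made up for illustration, and it assumes Python 3.5 or later and that cat is on the PATH.

```python
import subprocess

def concat_with_cat(in_names, out_name):
    """Concatenate in_names into out_name by running the external cat command."""
    # Open the output in binary mode; cat writes its stdout straight into it,
    # so no shell redirection (">") is needed.
    with open(out_name, "wb") as outfile:
        # check=True raises CalledProcessError if cat exits with an error,
        # e.g. when one of the input files does not exist.
        subprocess.run(["cat"] + list(in_names), stdout=outfile, check=True)
```

Calling concat_with_cat(["file1.txt", "file2.txt", "file3.txt", "file4.txt"], "file_out.txt") should have the same effect as the os.system() line above, without involving a shell.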

Approach 2: Read (in a loop, as you originally thought of doing) each
line of each of the 4 input files and write it to the output file:

("Reusing" Paul McGuire's code above:)

import time

start = time.clock()
outname = "temp.dat"
outfile = file(outname,"w")
for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
    infile = file(inname)
    # Copy the current input file one line at a time.
    for lin in infile:
        outfile.write(lin)
    infile.close()
outfile.close()
end = time.clock()

print end-start,"seconds"

Newlines are not removed by the above code: iterating over a file
yields each line with its trailing newline still attached, so
outfile.write(lin) is all you need. Do not write lin + "\n" instead,
or you will end up with a blank line after every line in the output
file.

( Code not tested, test it locally first, though looks ok to me. )
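
For anyone on a newer Python, where file() no longer exists, the same loop might look roughly like the sketch below. The function name join_by_lines is my own invention. The files are opened in binary mode so the bytes are copied unchanged, and each line arrives with its own line ending already attached.

```python
def join_by_lines(in_names, out_name):
    """Append every line of each input file, in order, to out_name."""
    # Binary mode copies bytes as-is: no newline translation, no decoding.
    with open(out_name, "wb") as outfile:
        for in_name in in_names:
            with open(in_name, "rb") as infile:
                # Iterating a file yields one line at a time, line ending
                # included, so nothing has to be added back when writing.
                for line in infile:
                    outfile.write(line)
```

The with blocks close the files even if something goes wrong halfway through, which the explicit close() calls above do not guarantee.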

The reason why this _may_ not be much slower than manually coded
buffering approaches is that:

a) Python's file objects are implemented in C (which is fast), on top
of stdio (the C Standard I/O library, which already does intelligent
buffering)
b) OSes do I/O buffering anyway, and so do hard disk controllers
c) from some fairly recent Python version, I think it was 2.2, the
idiom "for lin in infile" has been stated (based on something I read
in the Python Cookbook) to be pretty efficient anyway (and yet
(slightly) more readable than the earlier approaches to reading a text
file).
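
For comparison, a "manually coded buffering" version is not much code either. The sketch below copies fixed-size chunks with shutil.copyfileobj from the standard library. The function name join_by_chunks and the 1 MB chunk size are arbitrary choices of mine, not tuned values, and it assumes Python 3.

```python
import shutil

CHUNK_SIZE = 1024 * 1024  # 1 MB per read; an arbitrary, untuned guess

def join_by_chunks(in_names, out_name, chunk_size=CHUNK_SIZE):
    """Concatenate the input files into out_name, chunk_size bytes at a time."""
    with open(out_name, "wb") as outfile:
        for in_name in in_names:
            with open(in_name, "rb") as infile:
                # copyfileobj loops over read(chunk_size)/write() until EOF,
                # so only one chunk is held in memory at any moment.
                shutil.copyfileobj(infile, outfile, chunk_size)
```

Unlike a bare infile.read(), this never holds a whole 50 MB file in memory. Whether it is measurably faster is something only a timing run on your own machine can tell.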

Given all the above facts, it probably isn't worth your while to try
and optimize the code unless and until you find (by measurements) that
it's too slow - which is a good practice anyway:

http://en.wikipedia.org/wiki/Optimization_(computer_science)

Excerpt from the above page (it's long but worth reading, IMO):

"Donald Knuth said, paraphrasing Hoare[1],

"We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil." (Knuth, Donald.
Structured Programming with go to Statements, ACM Journal Computing
Surveys, Vol 6, No. 4, Dec. 1974. p.268.)

Charles Cook commented,

"I agree with this. It's usually not worth spending a lot of time
micro-optimizing code before it's obvious where the performance
bottlenecks are. But, conversely, when designing software at a system
level, performance issues should always be considered from the
beginning. A good software developer will do this automatically,
having developed a feel for where performance issues will cause
problems. An inexperienced developer will not bother, misguidedly
believing that a bit of fine tuning at a later stage will fix any
problems." [2]
"

HTH
Vasudev
-----------------------------------------
Vasudev Ram
http://www.dancingbison.com
http://jugad.livejournal.com
http://sourceforge.net/projects/xtopdf
-----------------------------------------