Organize large DNA txt files

Fri Mar 20 14:06:50 EDT 2009

thomasvangurp at gmail.com wrote:
> Dear Fellow programmers,
> I'm using Python scripts to organize some rather large datasets
> describing DNA variation. Information is read, processed and written
> to a file in sequential order, like this:
> 1+
> 1-
> 2+
> 2-
> etc. The files that I created contain positional information
> (nucleotide position) and some other info, ... I want
> [the two interleaved]
> 
> So the information should be sorted by position.

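If both files are already ordered by position, you can merge them
easily on the fly, holding only one record from each in memory.
This was standard back in the dark days of tape data.  The main
loop is simple; starting and stopping, not so much.  Starting would
be easier if you could assume neither source is empty; the version
below copes with empty sources as well: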

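     def merge(source1, source2):
         """Lazily merge two already-sorted iterables into one sorted stream."""
         a = iter(source1)
         b = iter(source2)
         # Sentinel: 'first' stays 'flag' until a head item is being held.
         first = flag = object()
         try:
             # If a is empty, everything comes from b.
             remainder = b
             ha = next(a)
             # If b is empty, emit ha and then the rest of a.
             first = ha
             remainder = a
             hb = next(b)
             while True:
                 # If a runs out here, hb is pending, then the rest of b.
                 first, remainder = hb, b
                 while ha <= hb: # your comparison here
                     yield ha
                     ha = next(a)
                 # If b runs out here, ha is pending, then the rest of a.
                 first, remainder = ha, a
                 while hb <= ha: # again your comparison here
                     yield hb
                     hb = next(b)
         except StopIteration:
             pass
         # Emit the head that was held when one source ran out, if any.
         if first is not flag:
             yield first
         # Drain whichever source still has items.
         for first in remainder:
             yield first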

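For comparison, the standard library's heapq.merge performs the same
kind of lazy merge of already-sorted inputs, and since Python 3.5 it
accepts a key function. The sketch below is only an illustration: the
tab-separated layout with the position in the first column, and the
names position and merge_by_position, are assumptions, not the
poster's actual file format.

```python
import heapq
import io

def position(line):
    # Hypothetical record layout: "<position>\t<strand>\t<other info>".
    return int(line.split("\t", 1)[0])

def merge_by_position(plus_lines, minus_lines):
    # heapq.merge is lazy: it keeps one pending line per input,
    # so neither file has to fit in memory.
    return heapq.merge(plus_lines, minus_lines, key=position)

# Stand-ins for two already-sorted files opened with open(...).
plus = io.StringIO("1\t+\tA\n4\t+\tG\n9\t+\tT\n")
minus = io.StringIO("2\t-\tC\n4\t-\tG\n")

merged = list(merge_by_position(plus, minus))
for line in merged:
    print(line, end="")
```

With real data you would pass the two open file objects and write each
merged line to the output file instead of building a list. When two
records share a position, the one from the first argument comes out
first.
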
--Scott David Daniels
Scott.Daniels at Acm.Org