joining rows

Sat Dec 29 13:57:27 EST 2007

> on a second read ... I see that you mean the case that should only
> join consecutive lines with the same key

Yes...there are actually three cases that occur to me:

1) don't care about order, but want one row for each key (the
first field on each line)

2) do care about order, and don't want disjoint runs of duplicate
keys to be smashed together

3) do care about order, and do want disjoint runs to be smashed
together (presumably outputting the keys in the order they were
first encountered in the file...if not, you'd have to clarify)
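
For reference, case #1 can be sketched roughly as follows.  This is
a minimal illustration rather than the exact code from the earlier
post: it is written for Python 3, and the function name
join_any_order and the sample rows are made up for the example.

```python
def join_any_order(pairs):
    """Case #1: collect every value under its key; output order is not significant."""
    results = {}
    for k, v in pairs:
        # setdefault creates the list the first time a key is seen
        results.setdefault(k, []).append(v)
    return results

# Hypothetical sample data: key "a" appears in two disjoint runs.
rows = [("a", "1"), ("a", "2"), ("b", "3"), ("a", "4")]
joined = join_any_order(rows)
for k in sorted(joined):  # sorted only to make the printout repeatable
    print(k, "|".join(joined[k]))
```

Both runs of "a" end up in a single row here, which is exactly what
separates cases #1 and #3 from case #2.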

My original post addresses #1 and #2, but not #3.  Some tweaks to
my solution for #1 should address #3:

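  results = {}   # key -> list of values seen for that key
  order = []     # keys in the order they first appear
  for line in open('in.txt'):
    # split on the first tab only, so a value may itself contain tabs
    k, v = line.rstrip('\n').split('\t', 1)
    if k not in results:
      order.append(k)
    results.setdefault(k, []).append(v)
  for k in order:
    print('%s %s' % (k, '|'.join(results[k])))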

#2 does have the advantage that it can process large (multi-gig)
streams of data without bogging down.  It behaves like the sed
version, processing only a window at a time and retaining only
the data for the current run of consecutively matching lines,
whereas #1 and #3 keep every key and value in memory until the
end of the input.
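
One way to sketch the streaming behaviour of #2 is with
itertools.groupby, which groups only adjacent items that share a
key.  This is an illustration under assumptions, not necessarily the
approach from the earlier post: it uses Python 3, and the name
join_consecutive, its sep/joiner parameters and the sample lines are
all hypothetical.

```python
from itertools import groupby

def join_consecutive(lines, sep="\t", joiner="|"):
    """Case #2: yield one (key, joined_values) pair per run of adjacent lines sharing a key."""
    # Lazily split each line into [key, value]; nothing is read ahead of need.
    pairs = (line.rstrip("\n").split(sep, 1) for line in lines)
    for key, run in groupby(pairs, key=lambda pair: pair[0]):
        # Only the values of the current run are held in memory.
        yield key, joiner.join(value for _, value in run)

# Hypothetical sample input: key "a" appears in two disjoint runs.
sample = ["a\t1\n", "a\t2\n", "b\t3\n", "a\t4\n"]
for key, joined in join_consecutive(sample):
    print(key, joined)
```

Because it is a generator over any iterable of lines, the same
function can be fed an open file object and will emit each row as
soon as its run ends.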

-tkc