Simple Text Processing Help

Mon Oct 15 07:20:47 EDT 2007

On Mon, 15 Oct 2007 10:47:16 +0000, patrick.waldo wrote:

> my sample input file looks like this (not organized, as you see it):
> 200-720-7        69-93-2
> kyselina mocová      C5H4N4O3
> 
> 200-001-8       50-00-0
> formaldehyd      CH2O
> 
> 200-002-3
> 50-01-1
> guanidínium-chlorid      CH5N3.ClH
> 
> etc...

That's quite irregular, so it is not that straightforward.  One way is to
split everything into whitespace-separated words, start a record by
taking the first two elements, and then collect words until the start of
the next record, which looks like three numbers joined by '-' characters.
The last word collected is the formula, everything before it is the name.
Quick and dirty hack:

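import codecs
import re

# A record starts with a number of the form 200-720-7.
NR_RE = re.compile(r'^\d+-\d+-\d+$')

def iter_elements(tokens):
    """Yield (nr_a, nr_b, name, formula) tuples from a flat token list."""
    tokens = iter(tokens)
    try:
        nr_a = next(tokens)
    except StopIteration:
        return  # Empty input, nothing to yield.
    try:
        while True:
            nr_b = next(tokens)
            items = list()
            for item in tokens:
                if NR_RE.match(item):
                    # `item` starts the next record, so emit the current one.
                    yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
                    nr_a = item
                    break
                else:
                    items.append(item)
    except StopIteration:
        # Input exhausted: emit the record that was collected last.
        yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])

def main():
    in_file = codecs.open('test.txt', 'r', 'utf-8')
    tokens = in_file.read().split()
    in_file.close()
    for element in iter_elements(tokens):
        # next() and print() in this form work on Python 2.6+ and Python 3.
        print('|'.join(element))

if __name__ == '__main__':
    main()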
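
To try the idea on the quoted sample without a file, here is a
self-contained sketch.  It assumes Python 3, and the names `SAMPLE` and
`split_records` are made up for this example.  It indexes into the token
list instead of using a generator, so it is a variation on the approach,
not the same code:

```python
import re

# A record starts with a number of the form 200-720-7.
NR_RE = re.compile(r'^\d+-\d+-\d+$')

# The sample from the quoted message, embedded so that no file is needed.
SAMPLE = """\
200-720-7        69-93-2
kyselina mocová      C5H4N4O3

200-001-8       50-00-0
formaldehyd      CH2O

200-002-3
50-01-1
guanidínium-chlorid      CH5N3.ClH
"""

def split_records(text):
    """Return a list of (nr_a, nr_b, name, formula) tuples (hypothetical helper)."""
    tokens = text.split()
    records = []
    i = 0
    while i < len(tokens):
        # Skip the two leading numbers, then scan to the next record start.
        j = i + 2
        while j < len(tokens) and not NR_RE.match(tokens[j]):
            j += 1
        chunk = tokens[i:j]
        if len(chunk) < 3:
            raise ValueError('incomplete record: %r' % (chunk,))
        # Last word is the formula, the words before it form the name.
        records.append((chunk[0], chunk[1], ' '.join(chunk[2:-1]), chunk[-1]))
        i = j
    return records

for record in split_records(SAMPLE):
    print('|'.join(record))
```

For the sample this prints:

200-720-7|69-93-2|kyselina mocová|C5H4N4O3
200-001-8|50-00-0|formaldehyd|CH2O
200-002-3|50-01-1|guanidínium-chlorid|CH5N3.ClH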

Ciao,
	Marc 'BlackJack' Rintsch