Simple Text Processing Help

Mon Oct 15 07:54:20 EDT 2007

On Oct 15, 12:20 pm, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
> On Mon, 15 Oct 2007 10:47:16 +0000, patrick.waldo wrote:
> > My sample input file looks like this (not organized, as you can see):
> > 200-720-7        69-93-2
> > kyselina mocová      C5H4N4O3
>
> > 200-001-8       50-00-0
> > formaldehyd      CH2O
>
> > 200-002-3
> > 50-01-1
> > guanidínium-chlorid      CH5N3.ClH
>
> > etc...
>
> That's quite irregular so it is not that straightforward.  One way is to
> split everything into words, start a record by taking the first two
> elements and then look for the start of the next record that looks like
> three numbers concatenated by '-' characters.  Quick and dirty hack:
>
> import codecs
> import re
>
> NR_RE = re.compile(r'^\d+-\d+-\d+$')
>
> def iter_elements(tokens):
>     tokens = iter(tokens)
>     try:
>         nr_a = next(tokens)
>         while True:
>             nr_b = next(tokens)
>             items = list()
>             for item in tokens:
>                 if NR_RE.match(item):
>                     yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
>                     nr_a = item
>                     break
>                 else:
>                     items.append(item)
>     except StopIteration:
>         yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])

Maybe this is a bit more readable?

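def iter_elements(tokens):
    # NR_RE is the compiled pattern from the quoted code above.
    chem = []
    for tok in tokens:
        # A number starts a new record once the current one is complete.
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    # Join the name of the last record as well before yielding it.
    if len(chem) >= 4:
        chem[2:-1] = [' '.join(chem[2:-1])]
    if chem:
        yield chem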
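
For what it's worth, here is a rough sketch of how the generator could be run on the sample from the original post. The SAMPLE string and the use of str.split() for tokenizing are my assumptions and not something from the thread. The function is repeated only so the snippet runs on its own, and it joins the name of the last record the same way as the others.

```python
import re

# Same pattern as in the quoted code: three numbers joined by '-'.
NR_RE = re.compile(r'^\d+-\d+-\d+$')

# Assumed sample input, copied from the original post.
SAMPLE = """\
200-720-7        69-93-2
kyselina mocová      C5H4N4O3

200-001-8       50-00-0
formaldehyd      CH2O

200-002-3
50-01-1
guanidínium-chlorid      CH5N3.ClH
"""


def iter_elements(tokens):
    chem = []
    for tok in tokens:
        # A number token starts a new record once the current one already
        # holds two numbers, at least one name word and a formula.
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    # Handle the last record: join its name words, skip empty input.
    if len(chem) >= 4:
        chem[2:-1] = [' '.join(chem[2:-1])]
    if chem:
        yield chem


# Splitting on whitespace hides the irregular line layout of the file.
records = list(iter_elements(SAMPLE.split()))
for record in records:
    print(record)
```

With this sample the first record should come out as ['200-720-7', '69-93-2', 'kyselina mocová', 'C5H4N4O3']. The len(chem) >= 4 test is what keeps the second number (which also matches NR_RE) from being taken as the start of a new record.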

--
Paul Hankin