Simple Text Processing Help
Paul Hankin
paul.hankin at gmail.com
Mon Oct 15 07:54:20 EDT 2007
On Oct 15, 12:20 pm, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
> On Mon, 15 Oct 2007 10:47:16 +0000, patrick.waldo wrote:
> > my sample input file looks like this (not organized, as you see it):
> > 200-720-7 69-93-2
> > kyselina mocová C5H4N4O3
>
> > 200-001-8 50-00-0
> > formaldehyd CH2O
>
> > 200-002-3
> > 50-01-1
> > guanidínium-chlorid CH5N3.ClH
>
> > etc...
>
> That's quite irregular so it is not that straightforward. One way is to
> split everything into words, start a record by taking the first two
> elements and then look for the start of the next record that looks like
> three numbers concatenated by '-' characters. Quick and dirty hack:
>
> import codecs
> import re
>
> NR_RE = re.compile(r'^\d+-\d+-\d+$')
>
> def iter_elements(tokens):
>     tokens = iter(tokens)
>     try:
>         nr_a = tokens.next()
>         while True:
>             nr_b = tokens.next()
>             items = list()
>             for item in tokens:
>                 if NR_RE.match(item):
>                     yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
>                     nr_a = item
>                     break
>                 else:
>                     items.append(item)
>     except StopIteration:
>         yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
Maybe this is a bit more readable?
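def iter_elements(tokens):
    chem = []
    for tok in tokens:
        # A new record starts at a number-like token, but only once the
        # current record already has both numbers, a name and a formula.
        if NR_RE.match(tok) and len(chem) >= 4:
            # Collapse the words between the two numbers and the formula
            # into a single name.
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    # Join the name of the final record too, so that a multi-word name
    # at the end of the input is handled like the others.
    if len(chem) >= 4:
        chem[2:-1] = [' '.join(chem[2:-1])]
    yield chem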
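
For anyone who wants to try it, here is a rough, self-contained sketch that runs the generator over the sample input quoted above. It assumes Python 3 and that the tokens come from a plain whitespace split of the text. The SAMPLE string and the records variable are only names made up for this illustration. The function is repeated so that the block runs on its own, and it also joins the name of the last record, which may or may not be what you want for your real file.

```python
import re

# Same pattern as in the quoted code: three digit groups joined by '-'.
NR_RE = re.compile(r'^\d+-\d+-\d+$')

def iter_elements(tokens):
    chem = []
    for tok in tokens:
        # Start a new record only when the current one is complete
        # (two numbers, at least one name word, and a formula).
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    # Assumption for this sketch: the last record gets its name joined too.
    if len(chem) >= 4:
        chem[2:-1] = [' '.join(chem[2:-1])]
    yield chem

# Hypothetical stand-in for the real input file, copied from the sample.
SAMPLE = """\
200-720-7 69-93-2
kyselina mocová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH
"""

# str.split() with no arguments splits on any whitespace, newlines included,
# so the irregular line breaks in the input do not matter.
records = list(iter_elements(SAMPLE.split()))
for rec in records:
    print(rec)
```

Each record comes out as a four-element list: the two numbers, the name (joined with spaces if it had several words), and the formula. Note that this relies on the formula always being the last token of a record and never containing whitespace.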
--
Paul Hankin