Simple Text Processing Help
Marc 'BlackJack' Rintsch
bj_666 at gmx.net
Mon Oct 15 07:20:47 EDT 2007
On Mon, 15 Oct 2007 10:47:16 +0000, patrick.waldo wrote:
> my sample input file looks like this (not organized, as you can see):
> 200-720-7 69-93-2
> kyselina mocová C5H4N4O3
>
> 200-001-8 50-00-0
> formaldehyd CH2O
>
> 200-002-3
> 50-01-1
> guanidínium-chlorid CH5N3.ClH
>
> etc...
The input is quite irregular, so this is not entirely straightforward. One
way is to split everything into words, start a record by taking the first
two elements as the two numbers, and then collect words until the start of
the next record, which looks like three numbers joined by '-' characters.
Quick and dirty hack:
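import codecs
import re

# A record starts with a token such as 200-720-7: three numbers joined by '-'.
NR_RE = re.compile(r'^\d+-\d+-\d+$')


def iter_elements(tokens):
    # Assumes at least one complete record: two numbers, a name, a formula.
    tokens = iter(tokens)
    try:
        nr_a = tokens.next()
        while True:
            nr_b = tokens.next()
            items = list()
            for item in tokens:
                if NR_RE.match(item):
                    # Start of the next record: emit the one collected so far.
                    yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
                    nr_a = item
                    break
                else:
                    items.append(item)
    except StopIteration:
        # Input exhausted: emit the last record.
        yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])


def main():
    in_file = codecs.open('test.txt', 'r', 'utf-8')
    tokens = in_file.read().split()
    in_file.close()
    for element in iter_elements(tokens):
        print '|'.join(element)


if __name__ == '__main__':
    main()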
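
For readers on Python 3, the same idea can be sketched roughly as follows. This is a restructured variant, not the code as posted: the names `iter_records` and `SAMPLE` are made up for illustration, the sample text is the three records quoted above (without the trailing "etc..."), and the sketch silently drops an incomplete trailing record, which is an assumption about the desired behaviour.

```python
import re

# Same pattern as in the post: three groups of digits joined by '-'.
NR_RE = re.compile(r'^\d+-\d+-\d+$')


def iter_records(tokens):
    """Yield (nr_a, nr_b, name, formula) tuples from a flat list of words."""
    record = []
    for token in tokens:
        # A number-like token opens a new record only once the current one
        # already holds both numbers and at least one further word.
        if NR_RE.match(token) and len(record) > 2:
            yield (record[0], record[1], ' '.join(record[2:-1]), record[-1])
            record = []
        record.append(token)
    # Emit the last record if it is complete (assumption: otherwise skip it).
    if len(record) > 2:
        yield (record[0], record[1], ' '.join(record[2:-1]), record[-1])


# Hypothetical stand-in for the contents of test.txt.
SAMPLE = """200-720-7 69-93-2
kyselina mocová C5H4N4O3

200-001-8 50-00-0
formaldehyd CH2O

200-002-3
50-01-1
guanidínium-chlorid CH5N3.ClH
"""

for record in iter_records(SAMPLE.split()):
    print('|'.join(record))
# 200-720-7|69-93-2|kyselina mocová|C5H4N4O3
# 200-001-8|50-00-0|formaldehyd|CH2O
# 200-002-3|50-01-1|guanidínium-chlorid|CH5N3.ClH
```

Like the original, this treats the last word of each record as the formula and everything between the second number and the formula as the name, so it would misparse a name that itself looks like three numbers joined by '-'.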
Ciao,
Marc 'BlackJack' Rintsch