Simple Text Processing Help

Mon Oct 15 08:16:10 EDT 2007

patrick.waldo wrote:

> my sample input file looks like this (not organized, as you see it):
> 200-720-7        69-93-2
> kyselina mocová      C5H4N4O3
> 
> 200-001-8       50-00-0
> formaldehyd      CH2O
> 
> 200-002-3
> 50-01-1
> guanidínium-chlorid      CH5N3.ClH

Assuming that the records are always separated by blank lines and that only
the third field in a record may contain spaces, the following might work:

import codecs
from itertools import groupby

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"

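def fields(s):
    # Split on any whitespace, newlines included; everything between the
    # second and the last token is the name, which may contain spaces.
    parts = s.split()
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]

def records(instream):
    # Group consecutive lines by whether they are blank; each run of
    # non-blank lines is one record.
    for key, group in groupby(instream, unicode.isspace):
        if not key:
            yield "".join(group)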

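if __name__ == "__main__":
    instream = codecs.open(path, "r", "utf8")
    outstream = codecs.open(path2, "w", "utf8")
    try:
        for record in records(instream):
            # Write one pipe-delimited line per record.
            outstream.write("|".join(fields(record)))
            outstream.write("\n")
    finally:
        # Close both files so the output is flushed to disk.
        instream.close()
        outstream.close()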

Peter