Simple Text Processing Help
Peter Otten
__peter__ at web.de
Mon Oct 15 08:16:10 EDT 2007
patrick.waldo wrote:
> my sample input file looks like this (not organized, as you see it):
> 200-720-7 69-93-2
> kyselina mocová C5H4N4O3
>
> 200-001-8 50-00-0
> formaldehyd CH2O
>
> 200-002-3
> 50-01-1
> guanidínium-chlorid CH5N3.ClH
Assuming that the records are always separated by blank lines and that only
the third field in a record may contain spaces, the following might work:

import codecs
from itertools import groupby

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"

def fields(s):
    parts = s.split()
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]

def records(instream):
    for key, group in groupby(instream, unicode.isspace):
        if not key:
            yield "".join(group)

if __name__ == "__main__":
    outstream = codecs.open(path2, 'w', 'utf8')
    for record in records(codecs.open(path, "r", "utf8")):
        outstream.write("|".join(fields(record)))
        outstream.write("\n")

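
The code above targets Python 2, where unicode is a built-in type. On Python 3 the same idea might be sketched roughly as follows, using str.isspace as the grouping key. This is an adaptation under assumptions, not the code as posted: convert() and the sample string are names introduced here for illustration, the sample text is copied from the quoted input, and in-memory streams stand in for the real files.

```python
import io
from itertools import groupby


def fields(record):
    # split() breaks on any whitespace, newlines included. The first two
    # tokens and the last one are single words; whatever lies in between
    # is the (possibly multi-word) name.
    parts = record.split()
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]


def records(lines):
    # groupby() batches consecutive lines by the key str.isspace, so runs
    # of blank lines (key True) alternate with runs of record lines
    # (key False). Only the non-blank runs are joined and yielded.
    for is_blank, group in groupby(lines, str.isspace):
        if not is_blank:
            yield "".join(group)


def convert(instream, outstream):
    # Hypothetical helper: writes one pipe-delimited line per record.
    for record in records(instream):
        outstream.write("|".join(fields(record)) + "\n")


# Sample taken from the quoted input; StringIO stands in for the files.
sample = (
    "200-720-7 69-93-2\n"
    "kyselina mocová C5H4N4O3\n"
    "\n"
    "200-001-8 50-00-0\n"
    "formaldehyd CH2O\n"
    "\n"
    "200-002-3\n"
    "50-01-1\n"
    "guanidínium-chlorid CH5N3.ClH\n"
)
out = io.StringIO()
convert(io.StringIO(sample), out)
result = out.getvalue()
print(result, end="")
```

With real files, the two streams could be opened with open(path, encoding="utf-8") and open(path2, "w", encoding="utf-8") inside a with statement, so that the output is closed and flushed explicitly. The same assumptions as before apply: records are separated by blank lines, and only the third field may contain spaces.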
Peter