Simple Text Processing Help

patrick.waldo at gmail.com
Mon Oct 15 17:08:04 EDT 2007


Wow, thank you all.  All three work. To output correctly I needed to
add:

output.write("\r\n")

This is really a great help!!

Since my Python knowledge is limited, I will need to work out exactly
how they do what they do, both for future text manipulation and for my
own understanding.  Could you recommend some resources for this kind of
text manipulation?  Also, I get it conceptually, but would you mind
walking me through

>     for tok in tokens:
>         if NR_RE.match(tok) and len(chem) >= 4:
>             chem[2:-1] = [' '.join(chem[2:-1])]
>             yield chem
>             chem = []
>         chem.append(tok)

and

>     for key, group in groupby(instream, unicode.isspace):
>         if not key:
>             yield "".join(group)
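
For reference, here is how I currently read the first snippet, written
out as a small self-contained Python 3 sketch. The NR_RE pattern, the
sample tokens, and the final flush after the loop are my own
assumptions, since the quoted fragment does not show them; the real
code may differ.

```python
import re

# Assumed pattern for the ID numbers (digits-digits-digits); the
# original NR_RE is not shown in the quoted fragment.
NR_RE = re.compile(r"\d+-\d+-\d+$")

def records(tokens):
    chem = []
    for tok in tokens:
        # A new ID number while chem already holds a full record
        # (two IDs, at least one name token, a formula) means the
        # previous record is complete.
        if NR_RE.match(tok) and len(chem) >= 4:
            # Slice assignment: replace all name tokens between the
            # two IDs and the formula with one space-joined string.
            chem[2:-1] = [" ".join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)
    # Assumed final step: emit the last record once input runs out.
    if len(chem) >= 4:
        chem[2:-1] = [" ".join(chem[2:-1])]
        yield chem

tokens = ("200-720-7 69-93-2 kyselina mocová C5H4N4O3 "
          "200-001-8 50-00-0 formaldehyd CH2O").split()
result = list(records(tokens))
for rec in result:
    print("|".join(rec))
```

So the multi-word name ends up as a single third field, and every
record comes out with exactly four items.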


Thanks again,
Patrick



On Oct 15, 2:16 pm, Peter Otten <__pete... at web.de> wrote:
> patrick.waldo wrote:
> > my sample input file looks like this( not organized,as you see it):
> > 200-720-7        69-93-2
> > kyselina mocová      C5H4N4O3
>
> > 200-001-8       50-00-0
> > formaldehyd      CH2O
>
> > 200-002-3
> > 50-01-1
> > guanidínium-chlorid      CH5N3.ClH
>
> Assuming that the records are always separated by blank lines and that
> only the third field in a record may contain spaces, the following
> might work:
>
> import codecs
> from itertools import groupby
>
> path = "c:\\text_samples\\chem_1_utf8.txt"
> path2 = "c:\\text_samples\\chem_2.txt"
>
> def fields(s):
>     parts = s.split()
>     return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]
>
> def records(instream):
>     for key, group in groupby(instream, unicode.isspace):
>         if not key:
>             yield "".join(group)
>
> if __name__ == "__main__":
>     outstream = codecs.open(path2, 'w', 'utf8')
>     for record in records(codecs.open(path, "r", "utf8")):
>         outstream.write("|".join(fields(record)))
>         outstream.write("\n")
>
> Peter
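
To see what the groupby step in the quoted code does, it may help to
run it on a few lines held in memory. The sketch below is a Python 3
adaptation (str.isspace in place of unicode.isspace) with made-up
sample lines taken from the example input; it is an illustration under
those assumptions, not Peter's exact code.

```python
from itertools import groupby

# Sample lines as they would come from iterating over a file:
# each keeps its trailing newline, and a blank line is just "\n".
lines = [
    "200-720-7        69-93-2\n",
    "kyselina mocová      C5H4N4O3\n",
    "\n",
    "200-001-8       50-00-0\n",
    "formaldehyd      CH2O\n",
]

def fields(s):
    parts = s.split()
    # First two and last token are fixed fields; whatever lies in
    # between is the (possibly multi-word) name.
    return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]

def records(instream):
    # groupby puts consecutive lines with the same key into one group.
    # The key is True for whitespace-only lines and False otherwise.
    for key, group in groupby(instream, str.isspace):
        if not key:
            # A run of non-blank lines is one record; glue it together.
            yield "".join(group)

result = ["|".join(fields(rec)) for rec in records(lines)]
for row in result:
    print(row)
```

The blank lines form their own groups with key True and are simply
skipped, which is why they act as record separators.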




