Simple Text Processing Help

Paul McGuire ptmcg at austin.rr.com
Tue Oct 16 00:10:24 EDT 2007


On Oct 14, 8:48 am, patrick.wa... at gmail.com wrote:
> Hi all,
>
> I started Python just a little while ago and I am stuck on something
> that is really simple, but I just can't figure out.
>
> Essentially I need to take a text document with some chemical
> information in Czech and organize it into another text file.  The
> information is always EINECS number, CAS, chemical name, and formula
> in tables.  I need to organize them into lines with | in between.  So
> it goes from:
>
> 200-763-1                     71-73-8
> nátrium-tiopentál           C11H18N2O2S.Na           to:
>
> 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> but if I have a chemical like: kyselina močová
>
> I get:
> 200-720-7|69-93-2|kyselina|močová
> |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> and then it is all off.

Pyparsing might be overkill for this example, but it makes a good
demo.  If you end up doing lots of data extraction like this,
pyparsing is a useful tool.  In pyparsing, you define expressions
using pyparsing classes and built-in string constants (such as nums
and alphas), then use the constructed expression to parse the data
(using parseString, scanString, searchString, or transformString).
In this example, searchString is the easiest to use, since it scans
through the input and returns every match.  After the parsing is
done, the parsed fields are returned in a ParseResults object, which
supports both list-style and dict-style access.  I've given each field
a name based on your post, so that you can read the tokens right out
of the results as if they were attributes of an object.  This example
emits your '|' delimited data, but the commented lines show how you
could access the individually parsed fields, too.

Learn more about pyparsing at http://pyparsing.wikispaces.com/ .

-- Paul


# -*- coding: iso-8859-15 -*-

data = """200-720-7        69-93-2
kyselina mocová      C5H4N4O3


200-001-8       50-00-0
formaldehyd      CH2O


200-002-3
50-01-1
guanidínium-chlorid      CH5N3.ClH

"""

from pyparsing import Word, OneOrMore, nums, alphas, alphas8bit

# define expressions for each part in the input data

# a numeric id starts with a number, and is followed by
# any number of numbers or '-'s
numericId = Word(nums, nums+"-")

# a chemical name is one or more words, each made up of
# alphas (including 8-bit alphas) or '-'s
chemName = OneOrMore(Word(alphas.lower()+alphas8bit.lower()+"-"))

# when returning the chemical name, rejoin the separate
# words into a single string, with spaces
chemName.setParseAction(lambda t:" ".join(t))

# a chemical formula is a 'word' starting with an uppercase
# alpha, followed by alphas, numbers, or '.'s (the '.' and the
# lowercase alphas are needed for salts like CH5N3.ClH)
chemFormula = Word(alphas.upper(), alphas+nums+".")

# put all expressions into overall form, and attach field names
entry = (numericId("EINECS") +
         numericId("CAS") +
         chemName("name") +
         chemFormula("formula"))

# search through input data, and print out retrieved data
for chemData in entry.searchString(data):
    print("%(EINECS)s|%(CAS)s|%(name)s|%(formula)s" % chemData)
    # or print each field by itself
    # print(chemData.EINECS)
    # print(chemData.CAS)
    # print(chemData.name)
    # print(chemData.formula)
    # print("")


prints:
200-720-7|69-93-2|kyselina mocová|C5H4N4O3
200-001-8|50-00-0|formaldehyd|CH2O
200-002-3|50-01-1|guanidínium-chlorid|CH5N3.ClH
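
Since pyparsing might be overkill here, the same record layout can
probably be handled with the standard re module alone.  What follows
is only a rough sketch of that alternative, not pyparsing and not a
drop-in replacement for the script above.  The names WORD, ENTRY, and
to_pipe_lines are made up for illustration, the sample data is a
shortened copy of the input, and it assumes Python 3 with Unicode
strings.  The formula pattern allows '.' and lowercase letters so that
salts such as CH5N3.ClH stay in one piece.

```python
import re

# Shortened sample input in the same layout as the post.
data = """200-720-7        69-93-2
kyselina močová      C5H4N4O3


200-002-3
50-01-1
guanidínium-chlorid      CH5N3.ClH
"""

# One word of a chemical name: letters (including accented ones) or '-',
# but no digits, underscores, or ASCII uppercase letters, so a name
# word can never swallow the start of a formula.
WORD = r"(?:[^\W\d_A-Z]|-)+"

# One record: two numeric ids, a name of one or more words,
# and a formula that starts with an uppercase letter.
ENTRY = re.compile(
    r"(?P<EINECS>\d[\d-]*)\s+"
    r"(?P<CAS>\d[\d-]*)\s+"
    r"(?P<name>" + WORD + r"(?:[ \t]+" + WORD + r")*)\s+"
    r"(?P<formula>[A-Z][A-Za-z0-9.]*)"
)


def to_pipe_lines(text):
    """Return one 'EINECS|CAS|name|formula' string per record found."""
    lines = []
    for m in ENTRY.finditer(text):
        # Collapse runs of spaces inside a multi-word name.
        name = " ".join(m.group("name").split())
        lines.append("|".join(
            [m.group("EINECS"), m.group("CAS"), name, m.group("formula")]))
    return lines


lines = to_pipe_lines(data)
# lines[0] == "200-720-7|69-93-2|kyselina močová|C5H4N4O3"
# lines[1] == "200-002-3|50-01-1|guanidínium-chlorid|CH5N3.ClH"
```

The regex is shorter, but the pyparsing version is easier to extend
once the records pick up optional fields or other variations.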




More information about the Python-list mailing list