Line Text Parsing

Paul McGuire ptmcg at austin.stopthespam_rr.com
Wed Feb 4 23:30:54 EST 2004


"allanc" <kawNOSPAMenks at nospamyahoo.ca> wrote in message
news:Xns948575A2C930Aacuencacanadacom at 198.161.157.145...
> I'm new with python so bear with me.
>
> I'm looking for a way to elegantly parse fixed-width text data (as opposed
> to CSV) and save the parsed data into a database. The text data comes
> from an old ISAM-format table and each line may be a different record
> structure depending on key fields in the line.
>
> RegExp with match and split are of interest but it's been too long since
> I've dabbled with RE to be able to judge whether its use will make the
> problem more complex.
>
> Here's a sample of the records I need to parse:
>
> 01508390019002      11284361000002SUGARPLUM
> 015083915549           SHORT ON LAST ORDER
> 0150839220692 000002EA BMC   15 KG   001400
>
> 1st Line is a (portion of) header record.
> 2nd Line is a text instruction record.
> 3rd Line is a Transaction Line Item record.
>
> Each type of record has a different structure. But these sets of lines
> all appear in the one table.
>
>
> Any ideas would be greatly appreciated.
>
> Allan
Allan -

Let me put in a plug for pyparsing.  I think your problem is tailor-made for
pyparsing's easy-to-use grammar definitions and execution.  There is no
special lex/yacc-like syntax or RE symbology to master; you assemble your
grammar using simply-named classes (such as Literal, OneOrMore,
Word(wordchars), Optional, etc.) and intuitive operators (+ for sequence,
| for first-match alternation, ^ for longest-match alternation, and ~ for,
um, Not-tion, i.e. a negative lookahead).
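To make the operators concrete, here is a minimal sketch.  The element names
(pair, first, longest, not_end) and the sample strings are made up for
illustration, and it assumes a pyparsing version that still accepts the
camelCase method names used in this post:

```python
from pyparsing import Literal, Word, alphas, nums, ParseException

# '+' builds a sequence: a word followed by a number.
pair = Word(alphas) + Word(nums)
print(pair.parseString("abc 123").asList())      # ['abc', '123']

# '|' tries alternatives left to right and takes the first one that matches.
first = Literal("in") | Literal("integer")
print(first.parseString("integer").asList())     # ['in']

# '^' tries every alternative and keeps the longest match.
longest = Literal("in") ^ Literal("integer")
print(longest.parseString("integer").asList())   # ['integer']

# '~' is a negative lookahead: accept a word only if "end" does not come next.
not_end = ~Literal("end") + Word(alphas)
print(not_end.parseString("begin").asList())     # ['begin']
try:
    not_end.parseString("end")
    rejected = False
except ParseException:
    rejected = True
print(rejected)                                  # True
```

Note that parseString only has to match the start of the input by default,
which is why the '|' example can stop after "in".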

A grammar to parse "Hello, World!" might look like:
    helloGrammar = Word(alphas) + "," + Word(alphas) + oneOf(". ! ? !! !!!")
which could then parse any of:
    Hello, World!
    Hello  ,   World   !
    Hello,World!
    Yo, Adrian!!!
    Hey, man.
    Whattup, dude?
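If you want to try this yourself, the following sketch runs the grammar over
the sample strings above.  The samples list and the parsed dictionary are
just scaffolding for the demonstration, not part of pyparsing:

```python
from pyparsing import Word, alphas, oneOf

helloGrammar = Word(alphas) + "," + Word(alphas) + oneOf(". ! ? !! !!!")

samples = [
    "Hello, World!",
    "Hello  ,   World   !",
    "Hello,World!",
    "Yo, Adrian!!!",
    "Hey, man.",
    "Whattup, dude?",
]

parsed = {}
for text in samples:
    # Whitespace between tokens is skipped by default, so the spacing
    # variations all yield the same token list.
    parsed[text] = helloGrammar.parseString(text).asList()
    print(parsed[text])
```

Each result is a flat list of four tokens: greeting, comma, name, and
closing punctuation.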

You can associate field names with specific parse elements, so that the
fields can be extracted from the results by name, for example:
    from pyparsing import Word, alphas, oneOf

    helloGrammar = Word(alphas).setResultsName("greeting") + "," + \
        Word(alphas).setResultsName("to") + oneOf(". ! ? !! !!!")
    greetingstring = "Hello, World!"  # sample input, assumed for illustration
    results = helloGrammar.parseString(greetingstring)
    print(results.greeting)  # Hello
    print(results.to)        # World

You can also attach parse actions (a la SAX callbacks) to parse elements;
an action fires each time its element matches in the input.
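As a rough sketch of a parse action, the snippet below converts a numeric
token to an int as it is parsed.  The "name quantity" record layout and the
names integer, item, name, and qty are hypothetical, chosen only to show the
mechanism:

```python
from pyparsing import Word, alphas, nums

# The action receives the matched tokens; its return value replaces them.
integer = Word(nums).setParseAction(lambda tokens: int(tokens[0]))

# Hypothetical record: an alphabetic name followed by an integer quantity.
item = Word(alphas).setResultsName("name") + integer.setResultsName("qty")

result = item.parseString("widget 42")
print(result.name)      # widget
print(result.qty + 1)   # 43, since qty is already an int
```

The same hook could be used to strip padding or convert dates before the
values are written to a database.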

You can find the pyparsing home page at http://pyparsing.sourceforge.net.

-- Paul McGuire




