Parsing a file with iterators

Fri Oct 17 16:44:47 EDT 2008

On Oct 17, 10:42 am, Luis Zarrabeitia <ky... at uh.cu> wrote:
> I need to parse a text file. The format is something like this:
>
> TYPE1 metadata
> data line 1
> data line 2
> ...
> data line N
> TYPE2 metadata
> data line 1
> ...
> TYPE3 metadata
> ...
>
> And so on. The type and metadata determine how to parse the following data
> lines. When the parser fails to parse one of the lines, the next parser is
> chosen (or if there is no 'TYPE metadata' line there, an exception is thrown).
>
<snip>

Pyparsing will take care of this for you if you define the record
types as a set of alternatives and then parse or search for them.
Here is an annotated example.  Note that you can attach names to
individual fields of the parser, and then access those fields by name
on the parsed results.

"""
TYPE1 metadata
data line 1
data line 2
...
data line N
TYPE2 metadata
data line 1
...
TYPE3 metadata
...
"""

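from pyparsing import (Word, nums, alphas, quotedString, removeQuotes,
    Combine, Optional, oneOf, Suppress, Group, OneOrMore, delimitedList,
    countedArray)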

# define basic element types to be used in data formats
integer = Word(nums)
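# copy quotedString before attaching a parse action, so the shared
# module-level expression is not modified in place
ident = Word(alphas) | quotedString.copy().setParseAction(removeQuotes)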
zipcode = Combine(Word(nums, exact=5) +
                  Optional("-" + Word(nums, exact=4)))
stateAbbreviation = oneOf("""AA AE AK AL AP AR AS AZ CA CO CT DC DE
    FL FM GA GU HI IA ID IL IN KS KY LA MA MD ME MH MI MN MO MP MS
    MT NC ND NE NH NJ NM NV NY OH OK OR PA PR PW RI SC SD TN TX UT
    VA VI VT WA WI WV WY""".split())

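# define data format for each type; every data line starts with the
# literal "data", which is suppressed from the results
DATA = Suppress("data")
type1dataline = Group(DATA + OneOrMore(integer))    # one or more integers
type2dataline = Group(DATA + delimitedList(ident))  # comma-separated idents
# countedArray reads a leading count N followed by N idents and returns
# them already grouped, so no explicit Group is needed here
type3dataline = DATA + countedArray(ident)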

# define complete expressions for each type - note different types
# may have different metadata
type1data = "TYPE1" + ident("name") + \
    OneOrMore(type1dataline)("data")
type2data = "TYPE2" + ident("name") + zipcode("zip") + \
    OneOrMore(type2dataline)("data")
type3data = "TYPE3" + ident("name") + stateAbbreviation("state") + \
    OneOrMore(type3dataline)("data")

# expression containing all different type alternatives
data = type1data | type2data | type3data

# search a test input string and dump the matched tokens by name
testInput = """
TYPE1 Abercrombie
data 400 26 42 66
data 1 1 2 3 5 8 13 21
data 1 4 9 16 25 36
data 1 2 4 8 16 32 64
TYPE2 Benjamin 78704
data Larry, Curly, Moe
data Hewey,Dewey ,Louie
data Tom  , Dick, Harry, Fred
data Thelma,Louise
TYPE3 Christopher WA
data 3 "Raspberry Red" "Lemon Yellow" "Orange Orange"
data 7 Grumpy Sneezy Happy Dopey Bashful Sleepy Doc
"""
for tokens in data.searchString(testInput):
    print(tokens.dump())
    print(tokens.name)
    # names that were not matched come back as an empty string
    if tokens.state:
        print(tokens.state)
    for d in tokens.data:
        print(" ", d)
    print()

Prints:

['TYPE1', 'Abercrombie', ['400', '26', '42', '66'], ['1', '1', '2',
'3', '5', '8', '13', '21'], ['1', '4', '9', '16', '25', '36'], ['1',
'2', '4', '8', '16', '32', '64']]
- data: [['400', '26', '42', '66'], ['1', '1', '2', '3', '5', '8',
'13', '21'], ['1', '4', '9', '16', '25', '36'], ['1', '2', '4', '8',
'16', '32', '64']]
- name: Abercrombie
Abercrombie
  ['400', '26', '42', '66']
  ['1', '1', '2', '3', '5', '8', '13', '21']
  ['1', '4', '9', '16', '25', '36']
  ['1', '2', '4', '8', '16', '32', '64']

['TYPE2', 'Benjamin', '78704', ['Larry', 'Curly', 'Moe'], ['Hewey',
'Dewey', 'Louie'], ['Tom', 'Dick', 'Harry', 'Fred'], ['Thelma',
'Louise']]
- data: [['Larry', 'Curly', 'Moe'], ['Hewey', 'Dewey', 'Louie'],
['Tom', 'Dick', 'Harry', 'Fred'], ['Thelma', 'Louise']]
- name: Benjamin
- zip: 78704
Benjamin
  ['Larry', 'Curly', 'Moe']
  ['Hewey', 'Dewey', 'Louie']
  ['Tom', 'Dick', 'Harry', 'Fred']
  ['Thelma', 'Louise']

['TYPE3', 'Christopher', 'WA', ['Raspberry Red', 'Lemon Yellow',
'Orange Orange'], ['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful',
'Sleepy', 'Doc']]
- data: [['Raspberry Red', 'Lemon Yellow', 'Orange Orange'],
['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful', 'Sleepy', 'Doc']]
- name: Christopher
- state: WA
Christopher
WA
  ['Raspberry Red', 'Lemon Yellow', 'Orange Orange']
  ['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful', 'Sleepy', 'Doc']
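
For comparison, the same header-driven dispatch can be sketched with a
plain line iterator and no third-party library.  This is only a rough
sketch under stated assumptions: the names HEADER, PARSERS,
parse_ints, parse_names and parse_records are hypothetical, only the
TYPE1 and TYPE2 data formats from the example above are handled, and
the metadata is kept as one unparsed string.  Unlike searchString,
which skips over text it cannot match, this version raises an error
when a data line appears before any TYPE header.

```python
import re

# a header line looks like "TYPE<n> <metadata>"
HEADER = re.compile(r"^(TYPE\d+)\s+(.*)$")

def parse_ints(body):
    # TYPE1 data lines: whitespace-separated integers
    return [int(tok) for tok in body.split()]

def parse_names(body):
    # TYPE2 data lines: comma-separated names, surrounding blanks ignored
    return [name.strip() for name in body.split(",")]

# hypothetical dispatch table: record type -> parser for its data lines
PARSERS = {"TYPE1": parse_ints, "TYPE2": parse_names}

def parse_records(lines):
    """Yield (type, metadata, rows) for each TYPE block in an iterable of lines."""
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            # a new header closes the previous block, if there was one
            if current is not None:
                yield current
            kind, metadata = header.groups()
            if kind not in PARSERS:
                raise ValueError("unknown record type: %s" % kind)
            current = (kind, metadata, [])
        elif current is None:
            raise ValueError("data line before any TYPE header: %r" % line)
        elif line.startswith("data "):
            # hand the rest of the line to the parser chosen by the header
            body = line[len("data "):]
            current[2].append(PARSERS[current[0]](body))
        else:
            raise ValueError("unrecognized line: %r" % line)
    if current is not None:
        yield current

sample = """
TYPE1 Abercrombie
data 400 26 42 66
data 1 4 9
TYPE2 Benjamin 78704
data Larry, Curly, Moe
"""
records = list(parse_records(sample.splitlines()))
```

Because parse_records is a generator, passing an open file object
instead of a list of lines processes the file one block at a time.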

More info on pyparsing at http://pyparsing.wikispaces.com.

-- Paul