Need help parsing with pyparsing...

Paul McGuire ptmcg at austin.rr.com
Mon Oct 22 12:22:38 EDT 2007


On Oct 22, 4:18 am, "Just Another Victim of the Ambient Morality"
<ihates... at hotmail.com> wrote:
>     I'm trying to parse with pyparsing but the grammar I'm using is somewhat
> unorthodox.  I need to be able to parse something like the following:
>
> UPPER CASE WORDS And Title Like Words
>
>     ...into two sentences:
>
> UPPER CASE WORDS
> And Title Like Words
>
>     I'm finding this surprisingly hard to do.  The problem is that pyparsing
> implicitly assumes whitespace characters are ignorable and is (perhaps
> necessarily) greedy with its term matching.  All my attempts at the
> described parsing either fail to parse or parse incorrectly, like so:
>
> UPPER CASE WORDS A
> nd Title Like Words
>
>     Frankly, I'm stuck.  I don't know how to parse this grammar with
> pyparsing.
>     Does anyone know how to accomplish what I'm trying to do?
>     Thank you...

Yes, whitespace skipping does get in the way sometimes.  In your case,
you need to clarify that each word that is parsed must be followed by
whitespace.  See the options and comments in the code below:

from pyparsing import *

data = "UPPER CASE WORDS And Title Like Words"

# Option 1 - qualify Word instance with asKeyword=True
upperCaseWord = Word(alphas.upper(), asKeyword=True)
titleLikeWord = Word(alphas.upper(), alphas.lower(), asKeyword=True)

# Option 2 - explicitly state that each word must be followed by whitespace
# (the last title word may instead be followed by the end of the string)
upperCaseWord = Word(alphas.upper()) + FollowedBy(White())
titleLikeWord = (Word(alphas.upper(), alphas.lower()) +
                 (FollowedBy(White()) | StringEnd()))

# Option 3 - use regexes - note, still have to use lookahead to avoid matching
# 'A' in 'And'
upperCaseWord = Regex(r"[A-Z]+(?=\s)")
titleLikeWord = Regex(r"[A-Z][a-z]*")

# create grammar, with some friendly results names
grammar = (OneOrMore(upperCaseWord)("allCaps") +
           OneOrMore(titleLikeWord)("title"))

# dump out the parsed results
print(grammar.parseString(data).dump())
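
As an aside on the lookahead in Option 3, here is a small sketch that uses
only the standard-library re module (not pyparsing) to show why a bare
[A-Z]+ is not enough.  The helper name leading_caps is made up for this
illustration, and the hand-written whitespace skipping only roughly imitates
what pyparsing does implicitly.

```python
import re

WHITESPACE = re.compile(r"\s*")

def leading_caps(text, pattern):
    """Collect the run of words at the start of text that match pattern.

    Hypothetical helper: whitespace between words is skipped by hand,
    as a rough stand-in for pyparsing's implicit whitespace skipping.
    """
    word = re.compile(pattern)
    words, pos = [], 0
    while True:
        m = word.match(text, pos)
        if m is None:
            return words
        words.append(m.group())
        # skip any whitespace before trying the next word
        pos = WHITESPACE.match(text, m.end()).end()

data = "UPPER CASE WORDS And Title Like Words"

# Without a lookahead, the 'A' of 'And' is taken as an upper-case word.
greedy = leading_caps(data, r"[A-Z]+")
# greedy == ['UPPER', 'CASE', 'WORDS', 'A']

# With the same lookahead as Option 3, 'A' is rejected because 'n' follows it.
guarded = leading_caps(data, r"[A-Z]+(?=\s)")
# guarded == ['UPPER', 'CASE', 'WORDS']
```

The first result reproduces the incorrect split from the original question;
the second stops cleanly before 'And'.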


All three options print out:

['UPPER', 'CASE', 'WORDS', 'And', 'Title', 'Like', 'Words']
- allCaps: ['UPPER', 'CASE', 'WORDS']
- title: ['And', 'Title', 'Like', 'Words']

Once you have this, you can rejoin the words with " ".join, or
whatever you like.
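
For instance, a minimal sketch of the rejoining step.  To keep it runnable
without pyparsing installed, the two word lists are copied by hand from the
dump output; with a real parse result you would pass the allCaps and title
results names instead.  The function name to_sentences is hypothetical.

```python
def to_sentences(all_caps, title):
    """Rejoin each group of parsed words into one space-separated sentence."""
    return [" ".join(all_caps), " ".join(title)]

# Word lists copied by hand from the dump output; with pyparsing these would
# come from the named results, e.g. result.allCaps and result.title.
all_caps = ['UPPER', 'CASE', 'WORDS']
title = ['And', 'Title', 'Like', 'Words']

for sentence in to_sentences(all_caps, title):
    print(sentence)
```

This prints the two sentences asked for in the original question, one per
line.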

-- Paul



