A nice way to use regex for complicate parsing

aspineux aspineux at gmail.com
Fri Mar 30 09:23:17 EDT 2007


On 29 mar, 17:33, "Paul McGuire" <p... at austin.rr.com> wrote:
> On Mar 29, 9:42 am, Shane Geiger <sgei... at ncee.net> wrote:
>
> > It would be worth learning pyparsing to do this.
>
> Thanks to Shane and Steven for the ref to pyparsing.  I also was
> struck by this post, thinking "this is pyparsing written in re's and
> dicts".

My first idea was: why learn a parsing library if I can do it using 're'
and dicts :-)
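
For anyone who missed the start of the thread, here is a minimal sketch of
the idea, not my exact code. Each rule is a regex template that refers to
earlier rules through %(name)s, and a dict accumulates the expanded
patterns. The rule names and the simplified address grammar are
illustrative assumptions.

```python
import re

# Each rule may refer to earlier rules by name through %(name)s.
# The rule names and this simplified grammar are illustrative only.
rules = [
    ("snum", r"[0-9]{1,3}"),
    ("dotnum", r"%(snum)s(?:\.%(snum)s){3}"),
    ("string", r"[A-Za-z0-9]+"),
    ("dotstring", r"%(string)s(?:\.%(string)s)*"),
    ("domain", r"(?:%(dotnum)s|%(dotstring)s)"),
    ("mailbox", r"(?P<localpart>%(dotstring)s)@(?P<domain>%(domain)s)"),
]

patterns = {}
for name, template in rules:
    # Interpolate against the cumulative dict of prior expressions.
    patterns[name] = template % patterns

# Anchor at the end so trailing garbage is rejected.
mailbox_re = re.compile(patterns["mailbox"] + r"\Z")
m = mailbox_re.match("john.smith@mail.example.com")
print(m.groupdict())
```

The order of the rules matters: a template can only use names that are
already in the dict when it is expanded.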

>
> The approach you are taking is *very* much like the thought process I
> went through when first implementing pyparsing.  I wanted to easily
> compose expressions from other expressions.  In your case, you are
> string interpolating using a cumulative dict of prior expressions.
> Pyparsing uses various subclasses of the ParserElement class, with
> operator definitions for alternation ("|" or "^" depending on non-
> greedy vs. greedy), composition ("+"), and negation ("~").  Pyparsing
> also uses its own extended results construct, ParseResults, which
> supports named results fields, accessible using list indices, dict
> names, or instance names.
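
The standard 're' module has a smaller version of this: a named group can
be read by position, by name through group(), or together with the others
as a dict through groupdict(). A minimal sketch; the pattern and the sample
command are my own illustrative choices, not taken from Paul's code.

```python
import re

# Named groups give each sub-match a name as well as a position.
cmd_re = re.compile(r"(?P<verb>MAIL FROM|RCPT TO):<(?P<mailbox>[^>]+)>")
m = cmd_re.match("MAIL FROM:<john@example.com>")

print(m.group(1))           # access by index
print(m.group("mailbox"))   # access by name
print(m.groupdict())        # all named groups as a dict
```

Unlike ParseResults there is no nesting here, so this only covers the flat
case.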
>
> Here is the pyparsing treatment of your example (I may not have gotten
> every part correct, but my point is more the similarity of our
> approaches).  Note the access to the smtp parameters via the Dict
> transformer.
>
> -- Paul
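
With 're', similar access to the SMTP parameters can be had by collecting
the keyword=value pairs into a plain dict. This is a sketch under my own
assumptions: the helper name parse_params is hypothetical, and the pattern
follows the esmtp-keyword and esmtp-value rules only loosely.

```python
import re

# keyword: an alphanumeric, then alphanumerics or "-";
# value: one or more characters that are neither "=" nor whitespace.
PARAM_RE = re.compile(r"([A-Za-z0-9][A-Za-z0-9-]*)=([^=\s]+)")

def parse_params(text):
    """Collect keyword=value pairs into a dict (hypothetical helper)."""
    return {key.upper(): value for key, value in PARAM_RE.findall(text)}

params = parse_params("SIZE=1234 OTHER=foo@example.com")
if "SIZE" in params:
    print("SIZE is", params["SIZE"])
```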

Thanks !

Every parsing library I have used before was heavy to get started with.
The benefit was inversely proportional to the size of the project.
Yours looks lighter, and the results are easier to use.

Thanks for showing me your lib.

Anyway, for now I will keep my approach for small parsing jobs.


Alain


>
> from pyparsing import *
>
> # <dotnum> ::= <snum> "." <snum> "." <snum> "." <snum>
> intgr = Word(nums)
> dotnum = Combine(intgr + "." + intgr + "." + intgr + "." + intgr)
>
> # <dot-string> ::= <string> | <string> "." <dot-string>
> string_ = Word(alphanums)
> dotstring = Combine(delimitedList(string_,"."))
>
> # <domain> ::=  <element> | <element> "." <domain>
> domain = dotnum | dotstring
>
> # <q> ::= any one of the 128 ASCII characters except <CR>, <LF>, quote
> #         ("), or backslash (\)
> # <x> ::= any one of the 128 ASCII characters (no exceptions)
> # <qtext> ::=  "\" <x> | "\" <x> <qtext> | <q> | <q> <qtext>
> # <quoted-string> ::=  """ <qtext> """
> # just use the pre-defined expression from pyparsing
> quotedString = dblQuotedString
>
> # <local-part> ::= <dot-string> | <quoted-string>
> localpart = (dotstring | quotedString).setResultsName("localpart")
>
> # <mailbox> ::= <local-part> "@" <domain>
> mailbox = Combine(localpart + "@" + domain).setResultsName("mailbox")
>
> # <path> ::= "<" [ <a-d-l> ":" ] <mailbox> ">"
> # also accept address without <>
> path = "<" + mailbox + ">" | mailbox
>
> # esmtp-keyword    ::= (ALPHA / DIGIT) *(ALPHA / DIGIT / "-")
> esmtpkeyword = Word(alphanums,alphanums+"-")
>
> # esmtp-value      ::= 1*<any CHAR excluding "=", SP, and all
> #                      control characters (US ASCII 0-31 inclusive)>
> #                      ; syntax and values depend on esmtp-keyword
> esmtpvalue = Regex(r'[^= \t\r\n\f\v]*')
>
> # esmtp-parameter  ::= esmtp-keyword ["=" esmtp-value]
> esmtpparameters = Dict(
>     ZeroOrMore( Group(esmtpkeyword + Suppress("=") + esmtpvalue) ) )
>
> # esmtp-cmd        ::= inner-esmtp-cmd [SP esmtp-parameters] CR LF
> esmtp_addr = path + \
>                 Optional(esmtpparameters,default=[])\
>                 .setResultsName("parameters")
>
> for t in tests:
>         for keyword in [ 'MAIL FROM:', 'RCPT TO:' ]:
>                 keylen=len(keyword)
>                 if t[:keylen].upper()==keyword:
>                         # strip the command prefix and stop at the first match
>                         t=t[keylen:]
>                         break
>
>         try:
>             match = esmtp_addr.parseString(t)
>             print 'MATCH'
>             print match.dump()
>             # some sample code to access elements of the parameters "dict"
>             if "SIZE" in match.parameters:
>                 print "SIZE is", match.parameters.SIZE
>             print
>         except ParseException,pe:
>             print 'DONT match', t
>
> prints:
> MATCH
> ['<', ['johnsmith at addresscom'], '>']
> - mailbox: ['johnsmith at addresscom']
>   - localpart: johnsmith
> - parameters: []
>
> MATCH
> [['johnsmith at addresscom']]
> - mailbox: ['johnsmith at addresscom']
>   - localpart: johnsmith
> - parameters: []
>
> MATCH
> ['<', ['johnsmith at addresscom'], '>', ['SIZE', '1234'], ['OTHER',
> '... at bar.com']]
> - OTHER: f... at bar.com
> - SIZE: 1234
> - mailbox: ['johnsmith at addresscom']
>   - localpart: johnsmith
> - parameters: [['SIZE', '1234'], ['OTHER', '... at bar.com']]
>   - OTHER: f... at bar.com
>   - SIZE: 1234
> SIZE is 1234
>
> MATCH
> [['johnsmith at addresscom'], ['SIZE', '1234'], ['OTHER', '... at bar.com']]
> - OTHER: f... at bar.com
> - SIZE: 1234
> - mailbox: ['johnsmith at addresscom']
>   - localpart: johnsmith
> - parameters: [['SIZE', '1234'], ['OTHER', '... at bar.com']]
>   - OTHER: f... at bar.com
>   - SIZE: 1234
> SIZE is 1234
>
> MATCH
> ['<', ['"t... at is.a> legal=email"@addresscom'], '>']
> - mailbox: ['"t... at is.a> legal=email"@addresscom']
>   - localpart: "t... at is.a> legal=email"
> - parameters: []
>
> MATCH
> [['"t... at is.a> legal=email"@addresscom']]
> - mailbox: ['"t... at is.a> legal=email"@addresscom']
>   - localpart: "t... at is.a> legal=email"
> - parameters: []
>
> MATCH
> ['<', ['"t... at is.a> legal=email"@addresscom'], '>', ['SIZE', '1234'],
> ['OTHER', '... at bar.com']]
> - OTHER: f... at bar.com
> - SIZE: 1234
> - mailbox: ['"t... at is.a> legal=email"@addresscom']
>   - localpart: "t... at is.a> legal=email"
> - parameters: [['SIZE', '1234'], ['OTHER', '... at bar.com']]
>   - OTHER: f... at bar.com
>   - SIZE: 1234
> SIZE is 1234
>
> MATCH
> [['"t... at is.a> legal=email"@addresscom'], ['SIZE', '1234'], ['OTHER',
> '... at bar.com']]
> - OTHER: f... at bar.com
> - SIZE: 1234
> - mailbox: ['"t... at is.a> legal=email"@addresscom']
>   - localpart: "t... at is.a> legal=email"
> - parameters: [['SIZE', '1234'], ['OTHER', '... at bar.com']]
>   - OTHER: f... at bar.com
>   - SIZE: 1234
> SIZE is 1234




