Named regexp variables, an extension proposal.

Sat May 13 12:38:42 EDT 2006

"Paddy" <paddy3118 at netscape.net> wrote in message
news:1147513160.977268.253690 at j33g2000cwa.googlegroups.com...
> Proposal: Named RE variables
> ======================
>
> The problem I have is that I am writing a 'good-enough' verilog tag
> extractor as a long regular expression (with the 'x' flag for
> readability), and find myself both
>  1) Repeating sections of the RE, and
>  2) Wanting to add '(?P<some_clarifier>...) ' around sections
>      because I know what the section does but don't really want
>      the group.
>
> If I could write:
>  (?P/verilog_name/ [A-Za-z_][A-Za-z_0-9\$\.]* | \\\S+ )
>
> ...and have the RE parser extract the section of RE after the second
> '/' and store it associated with its name that appears between the
> first two '/'. The RE should NOT try and match against anything between
> the outer '(' ')' pair at this point, just store.
>
> Then the following code appearing later in the RE:
>   (?P=verilog_name)
>
> ...should retrieve the RE snippet named and insert it into the RE
> instead of the '(?P=...)' group before interpreting the RE 'as normal'
>
> Instead of writing the following to search for event declarations:
>   vlog_extract = r'''(?smx)
>     # Verilog event definition extraction
>     (?: event \s+ [A-Za-z_][A-Za-z_0-9\$\.]* \s* (?: , \s*
> [A-Za-z_][A-Za-z_0-9\$\.]*)* )
>   '''
> I could write the following RE, which I think is clearer:
>   vlog_extract = r'''(?smx)
>     # Verilog identifier definition
>     (?P/IDENT/ [A-Za-z_][A-Za-z_0-9\$\.]* (?!\.) )
>     # Verilog event definition extraction
>     (?: event \s+ (?P=IDENT) \s* (?: , \s* (?P=IDENT))* )
>   '''
>

By contrast, the event declaration expression in the pyparsing Verilog
parser is:

identLead = alphas+"$_"
identBody = alphanums+"$_"
#~ identifier = Combine( Optional(".") +
#~                       delimitedList( Word(identLead, identBody), ".",
#~                                      combine=True ) ).setName("baseIdent")
# replace pyparsing composition with Regex - improves performance ~10% for
# this construct
identifier = Regex(
    r"\.?["+identLead+"]["+identBody+r"]*(\.["+identLead+"]["+identBody+"]*)*"
    ).setName("baseIdent")

eventDecl = Group( "event" + delimitedList( identifier ) + semi )

But why do you need an update to RE to compose snippets?  Especially
snippets that you can only use in the same RE?  Just do string interp:

> I could write the following RE, which I think is clearer:
>   vlog_extract = r'''(?smx)
>     # Verilog identifier definition
>     (?P/IDENT/ [A-Za-z_][A-Za-z_0-9\$\.]* (?!\.) )
>     # Verilog event definition extraction
>     (?: event \s+ (?P=IDENT) \s* (?: , \s* (?P=IDENT))* )
>   '''
IDENT = r"[A-Za-z_][A-Za-z_0-9\$\.]* (?!\.)"
vlog_extract = r'''(?smx)
  # Verilog event definition extraction
  (?: event \s+ %(IDENT)s \s* (?: , \s* %(IDENT)s)* )
  ''' % locals()

Yuk, this is a mess - which '%' signs are part of RE and which are for
string interp?  Maybe just plain old string concat is better:

IDENT = r"[A-Za-z_][A-Za-z_0-9\$\.]* (?!\.)"
vlog_extract = r'''(?smx)
  # Verilog event definition extraction
  (?: event \s+ ''' + IDENT + r''' \s* (?: , \s* ''' + IDENT + r''')* )'''
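
As a quick sanity check, here is a minimal sketch that compiles the
concatenated pattern with the standard re module and runs it over one line
of text.  Every piece of the pattern is written as a raw string so the
backslashes reach the regex engine untouched.  The names event_re and
sample are made up for illustration, and the sample text is not from any
real design.

```python
import re

# Reusable identifier snippet, written once and spliced in twice below.
IDENT = r"[A-Za-z_][A-Za-z_0-9\$\.]* (?!\.)"

vlog_extract = r'''(?smx)
  # Verilog event definition extraction
  (?: event \s+ ''' + IDENT + r''' \s* (?: , \s* ''' + IDENT + r''')* )'''

event_re = re.compile(vlog_extract)

# Hypothetical sample input, for illustration only.
sample = "module m; event start_tx, done_rx ; endmodule"

match = event_re.search(sample)
found = match.group(0) if match else None
print(found)
```

Because the 'x' flag is set, the spaces inside IDENT and around the spliced
pieces are ignored by the regex engine, so the concatenation does not have
to be careful about whitespace.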

By the way, your IDENT is not totally accurate - it does not permit a
leading ".", and it does permit leading digits in identifier elements after
the first ".".  So ".goForIt" would not be matched as a valid identifier
when it should be, and "go.4it" would be matched as valid when it shouldn't
be (at least as far as I read the Verilog grammar).
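
To make those two cases concrete, here is a small sketch using plain re.
LOOSE_IDENT is the IDENT from the original post with the verbose-mode
whitespace removed.  STRICT_IDENT is a hypothetical name for a variant that
follows the shape of the pyparsing Regex above (an optional leading ".",
then "."-separated elements that each start with a letter, "$" or "_").
It reflects one reading of the grammar and is not an authoritative
definition.

```python
import re

# The IDENT from the original post, without the verbose-mode spaces.
LOOSE_IDENT = r"[A-Za-z_][A-Za-z_0-9\$\.]*(?!\.)"

# Hypothetical stricter variant: each element between dots must start with
# a letter, '$' or '_' (an assumption modeled on the pyparsing Regex).
ident_lead = r"A-Za-z$_"
ident_body = r"A-Za-z0-9$_"
element = "[" + ident_lead + "][" + ident_body + "]*"
STRICT_IDENT = r"\.?" + element + r"(?:\." + element + ")*"


def is_ident(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# Map each candidate name to (loose result, strict result).
results = {
    name: (is_ident(LOOSE_IDENT, name), is_ident(STRICT_IDENT, name))
    for name in (".goForIt", "go.4it", "top.sub.sig")
}

for name, (loose, strict) in results.items():
    print(name, "loose:", loose, "strict:", strict)
```

The two patterns should agree on an ordinary hierarchical name such as
"top.sub.sig" and disagree on the two problem cases.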

(Pyparsing (http://sourceforge.net/projects/pyparsing/) is open source under
the MIT license.  The Verilog grammar is not distributed with pyparsing, and
is only available free for noncommercial use.)

-- Paul