Python 3 regex woes (parsing ISC DHCPD config)

Thomas 'PointedEars' Lahn PointedEars at web.de
Tue Jan 13 11:09:36 EST 2015


Thomas 'PointedEars' Lahn wrote:

> Jason Bailey wrote:
>> shared-network My-Network-MOHE {
>>    […] {
>>
>> I compile my regex:
>> m = re.compile(r"^(shared\-network (" + re.escape(shared_network) + r")
>> \{((\n|.|\r\n)*?)(^\}))", re.MULTILINE|re.UNICODE)
> 
> This code does not run as posted.  Applying Occam’s Razor, I think you
> meant to post
> 
> m = re.compile(r"^(shared\-network ("
>   + re.escape(shared_network)
>   + r") \{((\n|.|\r\n)*?)(^\}))", re.MULTILINE|re.UNICODE)
> 
> […]
> You get no matches because you have escaped the HYPHEN-MINUSes (“-”).  You
> never need to escape those characters, in fact you must not do that here
> because r'\-' is not an (unnecessarily) escaped HYPHEN-MINUS, it is a
> literal backslash followed by a HYPHEN-MINUS, a character sequence that
> does not occur in your string.  Outside of a character class you do not
> need to do that, and in a character class you can put it as first or last
> character instead (“[-…]” or “[…-]”).
> 
> You have escaped the first HYPHEN-MINUS; re.escape() has escaped the other
> two for you:
> 
> | >>> re.escape('-')
> | '\\-'
> 
> I presume this behavior is because of character classes, and the idea that
> the return value should work at any position in a character class.

It would appear that while my answer is not entirely wrong, the first
sentence of that section is.  You may escape the HYPHEN-MINUS there, and you
may use re.escape(); neither changes what the expression matches.  One must
consider that the string is first parsed by Python’s string parser and then
by Python’s re parser: the raw string literal r'\-' does contain a backslash
followed by a HYPHEN-MINUS, but the re parser reads that sequence as an
escaped, i.e. literal, HYPHEN-MINUS.

So at present I have no specific idea why you get no matches; however,

  r'\{((\n|.|\r\n)*?)(^\})'

is not a proper way to match matching braces and everything in-between.

To begin with, the proper expression to match any newline is r'(\r?\n|\r)' 
because the first matching alternative in an alternation, not the longest 
one, wins.  But if you specify re.DOTALL, you can simply use “.” for any 
character (including any newline combination).
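
A small sketch of both points, assuming a made-up test string that mixes
the three newline conventions (the variable names are illustrative only):

```python
import re

text = "a\r\nb\rc\nd"   # CRLF, lone CR, lone LF

# The first alternative that matches wins, not the longest one:
# with "\r" listed first, a CRLF pair is split into two matches.
wrong_order = re.findall(r"\r|\r?\n", text)
right_order = re.findall(r"\r?\n|\r", text)

# With re.DOTALL, "." matches newline characters as well, so no
# newline alternation is needed inside the group.
dotall = re.search(r"\{(.*?)\}", "{ a\r\n b }", re.DOTALL)

print(wrong_order)
print(right_order)
print(repr(dotall.group(1)))
```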
 
> […]
> You should refrain from parsing non-regular languages with a *single*
> regular expression (multiple expressions or expressions with alternation
> in a loop are usually fine; this can be used for building efficient
> parsers), even though Python’s regular expressions, which are not an
> exception there, are not exactly “regular” in the theoretical computer
> science sense.  See the Chomsky hierarchy and Jeffrey E. F. Friedl’s
> insightful textbook “Mastering Regular Expressions”.

And for matching matching braces (sic!) with regular expressions, you need a 
recursive one (which is another extension of regular expressions as they are 
discussed in CS).  Or a parser in the first place.  Otherwise you match too 
much with greedy matching

  { { } } { { } }
  ^-------------^

or too little with non-greedy matching

  { { } } { { } }
  ^---^
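
Both failure modes can be reproduced with Python’s re module.  A minimal
sketch, using the brace string from the two diagrams as input:

```python
import re

text = "{ { } } { { } }"

# Greedy: runs from the first "{" to the very last "}".
greedy = re.search(r"\{.*\}", text, re.DOTALL).group()

# Non-greedy: stops at the first "}" it can reach, which closes the
# inner brace, not the outer one.
lazy = re.search(r"\{.*?\}", text, re.DOTALL).group()

print(repr(greedy))
print(repr(lazy))
```

Neither result is the first balanced block, "{ { } }".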

CS regular expressions can be used to describe *regular* languages (Chomsky
type 3).  Bracket languages are, in general, not regular (see “pumping lemma
for regular languages”), so for them you need a PDA¹-like extension of CS
regular expressions (the aforementioned recursive ones), or a PDA
implementation in the first place.  Such a PDA implementation is part of a
parser.
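
As a rough sketch of the parser approach (not a full dhcpd.conf parser): a
regular expression locates the block header, and a depth counter, which
serves as the stack of a very simple PDA because there is only one kind of
bracket, finds the matching closing brace.  The function name
find_block_end, the sample config, and its addresses are hypothetical;
braces inside quoted strings or comments are not handled here.

```python
import re


def find_block_end(text, start):
    """Return the index just past the "}" that closes the "{" at `start`."""
    if text[start] != "{":
        raise ValueError("start must point at an opening brace")
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1            # push
        elif text[i] == "}":
            depth -= 1            # pop
            if depth == 0:
                return i + 1
    raise ValueError("unbalanced braces")


# Hypothetical sample input, for illustration only.
config = """\
shared-network My-Network-MOHE {
    subnet 10.0.0.0 netmask 255.255.255.0 {
        range 10.0.0.10 10.0.0.20;
    }
}
shared-network Other {
}
"""

header = re.compile(
    r"^shared-network " + re.escape("My-Network-MOHE") + r" \{",
    re.MULTILINE,
)
m = header.search(config)
open_idx = m.end() - 1                  # position of the opening "{"
end = find_block_end(config, open_idx)
body = config[open_idx + 1:end - 1]     # text between the outer braces
print(body)
```

The regular expression only has to recognize the header line, which is a
regular language; the nesting is left to the counter.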

____
¹  <https://en.wikipedia.org/wiki/Pushdown_automaton>
-- 
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.


