Python 3 regex woes (parsing ISC DHCPD config)
Thomas 'PointedEars' Lahn
PointedEars at web.de
Tue Jan 13 11:09:36 EST 2015
Thomas 'PointedEars' Lahn wrote:
> Jason Bailey wrote:
>> shared-network My-Network-MOHE {
>> […] {
>>
>> I compile my regex:
>> m = re.compile(r"^(shared\-network (" + re.escape(shared_network) + r")
>> \{((\n|.|\r\n)*?)(^\}))", re.MULTILINE|re.UNICODE)
>
> This code does not run as posted. Applying Occam’s Razor, I think you
> meant to post
>
> m = re.compile(r"^(shared\-network ("
> + re.escape(shared_network)
> + r") \{((\n|.|\r\n)*?)(^\}))", re.MULTILINE|re.UNICODE)
>
> […]
> You get no matches because you have escaped the HYPHEN-MINUSes (“-”). You
> never need to escape those characters, in fact you must not do that here
> because r'\-' is not an (unnecessarily) escaped HYPHEN-MINUS, it is a
> literal backslash followed by a HYPHEN-MINUS, a character sequence that
> does not occur in your string. Outside of a character class you do not
> need to do that, and in a character class you can put it as first or last
> character instead (“[-…]” or “[…-]”).
>
> You have escaped the first HYPHEN-MINUS; re.escape() has escaped the other
> two for you:
>
> | >>> re.escape('-')
> | '\\-'
>
> I presume this behavior is because of character classes, and the idea that
> the return value should work at any position in a character class.
It would appear that while my answer is not entirely wrong, the first
sentence of that section is. You may escape the HYPHEN-MINUS there, and may
use re.escape(); it has no effect on the expression because of what I said
following that sentence. One must consider that the string is first parsed
by Python’s string parser and then by Python’s re parser.
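
A quick check supports this. The following is only a minimal sketch: the
sample line is taken from the quoted configuration, and the variable names
are chosen for illustration. It suggests that the escaped form matches the
same text as the unescaped one:

```python
import re

shared_network = 'My-Network-MOHE'
line = 'shared-network My-Network-MOHE {'

# In a raw string, r'\-' reaches the re parser as backslash + hyphen,
# which re treats as an escaped, i.e. literal, HYPHEN-MINUS.
escaped = re.compile(
    r'^shared\-network (' + re.escape(shared_network) + r') \{')

# The same expression without any escaped HYPHEN-MINUS, for comparison.
plain = re.compile(r'^shared-network (My-Network-MOHE) \{')

escaped_match = escaped.search(line)
plain_match = plain.search(line)
print(escaped_match.group(1))
```

Both expressions match the header line, so the escaping alone should not
be the reason for the missing matches.
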
So at present I have no specific idea why you get no matches. However,
r'\{((\n|.|\r\n)*?)(^\})'
is not a proper way to match matching braces and everything in-between.
To begin with, the proper expression to match any newline is r'(\r?\n|\r)'
because the first matching alternative in an alternation, not the longest
one, wins. But if you specify re.DOTALL, you can simply use “.” for any
character (including any newline combination).
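
To illustrate both points, here is a small sketch; the sample strings are
made up for the purpose and are not from the original configuration:

```python
import re

text = 'a\r\nb\nc\rd'

# CRLF is tried before a lone CR, so it is consumed as one newline.
good = re.findall(r'\r?\n|\r', text)

# The first matching alternative wins: CR matches before CRLF is ever
# tried, so CRLF is reported as two separate newlines.
bad = re.findall(r'\r|\n|\r\n', text)

config = 'shared-network X {\n  option foo;\n}\n'

# With re.DOTALL, "." also matches newline characters, so no newline
# alternation is needed; re.MULTILINE lets "^" match at each line start.
body = re.search(r'\{(.*?)^\}', config, re.DOTALL | re.MULTILINE).group(1)

print(good)
print(bad)
print(repr(body))
```

Note that this only works here because the sample block is not nested and
its closing brace is the first one at the start of a line.
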
> […]
> You should refrain from parsing non-regular languages with a *single*
> regular expression (multiple expressions or expressions with alternation
> in a loop are usually fine; this can be used for building efficient
> parsers), even though Python’s regular expressions, which are not an
> exception there,
> are not exactly “regular” in the theoretical computer science sense. See
> the Chomsky hierarchy and Jeffrey E. F. Friedl’s insightful textbook
> “Mastering Regular Expressions”.
And for matching matching braces (sic!) with regular expressions, you need a
recursive one (which is another extension of regular expressions as they are
discussed in CS). Or a parser in the first place. Otherwise you match too
much with greedy matching
{ { } } { { } }
^-------------^
or too little with non-greedy matching
{ { } } { { } }
^---^
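
The two cases can be reproduced with Python's re module; a minimal sketch
using the sample string from the diagrams:

```python
import re

s = '{ { } } { { } }'

# Greedy: ".*" runs to the last closing brace in the string.
greedy = re.search(r'\{.*\}', s)

# Non-greedy: ".*?" stops at the first closing brace, which belongs
# to the inner pair, not to the opening brace that started the match.
lazy = re.search(r'\{.*?\}', s)

print(greedy.group())
print(lazy.group())
```

Neither result is the balanced group '{ { } }' that one would want here.
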
CS regular expressions can be used to describe *regular* languages (Chomsky
type 3). Bracket languages are, in general, not regular (see “pumping lemma
for regular languages”), so for them you need a PDA¹-like extension of CS
regular expressions (the aforementioned recursive ones), or a PDA
implementation in the first place. Such a PDA implementation is part of a
parser.
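
As a rough sketch of the parser approach, and not a complete dhcpd.conf
parser: with only one kind of bracket, the PDA's stack can be reduced to a
depth counter. The function names and the sample configuration below are
hypothetical, and the code assumes that braces never occur inside quoted
strings or comments:

```python
import re


def find_matching_brace(text, open_index):
    """Return the index of the '}' closing the '{' at open_index, or None."""
    depth = 0
    for i in range(open_index, len(text)):
        if text[i] == '{':
            depth += 1          # push
        elif text[i] == '}':
            depth -= 1          # pop
            if depth == 0:
                return i
    return None                 # unbalanced input


def extract_block(config, name):
    """Return the body of the named shared-network block, or None."""
    # A regular expression is fine for locating the (regular) header.
    header = re.search(
        r'^shared-network\s+' + re.escape(name) + r'\s*\{',
        config, re.MULTILINE)
    if header is None:
        return None
    open_index = header.end() - 1
    close_index = find_matching_brace(config, open_index)
    if close_index is None:
        return None
    return config[open_index + 1:close_index]


sample = (
    'shared-network My-Network-MOHE {\n'
    '  subnet 10.0.0.0 netmask 255.255.255.0 {\n'
    '    range 10.0.0.10 10.0.0.20;\n'
    '  }\n'
    '}\n'
    'shared-network Other {\n'
    '}\n'
)
block = extract_block(sample, 'My-Network-MOHE')
print(block)
```

The regular expression only finds the header; the counting loop does the
part that a regular expression without recursion cannot do.
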
____
¹ <https://en.wikipedia.org/wiki/Pushdown_automaton>
--
PointedEars
Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.