Python 3 regex woes (parsing ISC DHCPD config)

Thomas 'PointedEars' Lahn PointedEars at web.de
Tue Jan 13 06:43:07 EST 2015


Jason Bailey wrote:

> My script first reads the DHCPD configuration file into memory -
> variable "filebody". It then utilizes the re module to find the
> configuration details for the wanted "shared network".
> 
> The config file might look something like this:
> 
> ######################################
> 
> shared-network My-Network-MOHE {
>    subnet 192.168.0.0 netmask 255.255.248.0 {
>      option routers 192.168.0.1;
>      option tftp-server-name "192.168.90.12";
>      pool {
>        deny dynamic bootp clients;
>        range 192.168.0.20 192.168.7.254;
>      }
>    }
> }
> 
> shared-network My-Network-CDCO {
>    subnet 192.168.8.0 netmask 255.255.248.0 {
>      option routers 10.101.8.1;
>      option tftp-server-name "192.168.90.12";
>      pool {
>        deny dynamic bootp clients;
>        range 192.168.8.20 192.168.15.254;
>      }
>    }
> }
> 
> shared-network My-Network-FECO {
>    subnet 192.168.16.0 netmask 255.255.248.0 {
>      option routers 192.168.16.1;
>      option tftp-server-name "192.168.90.12";
>      pool {
>        deny dynamic bootp clients;
>        range 192.168.16.20 192.168.23.254;
>      }
>    }
> }
> 
> ######################################
> 
> Suppose I'm trying to grab the shared network called "My-Network-FECO"
> from the above config file stored in the variable 'filebody'.
> 
> First I have my variable 'shared_network' which contains the string
> "My-Network-FECO".
> 
> I compile my regex:
> m = re.compile(r"^(shared\-network (" + re.escape(shared_network) + r")
> \{((\n|.|\r\n)*?)(^\}))", re.MULTILINE|re.UNICODE)

This code does not run as posted.  Applying Occam’s Razor, I think you meant 
to post

m = re.compile(r"^(shared\-network ("
  + re.escape(shared_network)
  + r") \{((\n|.|\r\n)*?)(^\}))", re.MULTILINE|re.UNICODE)

(If you post long lines, know where your automatic word wrap happens.)

> I search for regex matches in my config file:
> m.search(filebody)

I find using the identifier “m” for the expression very strange.  Usually I 
reserve “m” to hold the *matches* for an expression on a string.
Consider “r” or “rx” or something else instead of “m” for the expression.

> Unfortunately, I get no matches. From output on the command line, I can
> see that Python is adding extra backslashes to my re.compile string. I
> have added the raw 'r' in front of the strings to prevent it, but to no
> avail.

Python is adding the extra backslashes because you used “r”.  Note that the 
console-printed string representations of strings do not have an “r” in 
front of them.  What you see is what you would have needed to write for 
equivalent code if you had not used “r”.  (Different from some other 
languages, Python does not distinguish between single-quoted and double-
quoted strings with regard to parsing.  Hence the r'…' feature, the triple-
quoted string, and the .format() method.)

You get no matches because you have escaped the HYPHEN-MINUSes (“-”).  You 
never need to escape those characters, in fact you must not do that here 
because r'\-' is not an (unnecessarily) escaped HYPHEN-MINUS, it is a 
literal backslash followed by a HYPHEN-MINUS, a character sequence that does 
not occur in your string.  Outside of a character class you do not need to 
do that, and in a character class you can put it as first or last character 
instead (“[-…]” or “[…-]”).

You have escaped the first HYPHEN-MINUS; re.escape() has escaped the other 
two for you:

| >>> re.escape('-')
| '\\-'

I presume this behavior is because of character classes, and the idea that 
the return value should work at any position in a character class.

ISTM that you cannot use re.escape() here, and you must escape special 
characters yourself (using re.sub()), should they be possible in the file.

I do not see a reason for making the entire expression a group (but for 
making the network name a group).  

You should refrain from parsing non-regular languages with a *single* 
regular expression (multiple expressions or expressions with alternation in 
a loop are usually fine; this can be used for building efficient parsers), 
even though Python’s regular expressions, which are not an exception there, 
are not exactly “regular” in the theoretical computer science sense.  See 
the Chomsky hierarchy and Jeffrey E. F. Friedl’s insightful textbook 
“Mastering Regular Expressions”.

It is possible that there is a Python module for parsing ISC dhcpd 
configuration files already.  If so, you should use that instead.

-- 
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.



More information about the Python-list mailing list