regex question

Wed Feb 13 09:29:20 EST 2008

On Feb 13, 6:53 am, mathieu <mathieu.malate... at gmail.com> wrote:
> I do not understand what is wrong with the following regex expression.
> I clearly mark that the separator in between group 3 and group 4
> should contain at least 2 white space, but group 3 is actually reading
> 3 +4
>
> Thanks
> -Mathieu
>
> import re
>
> line = "      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings
> Auto Window Width          SL   1 "
> patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
> -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
> $")
<snip>

I love the smell of regexes in the morning!

For more legible posting (and general maintainability), try breaking
up your quoted strings like this:

line = \
"      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  " \
"Auto Window Width          SL   1 "

patt = re.compile(
    r"^\s*"
    r"\("
        r"([0-9A-Z]+),"
        r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")

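The problem is that group 3 is a greedy match, and its character class
includes the space character, so it runs straight past the two-space
separator between "Settings" and "Auto" and gives back only as much as
the rest of the pattern needs.  Make it non-greedy by putting a "?"
after the "+".  Change patt to: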

patt = re.compile(
    r"^\s*"
    r"\("
        r"([0-9A-Z]+),"
        r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")

or if you prefer:

patt = re.compile(r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+?)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")
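
To see the two behaviours side by side, here is a minimal sketch that
runs both quantifiers against the sample line.  The names TEMPLATE,
greedy and lazy are just for illustration, and the %s substitution is
only a shortcut to avoid typing the pattern twice.

```python
import re

line = (
    "      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  "
    "Auto Window Width          SL   1 "
)

# Same pattern twice; the only difference is the quantifier on group 3.
TEMPLATE = (
    r"^\s*"
    r"\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+"
    r"([A-Za-z0-9./:_ -]%s)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$"
)
greedy = re.compile(TEMPLATE % "+")    # original quantifier
lazy = re.compile(TEMPLATE % "+?")     # non-greedy fix

greedy_groups = greedy.match(line).groups()
lazy_groups = lazy.match(line).groups()

# repr() makes any leading or trailing blanks visible.
print("greedy group 3:", repr(greedy_groups[2]))
print("greedy group 4:", repr(greedy_groups[3]))
print("lazy   group 3:", repr(lazy_groups[2]))
print("lazy   group 4:", repr(lazy_groups[3]))
```

Group 4 is still greedy and its character class also allows spaces, so
even with the fix it should come back with trailing blanks attached; a
.strip() on that field takes care of it.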

It looks like you wrote this regex to process this specific input
string - it has a fragile feel to it, as if you will have to go back
and tweak it to handle other data that might come along, such as

      (xx42,xx0A)   Honeywell: Inverse Flitznoid (Kelvin)  80          SL   1
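
A quick check, as a sketch: that line trips the corrected pattern in at
least two places, the lowercase "x" in the first hex field (group 1
only allows [0-9A-Z]) and the parentheses in the description (not in
group 3's character class).  The two spaces before "80" are an
assumption, since the line got wrapped at that point.

```python
import re

patt = re.compile(
    r"^\s*"
    r"\("
        r"([0-9A-Z]+),"
        r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")

# Spacing before "80" is assumed (two blanks, like the other separators).
other = ("      (xx42,xx0A)   Honeywell: Inverse Flitznoid (Kelvin)  "
         "80          SL   1")

result = patt.match(other)
print(result)   # None: the pattern rejects this line outright
```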

Just out of curiosity, I wondered what a pyparsing version of this
would look like.  See below:

from pyparsing import Word,hexnums,delimitedList,printables,\
    White,Regex,nums

line = \
"      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  " \
"Auto Window Width          SL   1 "

# define fields
hexint = Word(hexnums+"x")
text = delimitedList(Word(printables),
                        delim=White(" ",exact=1), combine=True)
type_label = Regex("[A-Z][A-Z]_?O?W?")
int_label = Word(nums+"n-")

# define line structure - give each field a name
line_defn = "(" + hexint("x") + "," + hexint("y") + ")" + \
    text("desc") + text("window") + type_label("type") + \
    int_label("int")

line_parts = line_defn.parseString(line)
print(line_parts.dump())
print(line_parts.desc)

Prints:
['(', '0021', ',', 'xx0A', ')', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window Width', 'SL', '1']
- desc: Siemens: Thorax/Multix FD Lab Settings
- int: 1
- type: SL
- window: Auto Window Width
- x: 0021
- y: xx0A
Siemens: Thorax/Multix FD Lab Settings

I was just guessing on the field names, but you can see where they are
defined and change them to the appropriate values.
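
If you would rather stay with re, named groups give you the same kind
of access by field name.  This is only a sketch: the group names are
the same guesses as above, and the strip() is there because the
"window" group can pick up trailing blanks.

```python
import re

line = (
    "      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  "
    "Auto Window Width          SL   1 "
)

# Same pattern as the non-greedy fix, with (?P<name>...) on each group.
# The names x, y, desc, window, type and int are guesses, not real labels.
patt = re.compile(
    r"^\s*"
    r"\((?P<x>[0-9A-Z]+),(?P<y>[0-9A-Zx]+)\)\s+"
    r"(?P<desc>[A-Za-z0-9./:_ -]+?)\s\s+"
    r"(?P<window>[A-Za-z0-9 ()._,/#>-]+)\s+"
    r"(?P<type>[A-Z][A-Z]_?O?W?)\s+"
    r"(?P<int>[0-9n-]+)\s*$"
)

match = patt.match(line)
# groupdict() maps each group name to its matched text.
fields = {name: value.strip() for name, value in match.groupdict().items()}

for name in sorted(fields):
    print("- %s: %s" % (name, fields[name]))
```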

-- Paul