advanced regex, was: Re: scanf style parsing

Skip Montanaro skip at pobox.com
Thu Oct 4 22:14:41 EDT 2001


    Hans-Peter> Well, yesterday, I tried to parse a simple hexdump
    Hans-Peter> produced by tcpdump -xs1500 port 80. The idea was to filter
    Hans-Peter> the hex codes and display any 7-bit ASCII codes, like
    Hans-Peter> slightly more advanced hex monitors do.

    Hans-Peter> As I'm fairly new to advanced regex constructs, would
    Hans-Peter> somebody enlighten me on how to efficiently parse lines like:

    Hans-Peter>    2067 726f 7570 732e 2e2e 3c2f 613e 3c2f
    Hans-Peter>    666f 6e74 3e3c 2f74 643e 3c2f 7472 3e3c
    Hans-Peter>    7472 3e3c 7464 2062 6763 6f6c 6f72 3d23
    Hans-Peter>    6666 6363 3333 2063 6f6c 7370 616e 3d34
    Hans-Peter>    3e3c 494d 4720 6865 6967 6874 3d31 2073
    Hans-Peter>    7263 3d22 2f69 6d61 6765 732f 636c 6561
    Hans-Peter>    7264 6f74 2e67 6966 2220 7769 6474 683d
    Hans-Peter>    3120 3e3c 2f74 643e 3c2f 7472 3e3c 2f74
    Hans-Peter>    6162 6c65 3e3c 703e 3c66 6f6e 7420 7369
    Hans-Peter>    7a65 3d2d 313e 4172 6520 796f 7520 6120

    Hans-Peter> with respect to varying column numbers. I will refrain from
    Hans-Peter> showing my stupid beginnings, but I wasn't able to get that
    Hans-Peter> _one_ regex right, with all columns listed in
    Hans-Peter> matchobj.groups().

I'm not sure quite what you're looking for, but this data is so regular I
wouldn't use regular expressions to parse it (no pun intended).
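
If a regular expression is still wanted, one catch is worth knowing: in
Python's re module a repeated capturing group keeps only its last match, so
a single pattern cannot list a varying number of columns in
matchobj.groups(). A minimal sketch of the usual workaround follows, using
findall to collect every hex field on a line. The names HEX_FIELD and
hex_fields are made up for this illustration.

```python
import re

# One run of hex digits, e.g. "2067" (illustrative pattern, not from the post).
HEX_FIELD = re.compile(r"[0-9a-fA-F]+")

def hex_fields(line):
    """Return every hex field on the line, however many columns it has."""
    return HEX_FIELD.findall(line)

line = "   2067 726f 7570 732e"
print(hex_fields(line))

# For contrast: a repeated group only remembers its final repetition.
m = re.match(r"\s*(?:([0-9a-f]{4})\s*)+$", line)
print(m.groups())
```

The second print shows only the last column, which is probably why the
single-regex attempt never listed all of them.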

Assuming the above stream is coming in on stdin and I wanted to display
any printable ASCII characters, I'd start with something like this:

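    import sys

    for line in sys.stdin:
        line = line.strip()
        fields = line.split()
        printing = []
        for pair in fields:
            # Each field holds two bytes as four hex digits, e.g. "2067".
            first = chr(int(pair[:2], 16))
            second = chr(int(pair[2:], 16))
            # Replace anything outside printable ASCII (" " to "~") with ".".
            if first < " " or first > "~":
                first = "."
            if second < " " or second > "~":
                second = "."
            printing.extend([first, second])
        # print() as a function, so this also runs on Python 3.
        print(line, "".join(printing))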

The above hex data fed to this code produces

    2067 726f 7570 732e 2e2e 3c2f 613e 3c2f  groups...</a></
    666f 6e74 3e3c 2f74 643e 3c2f 7472 3e3c font></td></tr><
    7472 3e3c 7464 2062 6763 6f6c 6f72 3d23 tr><td bgcolor=#
    6666 6363 3333 2063 6f6c 7370 616e 3d34 ffcc33 colspan=4
    3e3c 494d 4720 6865 6967 6874 3d31 2073 ><IMG height=1 s
    7263 3d22 2f69 6d61 6765 732f 636c 6561 rc="/images/clea
    7264 6f74 2e67 6966 2220 7769 6474 683d rdot.gif" width=
    3120 3e3c 2f74 643e 3c2f 7472 3e3c 2f74 1 ></td></tr></t
    6162 6c65 3e3c 703e 3c66 6f6e 7420 7369 able><p><font si
    7a65 3d2d 313e 4172 6520 796f 7520 6120 ze=-1>Are you a 

on stdout.
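
The same idea can be sketched more compactly in Python 3, where
bytes.fromhex does the hex decoding. Joining the fields first means a line
that ends in a two-digit field (a single byte) is handled as well. The
function name printable is invented for this example, and the sample line is
the first line of the data above.

```python
def printable(line):
    """Decode a line of hex fields; show non-printing bytes as '.'."""
    # Drop all whitespace so the line becomes one long run of hex digits.
    data = bytes.fromhex("".join(line.split()))
    # Keep printable ASCII (0x20 to 0x7E); replace everything else.
    return "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in data)

sample = "2067 726f 7570 732e 2e2e 3c2f 613e 3c2f"
print(sample, printable(sample))
```

Note that bytes.fromhex raises ValueError on anything that is not a pair of
hex digits, so lines carrying other text would need filtering first.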

-- 
Skip Montanaro (skip at pobox.com)
http://www.mojam.com/
http://www.musi-cal.com/




More information about the Python-list mailing list