advanced regex, was: Re: scanf style parsing

Thu Oct 4 12:54:05 EDT 2001

hpj at urpla.net (Hans-Peter Jansen) writes:
> Well, yesterday, I tried to parse some simple hexdump, produced by
> tcpdump -xs1500 port 80. The idea was, filter the hexcodes, and display
> the 7 bit ascii codes like slightly more advanced hex monitors do.
> 
> As I'm fairly new to advanced regex constructs, would somebody enlighten
> me on how to efficiently parse lines like:
> 
>                  2067 726f 7570 732e 2e2e 3c2f 613e 3c2f
>                  666f 6e74 3e3c 2f74 643e 3c2f 7472 3e3c
>                  7472 3e3c 7464 2062 6763 6f6c 6f72 3d23
>                  6666 6363 3333 2063 6f6c 7370 616e 3d34
>                  3e3c 494d 4720 6865 6967 6874 3d31 2073
>                  7263 3d22 2f69 6d61 6765 732f 636c 6561
>                  7264 6f74 2e67 6966 2220 7769 6474 683d
>                  3120 3e3c 2f74 643e 3c2f 7472 3e3c 2f74
>                  6162 6c65 3e3c 703e 3c66 6f6e 7420 7369
>                  7a65 3d2d 313e 4172 6520 796f 7520 6120
> 
> with respect to varying column numbers. I will refrain from
> showing my stupid beginnings, but I wasn't able to get that _one_
> regex right, with all columns in matchobj.groups() listed.
> 
> new-in-regexing-ly, yr's
> Hans-Peter
> 
> P.S.: I ended up in a "simple" c based filter...
> Please CC me

Hi Hans-Peter,

You're asking how to use a regex to parse your hexdump, with an eye
towards displaying the ascii representation. I'm not sure a regex is
the right tool for the latter. Here is some example code that might
help.

import re

hexpat = re.compile ('[a-f0-9]{4}')

# your first line of the hexdump, stripped

line = '2067 726f 7570 732e 2e2e 3c2f 613e 3c2f'
hexpat.search (line).span ()

-> (0, 4)

hexpat.search (line[4:]).span ()

-> (1, 5)
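
On getting every column at once: as far as I know, a repeated group in
a single pattern only keeps its last match, so matchobj.groups () won't
list all the columns. A rough sketch using findall instead, which copes
with a varying number of columns per line. split_columns is just a name
I made up for illustration, and it assumes every group is exactly four
hex digits, as in your sample.

```python
import re

# same pattern as above: one column is four lowercase hex digits
hexpat = re.compile(r'[a-f0-9]{4}')

def split_columns(line):
    """Return every 4-digit hex group on the line, in order.

    Hypothetical helper for illustration; leading whitespace is
    simply skipped because findall only returns what matched.
    """
    return hexpat.findall(line)

line = '                 2067 726f 7570 732e 2e2e 3c2f 613e 3c2f'
cols = split_columns(line)
print(cols)
print(len(cols))
```

A shorter line just gives a shorter list, so there is no need to know
the column count up front.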

As for getting your ascii...

def hex2ascii (hexstr):
    """hex2ascii (hexstr) -> ascii rep of 4 character hex string"""
    # error checking here, please!
    return chr (int (hexstr[:2], 16)) + chr (int (hexstr[2:], 16))

# slurp your hexdump by line (your example is stored in hexdat, by line)
# stripping off the leading whitespace

hexdat = [x.strip () for x in open ("dumpfile").readlines ()]

for i in hexdat:
    # decode each column and glue the pieces together with "".join
    print ("%s %s" % (i, "".join ([hex2ascii (h) for h in i.split ()])))

->
2067 726f 7570 732e 2e2e 3c2f 613e 3c2f  groups...</a></
666f 6e74 3e3c 2f74 643e 3c2f 7472 3e3c font></td></tr><
7472 3e3c 7464 2062 6763 6f6c 6f72 3d23 tr><td bgcolor=#
6666 6363 3333 2063 6f6c 7370 616e 3d34 ffcc33 colspan=4
3e3c 494d 4720 6865 6967 6874 3d31 2073 ><IMG height=1 s
7263 3d22 2f69 6d61 6765 732f 636c 6561 rc="/images/clea
7264 6f74 2e67 6966 2220 7769 6474 683d rdot.gif" width=
3120 3e3c 2f74 643e 3c2f 7472 3e3c 2f74 1 ></td></tr></t
6162 6c65 3e3c 703e 3c66 6f6e 7420 7369 able><p><font si
7a65 3d2d 313e 4172 6520 796f 7520 6120 ze=-1>Are you a 
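
Since you mentioned showing only the 7 bit ascii codes the way hex
monitors do, here is a rough sketch of the error checking I skipped
above. The names hex2printable and dump_line are made up for
illustration, and I'm assuming you want a '.' for anything outside
the printable range (space through tilde) and that a trailing column
may hold a single byte.

```python
def hex2printable(hexstr):
    """Decode pairs of hex digits, showing non-printable bytes as '.'."""
    if len(hexstr) % 2:
        raise ValueError("odd number of hex digits: %r" % hexstr)
    chars = []
    for pos in range(0, len(hexstr), 2):
        code = int(hexstr[pos:pos + 2], 16)
        # keep printable 7 bit ascii (space through tilde), dot for the rest
        if 32 <= code < 127:
            chars.append(chr(code))
        else:
            chars.append('.')
    return ''.join(chars)

def dump_line(line):
    """Return the stripped hex columns followed by their ascii rendering."""
    cols = line.split()
    text = ''.join([hex2printable(c) for c in cols])
    return "%s  %s" % (' '.join(cols), text)

print(dump_line('   7472 3e3c 7464 2062 6763 6f6c 6f72 3d23'))
print(dump_line('   0d0a 00ff 41'))
```

The first call should give the same text as the third line of the
output above; the second shows control and 8 bit bytes turning into
dots.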

Hope this helps, and critique most welcome...

G
-- 
George Demmy
Layton Graphics, Inc
Marietta, Georgia