advanced regex, was: Re: scanf style parsing
George Demmy
gdemmy at layton-graphics.com
Thu Oct 4 12:54:05 EDT 2001
hpj at urpla.net (Hans-Peter Jansen) writes:
> Well, yesterday, I tried to parse some simple hexdump, produced by
> tcpdump -xs1500 port 80. The idea was, filter the hexcodes, and display
> and 7 bit acsii codes like a little advanced hex monitors do.
>
> As I'm fairly new to advanced regex constructs, would somebody enlight
> me, how to efficiently parse lines like:
>
> 2067 726f 7570 732e 2e2e 3c2f 613e 3c2f
> 666f 6e74 3e3c 2f74 643e 3c2f 7472 3e3c
> 7472 3e3c 7464 2062 6763 6f6c 6f72 3d23
> 6666 6363 3333 2063 6f6c 7370 616e 3d34
> 3e3c 494d 4720 6865 6967 6874 3d31 2073
> 7263 3d22 2f69 6d61 6765 732f 636c 6561
> 7264 6f74 2e67 6966 2220 7769 6474 683d
> 3120 3e3c 2f74 643e 3c2f 7472 3e3c 2f74
> 6162 6c65 3e3c 703e 3c66 6f6e 7420 7369
> 7a65 3d2d 313e 4172 6520 796f 7520 6120
>
> with respect to varying column numbers. I will refrain to
> show my stupid beginnings, but I wasn't able to get that _one_
> regex right, with all columns in matchobj.groups() listed.
>
> new-in-regexing-ly, yr's
> Hans-Peter
>
> P.S.: I ended up in a "simple" c based filter...
> Please CC me
Hi Hans-Peter,
You're asking how to use a regex to parse your hexdump, with an eye
towards displaying the ascii representation. I'm not sure a regex is
what you want for the latter. Here is some example code that might
help.
import re
hexpat = re.compile ('[a-f0-9]{4}')
# your first line of the hexdump, stripped
line = '2067 726f 7570 732e 2e2e 3c2f 613e 3c2f'
hexpat.search (line).span ()
-> (0, 4)
# the rest of the line starts with a space, so the next match begins at 1
hexpat.search (line[4:]).span ()
-> (1, 5)
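For the original question of getting every column no matter how many a line holds, one option is to let findall collect the matches instead of writing one pattern with a group per column. This is only a minimal sketch, written for Python 3 (an assumption, the other snippets in this message use the older print statement); hex_columns and HEXWORD are hypothetical names chosen for illustration.

```python
import re

# One 4-hex-digit word; the \b anchors keep longer runs from being split up
HEXWORD = re.compile(r'\b[0-9a-f]{4}\b')

def hex_columns(line):
    """Return every 4-digit hex column on the line, however many there are."""
    return HEXWORD.findall(line)

print(hex_columns('2067 726f 7570 732e 2e2e 3c2f 613e 3c2f'))
print(hex_columns('7a65 3d2d 313e'))
```

Because findall returns a plain list, a short last line of the dump needs no special handling.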
As to getting your ascii...
import operator
def hex2ascii (hexstr):
    """hex2ascii (hexstr) -> ascii rep of 4 character hex string"""
    # error checking here, please!
    return chr (int (hexstr[:2], 16)) + chr (int (hexstr[2:], 16))
# slurp your hexdump by line (your example is stored in hexdat, by line)
# stripping off the leading whitespace
hexdat = map (lambda x: x.strip (), open ("dumpfile").readlines ())
for i in hexdat:
    print i, reduce (operator.add, map (hex2ascii, i.split ()))
->
2067 726f 7570 732e 2e2e 3c2f 613e 3c2f groups...</a></
666f 6e74 3e3c 2f74 643e 3c2f 7472 3e3c font></td></tr><
7472 3e3c 7464 2062 6763 6f6c 6f72 3d23 tr><td bgcolor=#
6666 6363 3333 2063 6f6c 7370 616e 3d34 ffcc33 colspan=4
3e3c 494d 4720 6865 6967 6874 3d31 2073 ><IMG height=1 s
7263 3d22 2f69 6d61 6765 732f 636c 6561 rc="/images/clea
7264 6f74 2e67 6966 2220 7769 6474 683d rdot.gif" width=
3120 3e3c 2f74 643e 3c2f 7472 3e3c 2f74 1 ></td></tr></t
6162 6c65 3e3c 703e 3c66 6f6e 7420 7369 able><p><font si
7a65 3d2d 313e 4172 6520 796f 7520 6120 ze=-1>Are you a
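The sample above happens to contain only printable characters. Real packet data will not, so a hex-monitor style display probably wants to show just the printable 7-bit ASCII range and substitute something else for the rest. A minimal sketch of that idea, assuming Python 3 and its bytes.fromhex; the name hexline_to_ascii and the '.' placeholder are illustrative choices, not anything from the original filter.

```python
def hexline_to_ascii(line, placeholder='.'):
    """Decode a line of hex columns, showing only printable 7-bit ASCII."""
    # Join the columns so a trailing 2-digit column (odd byte count) also works
    digits = ''.join(line.split())
    data = bytes.fromhex(digits)  # raises ValueError on malformed input
    # Codes 32..126 are the printable ASCII range; everything else is masked
    return ''.join(chr(b) if 32 <= b < 127 else placeholder for b in data)

print(hexline_to_ascii('2067 726f 7570 732e 2e2e 3c2f 613e 3c2f'))
```

Letting the ValueError propagate is one way to cover the "error checking here, please!" case; a filter reading live tcpdump output might prefer to skip such lines instead.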
Hope this helps, and critique most welcome...
G
--
George Demmy
Layton Graphics, Inc
Marietta, Georgia