advanced regex, was: Re: scanf style parsing
Skip Montanaro
skip at pobox.com
Thu Oct 4 22:14:41 EDT 2001
Hans-Peter> Well, yesterday I tried to parse a simple hexdump
Hans-Peter> produced by tcpdump -xs1500 port 80. The idea was to filter
Hans-Peter> the hex codes and display any 7-bit ASCII codes, as
Hans-Peter> slightly more advanced hex monitors do.
Hans-Peter> As I'm fairly new to advanced regex constructs, would
Hans-Peter> somebody enlighten me on how to efficiently parse lines like:
Hans-Peter> 2067 726f 7570 732e 2e2e 3c2f 613e 3c2f
Hans-Peter> 666f 6e74 3e3c 2f74 643e 3c2f 7472 3e3c
Hans-Peter> 7472 3e3c 7464 2062 6763 6f6c 6f72 3d23
Hans-Peter> 6666 6363 3333 2063 6f6c 7370 616e 3d34
Hans-Peter> 3e3c 494d 4720 6865 6967 6874 3d31 2073
Hans-Peter> 7263 3d22 2f69 6d61 6765 732f 636c 6561
Hans-Peter> 7264 6f74 2e67 6966 2220 7769 6474 683d
Hans-Peter> 3120 3e3c 2f74 643e 3c2f 7472 3e3c 2f74
Hans-Peter> 6162 6c65 3e3c 703e 3c66 6f6e 7420 7369
Hans-Peter> 7a65 3d2d 313e 4172 6520 796f 7520 6120
Hans-Peter> with respect to varying column numbers? I will refrain from
Hans-Peter> showing my stupid beginnings, but I wasn't able to get that
Hans-Peter> _one_ regex right, with all columns listed in
Hans-Peter> matchobj.groups().
I'm not sure quite what you're looking for, but this data is so regular I
wouldn't use regular expressions to parse it (no pun intended).
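If a regex is still wanted, the usual stumbling block with groups() is that a repeated group only keeps its last match, so one pattern cannot list a varying number of columns that way. re.findall avoids the problem by returning every match. The following is only a sketch, written for Python 3; the names HEX_GROUP and hex_groups are made up for illustration and do not come from the original question.

```python
import re

# Hypothetical helper names, chosen for this sketch only.
# One group of exactly four hex digits, bounded by word boundaries.
HEX_GROUP = re.compile(r"\b[0-9a-fA-F]{4}\b")

def hex_groups(line):
    """Return every four-digit hex group on the line, however many there are."""
    return HEX_GROUP.findall(line)

line = "7a65 3d2d 313e"

# A repeated capturing group remembers only its final repetition ...
repeated = re.match(r"(?:([0-9a-f]{4})\s*)+", line)
print(repeated.groups())

# ... whereas findall lists all of the columns.
print(hex_groups(line))
```

Because findall does not care how many groups a line contains, the same call works for short final lines as well as full eight-column ones.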
Assuming the above stream is coming in on stdin and I wanted to display
any printable ASCII characters, I'd start with something like this:
    import sys

    for line in sys.stdin.readlines():
        line = line.strip()
        fields = line.split()
        printing = []
        for pair in fields:
            # each field holds two bytes as four hex digits
            first = chr(int(pair[:2], 16))
            second = chr(int(pair[2:], 16))
            # replace anything outside printable ASCII with "."
            if first < " " or first > "~":
                first = "."
            if second < " " or second > "~":
                second = "."
            printing.extend([first, second])
        print("%s %s" % (line, "".join(printing)))
The above hex data fed to this code produces
2067 726f 7570 732e 2e2e 3c2f 613e 3c2f  groups...</a></
666f 6e74 3e3c 2f74 643e 3c2f 7472 3e3c font></td></tr><
7472 3e3c 7464 2062 6763 6f6c 6f72 3d23 tr><td bgcolor=#
6666 6363 3333 2063 6f6c 7370 616e 3d34 ffcc33 colspan=4
3e3c 494d 4720 6865 6967 6874 3d31 2073 ><IMG height=1 s
7263 3d22 2f69 6d61 6765 732f 636c 6561 rc="/images/clea
7264 6f74 2e67 6966 2220 7769 6474 683d rdot.gif" width=
3120 3e3c 2f74 643e 3c2f 7472 3e3c 2f74 1 ></td></tr></t
6162 6c65 3e3c 703e 3c66 6f6e 7420 7369 able><p><font si
7a65 3d2d 313e 4172 6520 796f 7520 6120 ze=-1>Are you a
on stdout.
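The loop above assumes every field is exactly four hex digits. If a line ever ends in a lone two-digit group (which I'd expect when a packet has an odd number of bytes), int("", 16) would raise ValueError. A more forgiving variant, sketched here for Python 3 with the made-up function name printable, decodes the digits a byte at a time using bytes.fromhex:

```python
def printable(line):
    """Return the printable-ASCII rendering of one line of hex groups.

    Hypothetical helper for illustration; bytes outside the range
    0x20..0x7e (space through "~") are shown as ".".
    """
    # Drop the whitespace between groups, then decode two digits per byte.
    data = bytes.fromhex("".join(line.split()))
    return "".join(chr(b) if 0x20 <= b <= 0x7e else "." for b in data)

sample = "2067 726f 7570 732e 2e2e 3c2f 613e 3c2f"
print("%s %s" % (sample, printable(sample)))
```

This still raises ValueError on non-hex input or an odd number of digits, which seems reasonable for data this regular.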
--
Skip Montanaro (skip at pobox.com)
http://www.mojam.com/
http://www.musi-cal.com/