Parsing html
William Park
opengeometry at yahoo.ca
Thu Jul 8 15:22:18 EDT 2004
C Gillespie <csgcsg39 at hotmail.com> wrote:
> Dear All,
>
> I have hopefully a very simple problem. I wish to parse an html page and
> extract everything between the <body> tags.
>
> E.g.
> <head>
> <body>
> <b>afsdf</b>
> </body>
> </head>
>
> Would give
> <body>
> <b>afsdf</b>
> </body>
>
> I've been playing about with htmllib with no success. Any suggestions?
>
> Thanks
>
> Colin
1. Take a look at
http://freshmeat.net/projects/bashdiff/
and if you want to give it a try, I'll give you some pointers.
Essentially,
x=()
array -p '<body>' -q '</body>' x "..."
2. In Python, read the whole thing as a string. Delete everything before
'<body>' and everything after '</body>'.
3. Use your editor. :-)
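
For option 2, something along these lines should do. It is only a rough sketch: it assumes the tags appear literally as '<body>' and '</body>' (lowercase, no attributes), and the name extract_body is made up for illustration.

```python
def extract_body(html):
    """Return the text from '<body>' through '</body>', or None if either tag is missing."""
    # Assumption: tags are written exactly as '<body>' and '</body>'.
    start = html.find("<body>")
    if start == -1:
        return None
    # Look for the closing tag only after the opening one.
    end = html.find("</body>", start)
    if end == -1:
        return None
    # Keep the tags themselves, drop everything before and after.
    return html[start:end + len("</body>")]


page = "<head>\n<body>\n<b>afsdf</b>\n</body>\n</head>\n"
print(extract_body(page))
```

If the real page has '<BODY>' or '<body bgcolor=...>', the literal search will miss it, so the find() calls would need adjusting for that.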
--
William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
Toronto, Ontario, Canada
More information about the Python-list mailing list