Parsing html

William Park opengeometry at yahoo.ca
Thu Jul 8 15:22:18 EDT 2004


C Gillespie <csgcsg39 at hotmail.com> wrote:
> Dear All,
> 
> I have hopefully a very simple problem. I wish to parse an html page and
> extract everything between the <body> tags.
> 
> E.g.
> <head>
>     <body>
>         <b>afsdf</b>
>     </body>
> </head>
> 
> Would give
> <body>
>     <b>afsdf</b>
> </body>
> 
> I've been playing about with htmllib with no success. Any suggestions?
> 
> Thanks
> 
> Colin

1.  Take a look at
	http://freshmeat.net/projects/bashdiff/
    and if you want to give it a try, I'll give you some pointers.
    Essentially,
	x=()
	array -p '<body>' -q '</body>' x "..."

2.  In Python, read the whole thing as a string.  Delete everything before
    '<body>' and everything after '</body>'.

3.  Use your editor. :-)
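
A rough sketch of option 2, for what it's worth.  The function name
extract_body is made up for illustration, and the code assumes the tags
appear literally as '<body>' and '</body>' (lowercase, no attributes),
as in the example above.  It is plain string slicing, not a real HTML
parser.

```python
def extract_body(html):
    """Return the text from '<body>' through '</body>', inclusive.

    Returns None if either tag is missing.  Assumes lowercase tags
    with no attributes (an assumption, not a general HTML rule).
    """
    open_tag = "<body>"
    close_tag = "</body>"

    # Everything before the first '<body>' gets dropped.
    start = html.find(open_tag)
    if start == -1:
        return None

    # Everything after the last '</body>' gets dropped.
    end = html.rfind(close_tag)
    if end == -1 or end < start:
        return None

    return html[start:end + len(close_tag)]


page = """<head>
    <body>
        <b>afsdf</b>
    </body>
</head>"""

print(extract_body(page))
```

The slice keeps whatever whitespace sits between the two tags, so the
result is indented the same way as the input.  For pages with uppercase
tags or attributes on <body>, the two find calls would need loosening.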

-- 
William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
Toronto, Ontario, Canada


