How to extract a part of html file

Thu Oct 20 09:47:37 EDT 2005

Ben Finney <bignose+hates-spam at benfinney.id.au> writes:

> Joe <dinamo99 at lycos.com> wrote:
>> I'm trying to extract part of html code from a tag to a tag
> For tag soup, use BeautifulSoup:
>     <URL:http://www.crummy.com/software/BeautifulSoup/>

Except he's trying to extract an apparently random part of the
file. BeautifulSoup is a wonderful thing for dealing with X/HTML
documents as structured documents, which is how you want to deal with
them most of the time.

In this case, a regular expression works nicely:

>>> import re
>>> s = '<span class="boldyellow"><B><U>  and ends with TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'
>>> # re.search finds the span anywhere in the string; re.match only matches at the very start
>>> r = re.search(r'<span class="boldyellow"><B><U>(.*?)TD><TD> <img src="http://whatever/some\.gif"> </TD></TR></TABLE>', s)
>>> r.group(1)
'  and ends with '
>>> 
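
If you want the regular-expression version as something reusable, here is a rough sketch. The function name extract_with_re and the START/END constants are made up for illustration; the marker strings are the ones from the session above. It assumes the markers are literal text, so it runs them through re.escape, uses re.search so the markers can sit anywhere in the file, and passes re.DOTALL so the captured part may span several lines.

```python
import re

# Marker strings taken from the example above; the names are hypothetical.
START = '<span class="boldyellow"><B><U>'
END = 'TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'


def extract_with_re(text, start=START, end=END):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape treats the markers as literal text, so '.' in "some.gif"
    # matches only a dot. The non-greedy (.*?) stops at the first end marker.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    # re.DOTALL lets '.' match newlines, since real HTML spans many lines.
    m = re.search(pattern, text, re.DOTALL)
    return m.group(1) if m else None


s = START + '  and ends with ' + END
print(repr(extract_with_re(s)))
```

When either marker is missing you get None back instead of an AttributeError from calling group() on a failed match.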

The string method find() also works really well:

>>> start = s.find('<span class="boldyellow"><B><U>') + len('<span class="boldyellow"><B><U>')
>>> stop = s.find('TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', start)
>>> s[start:stop]
'  and ends with '
>>> 
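
One thing to watch with find(): it returns -1 when the marker is absent, and the slice then quietly gives you the wrong text rather than an error. A minimal sketch that checks for that, with extract_between as a made-up helper name and the markers passed in as arguments:

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between the two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        # Start marker not found; without this check the slice would
        # begin at len(start_marker) - 1 and return garbage.
        return None
    start += len(start_marker)
    # Look for the end marker only after the start marker.
    stop = text.find(end_marker, start)
    if stop == -1:
        return None
    return text[start:stop]


s = ('<span class="boldyellow"><B><U>  and ends with '
     'TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>')
print(repr(extract_between(
    s,
    '<span class="boldyellow"><B><U>',
    'TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>')))
```

Returning None is just one choice; raising ValueError (which is what str.index does for you) would be equally reasonable.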

Not a lot to choose between them.

    <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.