How to find <tag> to </tag> HTML strings and 'save' them?

John Nagle nagle at animats.com
Sun Mar 25 20:27:45 EDT 2007


mark at agtechnical.co.uk wrote:
> Great, thanks so much for posting that. It's worked a treat and I'm
> getting HTML files with the list of h2 tags I was looking for. Here's
> the code just to share, what a relief :)   :
> ...............................
> from BeautifulSoup import BeautifulSoup
> import re
> 
> page = open("soup_test/tomatoandcream.html", 'r')
> soup = BeautifulSoup(page)
> 
> myTagSearch = str(soup.findAll('h2'))
> 
> myFile = open('Soup_Results.html', 'w')
> myFile.write(myTagSearch)
> myFile.close()
> 
> del myTagSearch
> ...............................
> 
> I do have two other small queries that I wonder if anyone can help
> with.
> 
> Firstly, I'm getting the following characters: "[" at the start and
> "]" at the end of the code, along with "," in between each tag line
> listing. This seems like normal behaviour, but I can't find a way to
> strip them out.

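Ah.  findAll() returns a list, and str() on a list gives you Python's
list representation, which is where the brackets and commas come from.
What you want is more like this: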

from BeautifulSoup import BeautifulSoup

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)
htags = soup.findAll({'h2': True, 'H2': True})  # get all H2 tags, both cases

myFile = open('Soup_Results.html', 'w')

for htag in htags:  # for each H2 tag
    texts = htag.findAll(text=True)  # find all text items within this h2
    s = ' '.join(texts).strip() + '\n'  # combine text items into clean string
    myFile.write(s)  # write the text of each H2 element on its own line

myFile.close()
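
If you would rather keep the h2 markup itself, as in your original
script, the same idea applies: join the items yourself instead of
calling str() on the whole list. A minimal sketch follows. The plain
strings are made-up stand-ins for the Tag objects findAll() returns,
used so that it runs without BeautifulSoup installed, and the heading
text is invented for illustration.

```python
# Hypothetical stand-ins for the Tag objects that findAll('h2') returns.
htags = ['<h2>Tomato soup</h2>', '<h2>Cream of tomato</h2>']

# str() on a list gives the list's representation, with "[" and "]"
# around it and "," between the items.
as_list = str(htags)

# Joining the items yourself gives one tag per line and no list punctuation.
# With real Tag objects, convert each one first:
#     '\n'.join(str(t) for t in htags)
as_lines = '\n'.join(htags) + '\n'

print(as_list)
print(as_lines)
```

Writing as_lines to the file in place of str(soup.findAll('h2'))
should give you the tags without the surrounding list punctuation.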

				John Nagle


