How to find <tag> to </tag> HTML strings and 'save' them?
John Nagle
nagle at animats.com
Sun Mar 25 20:27:45 EDT 2007
mark at agtechnical.co.uk wrote:
> Great, thanks so much for posting that. It's worked a treat and I'm
> getting HTML files with the list of h2 tags I was looking for. Here's
> the code just to share, what a relief :) :
> ...............................
> from BeautifulSoup import BeautifulSoup
> import re
>
> page = open("soup_test/tomatoandcream.html", 'r')
> soup = BeautifulSoup(page)
>
> myTagSearch = str(soup.findAll('h2'))
>
> myFile = open('Soup_Results.html', 'w')
> myFile.write(myTagSearch)
> myFile.close()
>
> del myTagSearch
> ...............................
>
> I do have two other small queries that I wonder if anyone can help
> with.
>
> Firstly, I'm getting the following character: "[" at the start, "]" at
> the end of the code. Along with "," in between each tag line listing.
> This seems like normal behaviour but I can't find the way to strip
> them out.
Ah. findAll returns a list, so str() on it gives you the list's repr,
brackets and commas included. What you want is more like this:
from BeautifulSoup import BeautifulSoup

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)
htags = soup.findAll({'h2':True, 'H2' : True}) # get all H2 tags, both cases
myFile = open('Soup_Results.html', 'w')
for htag in htags : # for each H2 tag
    texts = htag.findAll(text=True) # find all text items within this h2
    s = ' '.join(texts).strip() + '\n' # combine text items into clean string
    myFile.write(s) # write each text from an H2 element on a line.
myFile.close()
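
If BeautifulSoup isn't available, roughly the same idea can be sketched with the standard library's html.parser on Python 3. This is a different technique from the code above, not a drop-in replacement; the class name H2TextCollector, the function h2_texts, and the sample HTML string are made up for illustration.

```python
from html.parser import HTMLParser


class H2TextCollector(HTMLParser):
    """Collect the text inside each <h2> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.parts = []      # text fragments of the h2 currently open
        self.headings = []   # one cleaned string per finished h2

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <H2> and <h2> both match
        if tag == 'h2':
            self.in_h2 = True
            self.parts = []

    def handle_data(self, data):
        # keep text only while inside an h2, including nested tags' text
        if self.in_h2:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'h2' and self.in_h2:
            # join the fragments and collapse runs of whitespace
            self.headings.append(' '.join(''.join(self.parts).split()))
            self.in_h2 = False


def h2_texts(html):
    """Return a list with the text of every h2 in the given HTML string."""
    collector = H2TextCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


# Made-up sample input, standing in for the contents of the real file.
sample = ("<html><body><h2>Tomato <b>and</b> cream</h2>"
          "<p>x</p><H2>Method</H2></body></html>")
headings = h2_texts(sample)

as_repr = str(headings)                 # list repr: brackets, quotes, commas
as_lines = '\n'.join(headings) + '\n'   # one heading per line, no list syntax
```

Writing as_lines to the output file instead of as_repr is what gets rid of the "[", "]" and "," characters.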
John Nagle