How to find <tag> to </tag> HTML strings and 'save' them?

mark at agtechnical.co.uk mark at agtechnical.co.uk
Sun Mar 25 18:44:17 EDT 2007


Great, thanks so much for posting that. It's worked a treat and I'm
getting HTML files with the list of h2 tags I was looking for. Here's
the code, just to share. What a relief :)
...............................
from BeautifulSoup import BeautifulSoup
import re

page = open("soup_test/tomatoandcream.html", 'r')
soup = BeautifulSoup(page)

myTagSearch = str(soup.findAll('h2'))

myFile = open('Soup_Results.html', 'w')
myFile.write(myTagSearch)
myFile.close()

del myTagSearch
...............................

I do have two other small queries that I wonder if anyone can help
with.

Firstly, I'm getting the following characters in the output: "[" at the
start, "]" at the end, and "," between each tag listing. This seems
like normal behaviour but I can't find a way to strip them out.

There's an example of stripping comments and I understand the example,
but what's the *reference* to the above '[', ']' and ',' elements?
For the comma I tried:
   soup.find(text=",").replaceWith("")

but that throws this error:
   AttributeError: 'NoneType' object has no attribute 'replaceWith'
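
One possible reading of both symptoms, offered as a sketch rather than a confirmed diagnosis: findAll() returns a list-like result, and str() of a list wraps the items in "[" and "]" and separates them with ", ". Those characters exist only in the string that str() builds, not in the parsed page, which would also explain why soup.find(text=",") finds nothing and returns None. The example below uses a hypothetical FakeTag class as a stand-in for a BeautifulSoup tag (it is not part of the library) so that it runs without the library installed. Joining the items one by one avoids the list punctuation.

```python
class FakeTag(object):
    """Hypothetical stand-in for a BeautifulSoup tag (not part of the library).

    Like a real tag, converting it to text yields its HTML markup.
    """

    def __init__(self, markup):
        self.markup = markup

    def __repr__(self):
        return self.markup

    __str__ = __repr__


# Pretend this is what soup.findAll('h2') returned.
tags = [
    FakeTag('<h2 class="r"><a href="http://www.someplace.com/">Go Someplace</a></h2>'),
    FakeTag('<h2 class="r">Another heading</h2>'),
]

# str() on the whole list adds "[", "]" and ", " around the items.
as_list_text = str(tags)

# Converting each item separately and joining with newlines adds none of them.
joined = "\n".join(str(tag) for tag in tags)

print(as_list_text)
print(joined)
```

In the working code above, this would mean writing "\n".join(str(tag) for tag in soup.findAll('h2')) to the file instead of str(soup.findAll('h2')).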

Again working with the 'Removing Elements' example I tried:
   soup = BeautifulSoup("you are a banana, banana, banana")
   a = str(",")
   comments = soup.findAll(text=",")
   [",".extract() for "," in comments]
But if I do 'import BeautifulSoup', this gives me the error:
   soup = BeautifulSoup("you are a banana, banana, banana")
   TypeError: 'module' object is not callable
and "import beautifulSoup from BeautifulSoup" does nothing.
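
The TypeError is what Python reports when a module, rather than something inside it, is called. The BeautifulSoup module and the BeautifulSoup class share a name, so 'import BeautifulSoup' binds the module, while 'from BeautifulSoup import BeautifulSoup' (as in the working code above) binds the class. The sketch below reproduces the same situation with the standard library's datetime module, which also contains a class of the same name. It is an analogy chosen so that the example runs anywhere, not BeautifulSoup itself.

```python
import datetime  # binds the *module* named datetime

try:
    datetime(2007, 3, 25)  # calling the module itself fails
except TypeError as error:
    message = str(error)
    print(message)

# Either reach the class through the module...
via_module = datetime.datetime(2007, 3, 25)

# ...or import the class directly, with the keyword "from" first.
from datetime import datetime as datetime_class

direct = datetime_class(2007, 3, 25)
print(via_module == direct)
```

Separately, the loop variable in a list comprehension has to be a name rather than a string, so the last line of the attempt would need to read something like [c.extract() for c in comments].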

Secondly, in the above working code that is just pulling the h2 tags -
how the blazes do I 'prettify' before writing to the file?
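
One way this might be done, as a minimal sketch: call prettify() on each tag and join the results, rather than converting the whole list at once. The example assumes the newer bs4 package (BeautifulSoup 4) is installed, which is not the version imported in the code above; there the import would stay 'from BeautifulSoup import BeautifulSoup' and the search call would be findAll. The function name prettified_h2s and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed


def prettified_h2s(html):
    """Return every <h2> in html, each prettified, as one string."""
    soup = BeautifulSoup(html, "html.parser")
    return "\n".join(tag.prettify() for tag in soup.find_all("h2"))


# Made-up sample page, standing in for tomatoandcream.html.
sample = (
    "<html><body>"
    '<h2 class="r"><a href="http://www.someplace.com/">Go Someplace</a></h2>'
    "<p>Not a heading</p>"
    '<h2 class="r">Another heading</h2>'
    "</body></html>"
)

pretty = prettified_h2s(sample)
print(pretty)
```

The resulting string can then be passed to myFile.write() in place of myTagSearch.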

Thanks in advance!

Mark.

..................

On Mar 25, 6:51 pm, Jorge Godoy <jgo... at gmail.com> wrote:
> m... at agtechnical.co.uk writes:
> > Hi All,
>
> > Apologies for the newbie question but I've searched and tried all
> > sorts for a few days and I'm pulling my hair out ;[
>
> > I have a 'reference' HTML file and a 'test' HTML file from which I
> > need to pull 10 strings, all of which are contained within <h2> tags,
> > e.g.:
> > <h2 class=r><a href="http://www.someplace.com/">Go Someplace</a></h2>
>
> > Once I've found the 10 I'd like to write them to another 'results'
> > html file. Perhaps a 'reference results' and a 'test results' file.
> > From where I would then like to 'diff' the results to see if they
> > match.
>
> > Here's the rub: I cannot find a way to pull those 10 strings so I can
> > save them to the results pages.
> > Can anyone please suggest how this can be done?
>
> > I've tried all sorts but I've been learning Python for 1 week and just
> > don't know enough to mod example scripts, it seems. Don't even get me
> > started on python docs.. ayaa ;] Please feel free to teach me to suck
> > eggs because it's all new to me :)
>
> > Thanks in advance,
>
> > Mark.
>
> Take a look at BeautifulSoup.  It is easy to use and works well with some
> malformed HTML that you might find ahead.
>
> --
> Jorge Godoy      <jgo... at gmail.com>




