HTML parsing confusion

Alnilam alnilam at gmail.com
Tue Jan 22 00:31:37 EST 2008


Sorry for the noob question, but I've gone through the documentation
on python.org, tried some of the diveintopython and boddie's examples,
and looked through some of the numerous posts in this group on the
subject, and I'm still rather confused. I know that there are some
great tools out there for doing this (BeautifulSoup, lxml, etc.), but
I am trying to accomplish a simple task without adding any modules
that aren't "stock" 2.5. Writing a huge class of my own (or copying
one from diveintopython) seems like overkill for what I want to do.

Here's what I want to accomplish... I want to open a page, identify a
specific point in the page, and turn the information there into
plaintext. For example, on the www.diveintopython.org page, I want to
turn the paragraph that starts "Translations are freely
permitted" (and ends ..."let me know"), into a string variable.

Opening the file seems pretty straightforward.

>>> import urllib
>>> page = urllib.urlopen("http://diveintopython.org/")
>>> source = page.read()
>>> page.close()

gets me a string variable consisting of the un-parsed contents of
the page.

Now things get confusing, though, since there appear to be several
approaches. One that I read somewhere was:

>>> from xml.dom.ext.reader import HtmlLib
>>> reader = HtmlLib.Reader()
>>> doc = reader.fromString(source)

This gets me doc as <HTML Document at 9b4758>

>>> paragraphs = doc.getElementsByTagName('p')

gets me all of the paragraph elements, and the one I specifically want
can then be referenced with paragraphs[5]. This method seems to be
pretty straightforward, but what do I do with it to get it into a
string cleanly?
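
One way to get from an element node to a plain string is to walk its
children and join the text nodes. The following is only a minimal
sketch: node_text is a hypothetical helper name, and the example uses
the stock xml.dom.minidom on a small well-formed stand-in document
rather than HtmlLib. It assumes the nodes returned by HtmlLib expose
the same standard DOM attributes (nodeType, childNodes, data), which
has not been checked here.

```python
from xml.dom import minidom, Node


def node_text(node):
    """Return the concatenated text of node and all of its descendants."""
    # Text and CDATA nodes carry their content in .data.
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return node.data
    # Any other node contributes whatever text its children hold.
    return "".join(node_text(child) for child in node.childNodes)


# Stand-in document (an assumption for illustration): minidom requires
# well-formed XML, so it cannot replace HtmlLib on arbitrary real HTML.
doc = minidom.parseString(
    "<html><body><p>first</p>"
    "<p>Translations are <b>freely</b> permitted.</p></body></html>"
)
paragraphs = doc.getElementsByTagName("p")

# Collapse runs of whitespace so the result is one clean line.
text = " ".join(node_text(paragraphs[1]).split())
print(text)  # prints: Translations are freely permitted.
```

If the assumption about HtmlLib's nodes holds, the same helper could
be called as node_text(paragraphs[5]) on the document above.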

>>> from xml.dom.ext import PrettyPrint
>>> PrettyPrint(paragraphs[5])

shows me the text, but still as HTML, and I can't seem to get it into
a string variable. I also think the PrettyPrint function is
unnecessary for what I want to do. The formatter module seems to do
what I want, but I can't figure out how to link the "Element Node" at
paragraphs[5] with the formatter functions to produce the string I
want as output. I tried some of the htmllib.HTMLParser(formatter)
examples, and while that supposedly works with formatter a little
more easily, I can't figure out how to get HTMLParser to drill down
specifically to the 6th paragraph's contents.
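
For the "drill down to the 6th paragraph" part, a small event-driven
parser can count <p> start tags and keep only the text seen inside the
wanted one. This is a sketch under assumptions, not a definitive
answer: it uses the stock HTMLParser class instead of htmllib plus
formatter, and NthParagraph and paragraph_text are hypothetical names
made up for illustration. On Python 2.x, entities such as &amp; are
reported through a separate handler and are simply dropped here.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.x, including stock 2.5


class NthParagraph(HTMLParser):
    """Collect the text of the n-th <p> element (0-based), ignoring markup."""

    def __init__(self, index):
        HTMLParser.__init__(self)
        self.index = index    # which paragraph to capture
        self.seen = -1        # index of the most recent <p> start tag
        self.inside = False   # True while inside the wanted paragraph
        self.chunks = []      # pieces of text collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.seen += 1
            # A new <p> also ends the previous one, since closing
            # tags are often omitted in real pages.
            self.inside = (self.seen == self.index)

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self.chunks).split())


def paragraph_text(source, index):
    """Return the plain text of the index-th paragraph in an HTML string."""
    parser = NthParagraph(index)
    parser.feed(source)
    parser.close()
    return parser.text()
```

With the page contents in source as above, paragraph_text(source, 5)
would target the same paragraph as paragraphs[5], assuming both
parsers count the <p> elements the same way.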

Thanks in advance.

- A.



