HTML parsing confusion

Paul Boddie paul at
Tue Jan 22 07:37:25 EST 2008

On 22 Jan, 06:31, Alnilam <alni... at> wrote:
> Sorry for the noob question, but I've gone through the documentation
> on, tried some of the diveintopython and boddie's examples,
> and looked through some of the numerous posts in this group on the
> subject and I'm still rather confused. I know that there are some
> great tools out there for doing this (BeautifulSoup, lxml, etc.) but I
> am trying to accomplish a simple task with a minimal (as in nil)
> amount of adding in modules that aren't "stock" 2.5, and writing a
> huge class of my own (or copying one from diveintopython) seems
> overkill for what I want to do.

It's unfortunate that you don't want to install extra modules, but I'd
probably use libxml2dom [1] for what you're about to describe...

> Here's what I want to accomplish... I want to open a page, identify a
> specific point in the page, and turn the information there into
> plaintext. For example, on the www.diveintopython.org page, I want to
> turn the paragraph that starts "Translations are freely
> permitted" (and ends ..."let me know"), into a string variable.
> Opening the file seems pretty straightforward.
> >>> import urllib
> >>> page = urllib.urlopen("")
> >>> source =
> >>> page.close()
> gets me to a string variable consisting of the un-parsed contents of
> the page.

Yes, there may be shortcuts that let some parsers read directly from
the server, but it's always good to have the page text around, anyway.

> Now things get confusing, though, since there appear to be several
> approaches.
> One that I read somewhere was:
> >>> from xml.dom.ext.reader import HtmlLib
> >>> reader = HtmlLib.Reader()
> >>> doc = reader.fromString(source)
> This gets me doc as <HTML Document at 9b4758>
> >>> paragraphs = doc.getElementsByTagName('p')
> gets me all of the paragraph children, and the one I specifically want
> can then be referenced with: paragraphs[5] This method seems to be
> pretty straightforward, but what do I do with it to get it into a
> string cleanly?

With a less sophisticated DOM implementation, you would loop over the
paragraph's descendant nodes, pick out the text nodes, and concatenate
their values.

> >>> from xml.dom.ext import PrettyPrint
> >>> PrettyPrint(paragraphs[5])
> shows me the text, but still in html, and I can't seem to get it to
> turn into a string variable, and I think the PrettyPrint function is
> unnecessary for what I want to do.

Yes, PrettyPrint is for prettyprinting XML. You just want to visit and
collect the text nodes.

>                                    Formatter seems to do what I want,
> but I can't figure out how to link the  "Element Node" at
> paragraphs[5] with the formatter functions to produce the string I
> want as output. I tried some of the htmllib.HTMLParser(formatter
> stuff) examples, but while I can supposedly get that to work with
> formatter a little easier, I can't figure out how to get HTMLParser to
> drill down specifically to the 6th paragraph's contents.

Given that you've found the paragraph above, you just need to write a
recursive function which visits the paragraph's child nodes. If a child
is a text node, the function collects the node's value in a list; if it
is an element, the function visits that element's child nodes in turn;
and so on. The recursive approach is presumably what the formatter
uses, but I can't say that I've really looked at it.
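
A minimal sketch of such a recursive function, assuming only the
standard DOM node interface (nodeType, childNodes, data). The name
collect_text and the sample markup are made up for illustration, and
xml.dom.minidom merely stands in for whichever parser produced the
paragraph, since minidom itself only accepts well-formed markup:

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def collect_text(node, parts=None):
    """Return a list of the text found below node, in document order."""
    if parts is None:
        parts = []
    for child in node.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            # Text node: keep its value.
            parts.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            # Element: visit its children in turn.
            collect_text(child, parts)
    return parts


# Hypothetical stand-in for the paragraph found earlier.
doc = parseString("<div><p>Translations are <b>freely</b> permitted.</p></div>")
paragraph = doc.getElementsByTagName("p")[0]

# Join the collected pieces into one string variable.
text = "".join(collect_text(paragraph)).strip()
```

Collecting the pieces in a list and joining them once at the end avoids
building up the string piece by piece.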

Meanwhile, with libxml2dom, you'd do something like this:

  import libxml2dom
  d = libxml2dom.parseURI("", html=1)
  saved = None

  # Find the paragraphs.
  for p in d.xpath("//p"):

    # Get the text without leading and trailing space.
    text = p.textContent.strip()

    # Save the appropriate paragraph text.
    if text.startswith("Translations are freely permitted") and \
      text.endswith("just let me know."):

      saved = text

The part of this code that saves you from writing the recursive
function mentioned above is the textContent property on the paragraph
element, which returns the text of all of the element's descendants.
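
If installing libxml2dom really isn't an option, something similar can
probably be done with the standard library's event-driven parser. What
follows is only a rough sketch, not a tested recipe for the real page:
the ParagraphText class and the sample markup are invented for
illustration, the import is the one for current Python (in 2.5 the
class lives in the HTMLParser module), and it assumes that paragraphs
are not nested and that each one has a closing tag.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p> element, one string per paragraph."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.paragraphs = []  # text of the paragraphs seen so far
        self._parts = None    # pieces of the open paragraph, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self._parts is not None:
            self.paragraphs.append("".join(self._parts).strip())
            self._parts = None

    def handle_data(self, data):
        # Only keep text that appears inside a paragraph.
        if self._parts is not None:
            self._parts.append(data)


# Made-up sample standing in for the downloaded page source.
source = ("<html><body><p>First.</p>"
          "<p>Translations are <a href='#'>freely permitted</a>; "
          "just let me know.</p></body></html>")

parser = ParagraphText()
parser.feed(source)
parser.close()

# Save the appropriate paragraph text, as in the libxml2dom version.
saved = None
for text in parser.paragraphs:
    if text.startswith("Translations are freely permitted") and \
       text.endswith("just let me know."):
        saved = text
```

Text inside inline elements such as links still reaches handle_data
while the paragraph is open, so it ends up in the same string.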


