HTML parsing confusion

Tue Jan 22 07:37:25 EST 2008

On 22 Jan, 06:31, Alnilam <alni... at gmail.com> wrote:
> Sorry for the noob question, but I've gone through the documentation
> on python.org, tried some of the diveintopython and boddie's examples,
> and looked through some of the numerous posts in this group on the
> subject and I'm still rather confused. I know that there are some
> great tools out there for doing this (BeautifulSoup, lxml, etc.) but I
> am trying to accomplish a simple task with a minimal (as in nil)
> amount of adding in modules that aren't "stock" 2.5, and writing a
> huge class of my own (or copying one from diveintopython) seems
> overkill for what I want to do.

It's unfortunate that you don't want to install extra modules, but I'd
probably use libxml2dom [1] for what you're about to describe...

> Here's what I want to accomplish... I want to open a page, identify a
> specific point in the page, and turn the information there into
> plaintext. For example, on thewww.diveintopython.orgpage, I want to
> turn the paragraph that starts "Translations are freely
> permitted" (and ends ..."let me know"), into a string variable.
>
> Opening the file seems pretty straightforward.
>
> >>> import urllib
> >>> page = urllib.urlopen("http://diveintopython.org/")
> >>> source = page.read()
> >>> page.close()
>
> gets me to a string variable consisting of the un-parsed contents of
> the page.

Yes, there may be shortcuts that let some parsers read directly from
the server, but it's always good to have the page text around, anyway.

> Now things get confusing, though, since there appear to be several
> approaches.
> One that I read somewhere was:
>
> >>> from xml.dom.ext.reader import HtmlLib
> >>> reader = HtmlLib.Reader()
> >>> doc = reader.fromString(source)
>
> This gets me doc as <HTML Document at 9b4758>
>
> >>> paragraphs = doc.getElementsByTagName('p')
>
> gets me all of the paragraph children, and the one I specifically want
> can then be referenced with: paragraphs[5] This method seems to be
> pretty straightforward, but what do I do with it to get it into a
> string cleanly?

In less sophisticated DOM implementations, what you'd do is to loop
over the "descendant" nodes of the paragraph which are text nodes and
concatenate them.

> >>> from xml.dom.ext import PrettyPrint
> >>> PrettyPrint(paragraphs[5])
>
> shows me the text, but still in html, and I can't seem to get it to
> turn into a string variable, and I think the PrettyPrint function is
> unnecessary for what I want to do.

Yes, PrettyPrint is for prettyprinting XML. You just want to visit and
collect the text nodes.

>                                    Formatter seems to do what I want,
> but I can't figure out how to link the  "Element Node" at
> paragraphs[5] with the formatter functions to produce the string I
> want as output. I tried some of the htmllib.HTMLParser(formatter
> stuff) examples, but while I can supposedly get that to work with
> formatter a little easier, I can't figure out how to get HTMLParser to
> drill down specifically to the 6th paragraph's contents.

Given that you've found the paragraph above, you just need to write a
recursive function which visits child nodes, and if it finds a text
node then it collects the value of the node in a list; otherwise, for
elements, it visits the child nodes of that element; and so on. The
recursive approach is presumably what the formatter uses, but I can't
say that I've really looked at it.

Meanwhile, with libxml2dom, you'd do something like this:

  import libxml2dom
  d = libxml2dom.parseURI("http://www.diveintopython.org/", html=1)
  saved = None

  # Find the paragraphs.
  for p in d.xpath("//p"):

    # Get the text without leading and trailing space.
    text = p.textContent.strip()

    # Save the appropriate paragraph text.
    if text.startswith("Translations are freely permitted") and \
      text.endswith("just let me know."):

      saved = text
      break

The magic part of this code which saves you from needing to write that
recursive function mentioned above is the textContent property on the
paragraph element.

Paul

[1] http://www.python.org/pypi/libxml2dom