HTML parsing confusion

Tue Jan 22 04:33:43 EST 2008

On Jan 22, 4:31 pm, Alnilam <alni... at gmail.com> wrote:
> Sorry for the noob question, but I've gone through the documentation
> on python.org, tried some of the diveintopython and boddie's examples,
> and looked through some of the numerous posts in this group on the
> subject and I'm still rather confused. I know that there are some
> great tools out there for doing this (BeautifulSoup, lxml, etc.) but I
> am trying to accomplish a simple task with a minimal (as in nil)
> amount of adding in modules that aren't "stock" 2.5, and writing a
> huge class of my own (or copying one from diveintopython) seems
> overkill for what I want to do.
>
> Here's what I want to accomplish... I want to open a page, identify a
> specific point in the page, and turn the information there into
> plaintext. For example, on the www.diveintopython.org page, I want to
> turn the paragraph that starts "Translations are freely
> permitted" (and ends ..."let me know"), into a string variable.
>
> Opening the file seems pretty straightforward.
>
> >>> import urllib
> >>> page = urllib.urlopen("http://diveintopython.org/")
> >>> source = page.read()
> >>> page.close()
>
> gets me to a string variable consisting of the un-parsed contents of
> the page.
> Now things get confusing, though, since there appear to be several
> approaches.
> One that I read somewhere was:
>
> >>> from xml.dom.ext.reader import HtmlLib

Pardon me, but the standard-issue Python 2.n (for n in range(5, 2,
-1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
200-module PyXML package installed. And you don't want the 75 KB
BeautifulSoup?
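
If stock modules really are the hard limit, here is a rough sketch of one
way to do it with nothing but the standard library's HTMLParser. The names
ParagraphGrabber and find_paragraph are made up for illustration, and it
assumes the text you want sits inside a <p> element and that the page is
clean enough for HTMLParser to digest, which real-world HTML often is not.

```python
try:
    from HTMLParser import HTMLParser      # module name in Python 2.x
except ImportError:
    from html.parser import HTMLParser     # module name in Python 3.x


class ParagraphGrabber(HTMLParser):
    """Hypothetical helper: collect the text of each <p> as a plain string."""

    def __init__(self):
        # Explicit base-class call; HTMLParser is an old-style class in 2.x,
        # so super() is not an option there.
        HTMLParser.__init__(self)
        self.paragraphs = []
        self._chunks = None     # None means "not currently inside a <p>"

    def flush(self):
        """Finish the paragraph being collected, if any."""
        if self._chunks is not None:
            # Join the pieces and collapse runs of whitespace and newlines.
            text = " ".join("".join(self._chunks).split())
            self.paragraphs.append(text)
            self._chunks = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()        # an unclosed <p> ends where the next begins
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        # Text inside nested inline tags (<a>, <b>, ...) arrives here too.
        # Note: on 2.x, entities such as &amp; go to handle_entityref
        # instead and are simply dropped by this sketch.
        if self._chunks is not None:
            self._chunks.append(data)


def find_paragraph(source, start):
    """Return the first paragraph whose text begins with `start`, or None."""
    grabber = ParagraphGrabber()
    grabber.feed(source)
    grabber.close()
    grabber.flush()             # pick up a trailing, unclosed <p>
    for text in grabber.paragraphs:
        if text.startswith(start):
            return text
    return None
```

With the `source` string from your urllib session, the call would be
find_paragraph(source, "Translations are freely permitted"). I haven't run
it against the live page, so treat it as a starting point only.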