HTML parsing confusion

Alnilam alnilam at gmail.com
Wed Jan 23 07:40:14 EST 2008


On Jan 23, 3:54 am, "M.-A. Lemburg" <m... at egenix.com> wrote:

> >> I was asking this community if there was a simple way to use only the
> >> tools included with Python to parse a bit of html.
>
> There are lots of ways of doing HTML parsing in Python. A common
> one is, e.g., using mxTidy to convert the HTML into valid XHTML
> and then using ElementTree to parse the data.
>
> http://www.egenix.com/files/python/mxTidy.html
> http://docs.python.org/lib/module-xml.etree.ElementTree.html
>
> For simple tasks you can also use the HTMLParser that's part
> of the Python std lib.
>
> http://docs.python.org/lib/module-HTMLParser.html
>
> Which tools to use is really dependent on what you are
> trying to solve.
>
> --
> Marc-Andre Lemburg
> eGenix.com
>
> Professional Python Services directly from the Source  (#1, Jan 23 2008)
> >>> Python/Zope Consulting and Support ...        http://www.egenix.com/
> >>> mxODBC.Zope.Database.Adapter ...            http://zope.egenix.com/
> >>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
>
> ________________________________________________________________________
>
> :::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::
>
>    eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
>     D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
>            Registered at Amtsgericht Duesseldorf: HRB 46611

Thanks. So far that makes 3 votes for BeautifulSoup, and one vote each
for libxml2dom, pyparsing, and mxTidy. I'm sure those would all be
great solutions if I were looking to solve my coding question with
external modules.

Several folks have now mentioned that, if my files are valid XHTML,
I could use htmllib, HTMLParser, or ElementTree (all of which are
part of the standard library in Python 2.5).

Skipping past HTML validation and HTML-to-XHTML 'cleaning', and
instead starting with the assumption that I have files that are valid
XHTML, can anyone give me a good example of how I would use htmllib,
HTMLParser, or ElementTree to parse out the text of one specific
childNode, similar to the examples that I provided above using regex?
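
To make the question concrete, here is a minimal sketch of the kind of
standard-library-only extraction being asked about. Everything in it is
an assumption for illustration: the sample document, the id value
"content", and the helper names text_by_id_etree, text_by_id_parser and
TextById are hypothetical and do not come from this thread. It is
written for current Python, where the HTMLParser class lives in
html.parser (in Python 2.5 the module itself was named HTMLParser).
htmllib is left out because it no longer exists in Python 3.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample document; the real files are not shown in this thread.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title></head>
  <body>
    <div id="header">Ignore me</div>
    <div id="content"><p>First <b>bold</b> paragraph.</p></div>
  </body>
</html>"""


def text_by_id_etree(source, element_id):
    """ElementTree: return all text inside the first element with this id."""
    root = ET.fromstring(source)  # requires well-formed XML/XHTML
    for elem in root.iter():
        if elem.get("id") == element_id:
            # itertext() walks the element and its descendants in order.
            return "".join(elem.itertext())
    return None


class TextById(HTMLParser):
    """HTMLParser: collect text inside the first element with a given id.

    Assumes every opened tag is closed (true for valid XHTML).
    """

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0       # > 0 while inside the target element
        self.done = False    # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id_parser(source, element_id):
    """Run TextById over the source and return the collected text."""
    parser = TextById(element_id)
    parser.feed(source)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(text_by_id_etree(XHTML, "content"))
    print(text_by_id_parser(XHTML, "content"))
```

The ElementTree version is shorter but raises an error on anything that
is not well-formed XML; the HTMLParser version is event-driven, so the
nesting depth has to be tracked by hand.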


