BeautifulSoup

Mike Meyer mwm at mired.org
Fri Aug 19 23:20:48 EDT 2005


"Paul McGuire" <ptmcg at austin.rr.com> writes:

> Here's a pyparsing program that reads my personal web page, and spits
> out HTML with all of the HREF's reversed.

Parsing HTML isn't easy, which makes me wonder how good this solution
really is. That's not a comment on the quality of this code or of
PyParsing, just curiosity from someone who does a lot of [X]HTML
herding.

> -- Paul
> (Download pyparsing at http://pyparsing.sourceforge.net.)

If it were in the ports tree, I'd have grabbed it and tried it
myself. But it isn't, so I'm going to be lazy and ask. If PyParsing
really makes dealing with HTML this easy, I may package it as a port
myself.

> from pyparsing import Literal, quotedString
> import urllib
>
> LT = Literal("<")
> GT = Literal(">")
> EQUALS = Literal("=")
> htmlAnchor = (LT + "A" + "HREF" + EQUALS +
>               quotedString.setResultsName("href") + GT)
>
> def convertHREF(s,l,toks):
>     # do HREF conversion here - for demonstration, we will just reverse them
>     print toks.href
>     return "<A HREF=%s>" % toks.href[::-1]
>
> htmlAnchor.setParseAction( convertHREF )
>
> inputURL = "http://www.geocities.com/ptmcg"
> inputPage = urllib.urlopen(inputURL)
> inputHTML = inputPage.read()
> inputPage.close()
>
> print htmlAnchor.transformString( inputHTML )

How well does it deal with other attributes in front of the href, like
<A onClick="..." href="...">?

How about if my HTML has things that look like HTML in attributes,
like <TAG ATTRIBUTE="stuff<A HREF=stuff">?
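
To make the two cases concrete, here is a minimal sketch that runs them
through the standard library's html.parser instead of pyparsing. It is
written in Python 3 syntax, and HrefCollector and collect_hrefs are names
made up for illustration. It only shows what a tokenizer-based parser does
with these inputs; it says nothing about how the pyparsing grammar quoted
above behaves on them.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; attrs is a list of
        # (name, value) pairs in source order.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def collect_hrefs(html):
    """Return every href found on an <a> tag in the given HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Case 1: another attribute in front of the href.
case1 = '<A onClick="go()" HREF="http://example.com/">link</A>'

# Case 2: anchor-like text inside a quoted attribute value of another tag.
case2 = '<TAG ATTRIBUTE="stuff<A HREF=stuff">text</TAG>'

print(collect_hrefs(case1))
print(collect_hrefs(case2))
```

The first case should report the href despite the leading onClick, and the
second should report nothing, since the anchor-like text is only part of an
attribute value.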

     Thanks,
     <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.


