Trimming X/HTML files
Walter Dörwald
walter at livinglogic.de
Thu Jul 28 06:30:06 EDT 2005
Thomas SMETS wrote:
> Dear,
>
> I need to parse XHTML/HTML files in various ways:
> _ Removing comments and JavaScript is the first issue.
> _ Retrieving the list of fields to submit is the next item (todo).
>
> Any idea where I could find this ready-made?
You could try XIST (http://www.livinglogic.de/Python/xist).
Removing comments and JavaScript works like this:
---
from ll.xist import xsc, parsers
from ll.xist.ns import html

e = parsers.parseURL("http://www.python.org/", tidy=True)

def removestuff(node, converter):
    # Replace comments and JavaScript <script> elements with the
    # empty node xsc.Null; everything else passes through unchanged.
    if isinstance(node, xsc.Comment):
        node = xsc.Null
    elif isinstance(node, html.script) and \
         (unicode(node["type"]) == u"text/javascript" or
          unicode(node["language"]) == u"Javascript"):
        node = xsc.Null
    return node

e = e.mapped(removestuff)
print e.asBytes()
---
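If you don't have XIST installed, the same idea can be sketched with nothing but the standard library's HTML parser (html.parser in modern Python; the module was called HTMLParser back in 2005). This is only a rough illustration, not a substitute for a real tidying parser, and the class name Stripper and the sample markup are invented here:

```python
from html.parser import HTMLParser

class Stripper(HTMLParser):
    """Re-emits a document, dropping comments and <script> elements."""
    def __init__(self):
        super().__init__()
        self.out = []          # re-emitted markup fragments
        self._in_script = 0    # nesting depth inside <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script += 1
        elif not self._in_script:
            rendered = "".join(
                ' %s="%s"' % (k, v) if v is not None else " %s" % k
                for k, v in attrs)
            self.out.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = max(0, self._in_script - 1)
        elif not self._in_script:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._in_script:
            self.out.append(data)

    def handle_comment(self, data):
        pass  # comments are simply dropped

s = Stripper()
s.feed('<p>hi<!-- gone --><script>alert(1)</script> there</p>')
print("".join(s.out))  # prints: <p>hi there</p>
```

Unlike XIST's mapped() approach, which transforms a parsed tree, this streams through the document and never builds one, so it can't fix up broken markup the way tidy=True does.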
Retrieving the list of fields from all forms on a page might look like this:
---
from ll.xist import xsc, parsers, xfind
from ll.xist.ns import html

e = parsers.parseURL("http://www.python.org/", tidy=True)

for form in e//html.form:
    print "Fields for %s" % form["action"]
    for field in form//xfind.is_(html.input, html.textarea):
        # Prefer the id attribute, fall back to name
        if "id" in field.attrs:
            print "\t%s" % field["id"]
        else:
            print "\t%s" % field["name"]
---
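The same field-collecting idea can also be sketched with the standard library alone (again only an illustration: the class name FormFieldLister and the sample markup are invented, and nested or malformed forms aren't handled):

```python
from html.parser import HTMLParser

class FormFieldLister(HTMLParser):
    """Collects the id (or name) of every input/textarea per form."""
    def __init__(self):
        super().__init__()
        self.forms = {}       # form action -> list of field names
        self._action = None   # action of the form we are inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._action = attrs.get("action", "")
            self.forms.setdefault(self._action, [])
        elif tag in ("input", "textarea") and self._action is not None:
            # Prefer id over name, as in the XIST example above
            field = attrs.get("id") or attrs.get("name")
            if field:
                self.forms[self._action].append(field)

    def handle_endtag(self, tag):
        if tag == "form":
            self._action = None

parser = FormFieldLister()
parser.feed('<form action="/search"><input name="q">'
            '<textarea name="msg"></textarea></form>')
for action, fields in parser.forms.items():
    print("Fields for %s" % action)
    for field in fields:
        print("\t%s" % field)
```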
This prints:
Fields for http://www.google.com/search
	q
	domains
	sitesearch
	sourceid
	submit
Hope that helps!
Bye,
Walter Dörwald