Trimming X/HTML files

Walter Dörwald walter at livinglogic.de
Thu Jul 28 06:30:06 EDT 2005


Thomas SMETS wrote:

> Dear,
> 
> I need to parse XHTML/HTML files in various ways:
> ~ _ Removing comments and JavaScript is the first issue
> ~ _ Retrieving the list of form fields to submit is the next item (todo)
> 
> Any idea where I could find this ready-made?

You could try XIST (http://www.livinglogic.de/Python/xist).

Removing comments and javascripts works like this:

---
from ll.xist import xsc, parsers
from ll.xist.ns import html

e = parsers.parseURL("http://www.python.org/", tidy=True)

def removestuff(node, converter):
    # Drop comments entirely
    if isinstance(node, xsc.Comment):
        node = xsc.Null
    # Drop <script> elements that declare themselves as JavaScript
    elif isinstance(node, html.script) and (
        unicode(node["type"]) == u"text/javascript" or
        unicode(node["language"]) == u"Javascript"
    ):
        node = xsc.Null
    return node

e = e.mapped(removestuff)

print e.asBytes()
---
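If you would rather avoid an extra dependency, the same idea can be sketched with the standard library's HTMLParser (shown here in modern Python 3 syntax; the `Stripper` class name and its structure are my own illustration, not part of any library):

```python
from html.parser import HTMLParser

class Stripper(HTMLParser):
    """Re-emit markup while dropping comments and <script> elements."""

    def __init__(self):
        super().__init__()
        self.out = []           # collected output fragments
        self.in_script = False  # True while inside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        else:
            # Re-serialize the tag; bare attributes have value None
            parts = "".join(
                " %s" % k if v is None else ' %s="%s"' % (k, v)
                for k, v in attrs
            )
            self.out.append("<%s%s>" % (tag, parts))

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_comment(self, data):
        pass  # comments are silently dropped

s = Stripper()
s.feed('<p>Hello<!-- gone --><script>alert(1)</script> world</p>')
print("".join(s.out))
```

This is only a rough sketch: unlike XIST it works on the token stream rather than a tree, so it will not normalize broken markup the way the `tidy=True` parse above does.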

Retrieving the list of fields from all forms on a page might look like this:

---
from ll.xist import xsc, parsers, xfind
from ll.xist.ns import html

e = parsers.parseURL("http://www.python.org/", tidy=True)

for form in e//html.form:
    print "Fields for %s" % form["action"]
    for field in form//xfind.is_(html.input, html.textarea):
        # Prefer the id attribute; fall back to name
        if "id" in field.attrs:
            print "\t%s" % field["id"]
        else:
            print "\t%s" % field["name"]
---
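A dependency-free equivalent can again be sketched with the standard library's HTMLParser (Python 3 syntax; the `FormFields` class name and the id-then-name fallback mirror the XIST version above but are my own illustration):

```python
from html.parser import HTMLParser

class FormFields(HTMLParser):
    """Collect input/textarea names (or ids) per form action."""

    def __init__(self):
        super().__init__()
        self.forms = {}      # action -> list of field names
        self.current = None  # action of the currently open <form>, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.current = attrs.get("action", "")
            self.forms.setdefault(self.current, [])
        elif tag in ("input", "textarea") and self.current is not None:
            # Prefer the id attribute; fall back to name
            field = attrs.get("id") or attrs.get("name")
            if field:
                self.forms[self.current].append(field)

    def handle_endtag(self, tag):
        if tag == "form":
            self.current = None

p = FormFields()
p.feed('<form action="/search">'
       '<input name="q"><textarea name="msg"></textarea>'
       '</form>')
for action, fields in p.forms.items():
    print("Fields for %s" % action)
    for field in fields:
        print("\t%s" % field)
```

Note that this only sees fields that are lexically nested inside the form tags; it does not handle the HTML5 `form` attribute on detached inputs.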

This prints:

Fields for http://www.google.com/search
    q
    domains
    sitesearch
    sourceid
    submit

Hope that helps!

Bye,
    Walter Dörwald
