Documentation/Examples about the htmllib?

Doug Fort dougfort at downright.com
Thu Mar 8 14:44:53 EST 2001


htmllib seems to want to pretty-print only.

To get actual tags, we use sgmllib. I've attached the module we use to
extract <form> tag stuff. Getting <area> tags should be similar.
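
For the <area> case, a minimal sketch along the same lines might look like
the following. It is not the attached module: sgmllib was removed in
Python 3, so this uses html.parser.HTMLParser from the standard library
instead, which has a similar handler style. The names AreaLinkParser and
area_keys are made up for illustration, and the sketch assumes the "key" is
whatever follows the "?" in the href.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AreaLinkParser(HTMLParser):
    """Collect the href attribute of every <area> tag (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "area":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def area_keys(html_text):
    """Return the query string of each <area href=...> found in html_text."""
    parser = AreaLinkParser()
    parser.feed(html_text)
    parser.close()
    # Assumption: the wanted "key" is the part of the href after the "?".
    return [urlsplit(href).query for href in parser.hrefs]


sample = '<map><area href="/html/page.html?key" shape="rect"></map>'
print(area_keys(sample))
```

With the sample string above, area_keys returns a one-element list holding
"key". Collecting every href first and splitting afterwards keeps the parser
itself trivial, so picking the "right" tag is just a matter of filtering the
list.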
--
Doug Fort <dougfort at downright.com>
Senior Meat Manager
Downright Software LLC
http://www.dougfort.net


Hermann Himmelbauer wrote:

> Hi,
> I want to parse an HTML page with Python, so I thought using htmllib
> would be good for that task.
>
> In my html-page I have this tag:
> <area href="/html/page.html?key" .... >
>
> What I want to do is extract this link parameter "key".
>
> It would be perfect if htmllib extracted all the <area ...> tags
> into a list so that I could simply find the right tag and extract the key.
>
> I did manage to get data between a tag like <title> data </title> but could
> not get the data "in" the tag itself. Of course I could do this with a
> regular expression but I thought using this module would give me better
> results, what do you think?
>
> Does anyone have a clue?
>
>                 Best Regards,
>                 Hermann
>
> --
>  ,_,
> (O,O)     "There is more to life than increasing its speed."
> (   )     -- Gandhi
> -"-"--------------------------------------------------------------



-------------- next part --------------
#!/usr/bin/env python
"""
FormFieldParser

This object parses HTML text and builds a dictionary of
dictionaries of form fields

$Id: formfieldparser.py,v 1.1 2001/01/26 15:18:30 dougfort Exp $
"""
__author__="Downright Software LLC"
__version__="$Revision: 1.1 $"[11:-2]

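import sgmllib

# The imports below are not used by FormFieldParser itself. The
# webnudge.* modules are not part of the standard library.
import string
import cStringIO
import urllib
import re

import webnudge.util.misc
import webnudge.util.document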

class FormFieldParserException(Exception):
    def __init__(self, message):
        self._message = message
    def __str__(self):
        return self._message

###########################################################
class FormFieldParser(sgmllib.SGMLParser):
###########################################################
    """
    FormFieldParser class. Parse a page from a website,
    creating a dictionary of dictionaries of form
    fields
    """

    #----------------------------------------------------------
    def __init__(self):
    #----------------------------------------------------------
        """
        Constructor
        """
        sgmllib.SGMLParser.__init__(self)

        self._formcount = 0
        self._formdict = {}

    #----------------------------------------------------------
    def parse(self, text):
    #----------------------------------------------------------
        """
        parse some text, without trashing javascript
        """
        self.feed(text)
        self.close()
        return self._formdict
        
    #----------------------------------------------------------
    def start_form(self,attributes):
    #----------------------------------------------------------
        """
        start a form
        """
        self._formdict[self._formcount] = {}

    #----------------------------------------------------------
    def end_form(self):
    #----------------------------------------------------------
        """
        end a form
        """
        self._formcount += 1
        
    #----------------------------------------------------------
    def _storeformfield(self,attributes,multivalue=0):
    #----------------------------------------------------------
        """
        Capture name and value attributes of a form field
        """
        tagname = None
        tagvalue = ""
        selected = 0
        for key, value in attributes:
            if key == "name":
                tagname = value
                continue
            if key == "value":
                tagvalue = value
                continue
            if key == "selected":
                selected = 1
                continue
        if multivalue and not selected:
            return
            
        if tagname:
            # setdefault: a field may appear outside any <form> element
            self._formdict.setdefault(self._formcount, {})[tagname] = tagvalue
        
    #----------------------------------------------------------
    def do_input(self,attributes):
    #----------------------------------------------------------
        """
        Capture <input> element
        """
        self._storeformfield(attributes)
        
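    #----------------------------------------------------------
    def do_option(self,attributes):
    #----------------------------------------------------------
        """
        Capture <option> element. An <option> normally has no name
        of its own, so a selected option is stored under the name
        of the most recent <select>
        """
        selectname = getattr(self, "_selectname", None)
        if selectname:
            attributes = [("name", selectname)] + [
                (key, value) for key, value in attributes if key != "name"]
        self._storeformfield(attributes, multivalue=1)
        
    #----------------------------------------------------------
    def do_select(self,attributes):
    #----------------------------------------------------------
        """
        Capture <select> element: remember its name for the
        <option> elements that follow
        """
        self._selectname = None
        for key, value in attributes:
            if key == "name":
                self._selectname = value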
        
    #----------------------------------------------------------
    def do_textarea(self,attributes):
    #----------------------------------------------------------
        """
        Capture <textarea> element
        """
        self._storeformfield(attributes)
        
#----------------------------------------------------------
if __name__ == "__main__":
#----------------------------------------------------------
    # Code for commandline testing
    import sys
    if len(sys.argv) != 2:
        print "Usage:  formfieldparser.py <url>"
        sys.exit(-1)

    import webnudge.util.rawhtmlpage
    page = webnudge.util.rawhtmlpage.RawHTMLPage()
    page.load(sys.argv[1])
    if not page:
        print "*** Error *** %s" % (page._message)
        sys.exit(-1)

    result = FormFieldParser().parse(page._data)
    
    sys.stdout.write(repr(result))
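
The module above targets Python 2 and needs the webnudge package for its
command-line test. To try the same idea on an in-memory string, a rough
Python 3 sketch could look like this. It is an approximation rather than a
port of the attachment: it uses html.parser.HTMLParser because sgmllib is
gone from Python 3, it handles only <input> and <textarea>, and the names
FormFields and parse_forms are invented for the example.

```python
from html.parser import HTMLParser


class FormFields(HTMLParser):
    """Build {form_index: {field_name: value}} from HTML text (illustrative)."""

    def __init__(self):
        super().__init__()
        self.forms = {}
        self._index = 0

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.forms[self._index] = {}
        elif tag in ("input", "textarea"):
            fields = dict(attrs)
            name = fields.get("name")
            if name:
                # setdefault covers fields that appear outside any <form>.
                form = self.forms.setdefault(self._index, {})
                form[name] = fields.get("value") or ""

    def handle_endtag(self, tag):
        # Fields after </form> belong to the next form index.
        if tag == "form":
            self._index += 1


def parse_forms(html_text):
    """Parse html_text and return the dictionary of form dictionaries."""
    parser = FormFields()
    parser.feed(html_text)
    parser.close()
    return parser.forms


page = """
<form action="/login">
  <input type="text" name="user" value="guest">
  <input type="hidden" name="token" value="abc123">
</form>
<form action="/search"><input name="q"></form>
"""
print(parse_forms(page))
```

For the sample page this yields one dictionary per form, keyed 0 and 1, with
the field names mapped to their value attributes (an empty string when the
attribute is missing).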



