Extracting data from HTML

Paul Boddie paul at boddie.net
Fri Jun 7 07:41:35 EDT 2002


Giulio Cespuglio <giulio.agostini.remove.this at libero.it> wrote in message news:<4slvfucpesg4ikfs3dfrjo8deio5de24ph at 4ax.com>...
> >  http://www.boddie.org.uk/python/HTML.html
> 
> Thanks a lot for this resource. To be honest, I wonder how you could
> work out how to use htmllib, IMHO the documentation is very poor.
> Actually, I am happily using the flexibility of regular expressions to
> do these things ATM, but I'm willing to give this library a try.

I think that each technique has its advantages and disadvantages:

  Regular expressions:         good for mining for data without caring
                               about document structure;
                               bad for detecting and reasoning about
                               the document structure

  sgmllib
  (and SAX-like technologies): good for mining for data whilst
                               "binding" that data to specific
                               elements;
                               bad for dealing with complicated
                               document structures

  DOM-like technologies:       good for insisting on particular
                               document structures and for keeping
                               these structures intact;
                               bad for casual mining of data - effort
                               is required to find data before it can
                               be extracted
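
To make the first two rows of that comparison a bit more concrete, here is a
rough sketch rather than a definitive recipe. It assumes a current Python,
where sgmllib and htmllib no longer exist, so html.parser.HTMLParser stands
in as the closest event-driven equivalent. The sample page, the example.com
address and the names links_by_regex, LinkCollector and links_by_parser are
all made up for illustration.

```python
import re
from html.parser import HTMLParser

# Made-up sample page: one real link and one link hidden in a comment.
SAMPLE = """<html><body>
<p>See <a href="http://www.boddie.org.uk/python/HTML.html">the HTML notes</a>.</p>
<!-- <a href="http://example.com/old">stale</a> -->
</body></html>"""

# Regular expression: matches anything that looks like an href attribute,
# with no idea which element (or comment) it was found in.
HREF_RE = re.compile(r'href="([^"]*)"')


def links_by_regex(text):
    return HREF_RE.findall(text)


class LinkCollector(HTMLParser):
    """Event-driven parser that binds each href to the text of its <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []    # collected (href, link text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def links_by_parser(text):
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links
```

On this sample the regular expression returns both addresses, including the
one inside the comment, whereas the parser reports only the real link along
with its text. The price is that the parser class has to track its own state
by hand, which is where complicated document structures start to hurt.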

So, if regular expressions aren't giving you the control you require,
you may want to consider one of the other technologies.
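
For the DOM-like end of the scale, a minimal sketch might look like the
following. It assumes the page is well-formed XHTML, since xml.dom.minidom
from the standard library rejects anything that is not well-formed XML; for
real-world "tag soup" you would need a more forgiving parser in front of it.
The sample table and the names cell_text and table_rows are invented for the
example.

```python
from xml.dom import minidom

# Made-up, well-formed sample: a table whose rows must each have two cells.
SAMPLE = """<html><body>
<table id="prices">
<tr><td>apples</td><td>3</td></tr>
<tr><td>pears</td><td>5</td></tr>
</table>
</body></html>"""


def cell_text(node):
    # Join the text nodes that sit directly inside one element.
    return "".join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)


def table_rows(text):
    doc = minidom.parseString(text)
    rows = []
    for tr in doc.getElementsByTagName("tr"):
        # Look only at <td> children, skipping any whitespace text nodes.
        cells = [n for n in tr.childNodes
                 if n.nodeType == n.ELEMENT_NODE and n.tagName == "td"]
        # Insist on the expected structure instead of guessing.
        if len(cells) != 2:
            raise ValueError("expected exactly two cells per row")
        rows.append(tuple(cell_text(c) for c in cells))
    return rows
```

Here the whole document is held as a tree, so the code can insist that every
row has exactly two cells and complain otherwise. Note how much navigation
is needed just to reach two short strings, which is the "effort is required
to find data" part of the trade-off.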

Paul


