Extracting data from HTML
Paul Boddie
paul at boddie.net
Fri Jun 7 07:41:35 EDT 2002
Giulio Cespuglio <giulio.agostini.remove.this at libero.it> wrote in message news:<4slvfucpesg4ikfs3dfrjo8deio5de24ph at 4ax.com>...
> > http://www.boddie.org.uk/python/HTML.html
>
> Thanks a lot for this resource. To be honest, I wonder how you could
> work out how to use htmllib, IMHO the documentation is very poor.
> Actually, I am happily using the flexibility of regular expressions to
> do these things ATM, but I'm willing to give this library a try.
I think that each technique has its advantages and disadvantages:
Regular expressions:
  good for mining for data without caring about document structure;
  bad for detecting and reasoning about the document structure

sgmllib (and SAX-like technologies):
  good for mining for data whilst "binding" that data to specific
  elements;
  bad for dealing with complicated document structures

DOM-like technologies:
  good for insisting on particular document structures and for keeping
  these structures intact;
  bad for casual mining of data - effort is required to find data
  before it can be extracted
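
To make the comparison a bit more concrete, here is a rough sketch of the same job (pulling links out of a page) done in each of the three styles. It is only an illustration under some assumptions: the SAMPLE document and all function and class names are made up for this example, html.parser stands in for sgmllib (which current Python versions no longer ship), and xml.dom.minidom stands in for "DOM-like technologies" and will only accept well-formed markup.

```python
import re
from html.parser import HTMLParser
from xml.dom import minidom

# Hypothetical sample page; kept well-formed so that minidom can parse it too.
SAMPLE = (
    '<html><body>'
    '<p>See <a href="http://www.python.org/">Python</a> and '
    '<a href="http://www.boddie.org.uk/python/HTML.html">the HTML notes</a>.</p>'
    '</body></html>'
)


def links_with_regex(text):
    # Pattern mining: finds the data, but knows nothing about where
    # in the document structure each match was found.
    return re.findall(r'href="([^"]*)"', text)


class LinkCollector(HTMLParser):
    # Event-based (SAX-like) parsing: the data is "bound" to the
    # element that is open when the data arrives.
    def __init__(self):
        super().__init__()
        self.links = []     # collected (href, link text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def links_with_parser(text):
    collector = LinkCollector()
    collector.feed(text)
    collector.close()
    return collector.links


def links_with_dom(text):
    # Tree-based parsing: the whole structure is in memory, so we can
    # insist that a link sits directly inside a paragraph, at the cost
    # of having to navigate to the data before extracting it.
    doc = minidom.parseString(text)
    found = []
    for a in doc.getElementsByTagName("a"):
        if a.parentNode.nodeName == "p":
            label = "".join(
                n.data for n in a.childNodes if n.nodeType == n.TEXT_NODE
            )
            found.append((a.getAttribute("href"), label))
    return found
```

The regular expression version returns bare URLs only, whereas the other two return each URL together with the text of the link it belongs to; the DOM version additionally checks where the link sits in the tree.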
So, if regular expressions aren't giving you the control you require,
you may want to consider one of the other technologies.
Paul