How *extract* data from XHTML Transitional web pages? got xml.dom.minidom troubles..

Thomas Dybdahl Ahle lobais at gmail.com
Fri Mar 2 19:04:38 EST 2007


On Fri, 02 Mar 2007 15:32:58 -0800, seberino at spawar.navy.mil wrote:

> I'm trying to extract some data from an XHTML Transitional web page.
> xml.dom.minidom.parseString("text of web page") gives errors about it
> not being well formed XML.
> Do I just need to add something like <?xml ...?> or what?

Since many HTML Transitional pages are very badly formed, a strict XML
parser such as minidom can't really create a DOM from them.
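
To illustrate the failure, here is a minimal sketch (Python 3). The sample
markup is made up, and try_parse is just a hypothetical helper name. It
shows minidom rejecting a fragment with an unclosed tag and accepting the
same fragment once the tag is closed:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Made-up fragment: the <br> is never closed, which is common on real pages
# but not allowed in well-formed XML.
page = "<html><body><p>Hello<br></p></body></html>"


def try_parse(text):
    """Return a DOM document, or None if the text is not well-formed XML."""
    try:
        return minidom.parseString(text)
    except ExpatError as err:
        print("not well-formed:", err)
        return None


broken = try_parse(page)                          # fails on the unclosed <br>
fixed = try_parse(page.replace("<br>", "<br/>"))  # parses once it is closed
```

Patching the text by hand like this only works when you know exactly what
is wrong with the page, which is rarely the case for pages in the wild.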

I've written several grabbers that fetch TV data from HTML pages and
convert it to XML.

Basically there are three ways to get the info:

  # Use find(): If you are only looking for a few pieces of data, you
might be able to find some HTML code that always appears right before the
data you need.

  # Use regular expressions: This can very quickly get all the data from a
table or similar into a nice list. The only problem is that regular
expressions have a somewhat steep learning curve.

  # Use a SAX-style (event-driven) parser: This will iterate through all
HTML items, not caring whether they validate or not. You define a method
to be called each time it finds a tag, a piece of text, etc.
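
As a rough sketch of the first two approaches (Python 3): the listing
below is made up, and text_between is a hypothetical helper, not something
from the standard library. Note that the sample deliberately leaves its
<td> and <tr> tags unclosed, which neither technique cares about.

```python
import re

# Made-up listing in the style of a TV guide page, with unclosed tags.
page = """
<table class="listing">
<tr><td class="time">18:00<td class="title">News
<tr><td class="time">18:30<td class="title">Weather
<tr><td class="time">19:00<td class="title">Film: Brazil
</table>
<p>Channel: <b>DR1</b>
"""


def text_between(text, before, after, start=0):
    """Return the text between the markers `before` and `after`, or None."""
    i = text.find(before, start)
    if i == -1:
        return None
    i += len(before)
    j = text.find(after, i)
    if j == -1:
        return None
    return text[i:j]


# find(): rely on markup that always appears right before the value.
channel = text_between(page, "Channel: <b>", "</b>")

# Regular expression: one pattern pulls every (time, title) pair at once.
row = re.compile(r'<td class="time">([^<]+)<td class="title">([^<\n]+)')
programmes = [(t.strip(), title.strip()) for t, title in row.findall(page)]
```

Both versions break as soon as the site changes its markup, so it pays to
keep the markers and patterns in one place where they are easy to update.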

> What is best way to do this?

In the beginning I mostly used the SAX approach, but it really generates
a lot of code, which is not necessarily more readable than the regular
expressions.
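
To make the comparison concrete, here is a sketch of the event-driven
style using html.parser.HTMLParser from the Python 3 standard library.
That choice is an assumption: I am not naming the parser my grabbers used,
and HTMLParser is not a real SAX implementation, only a similar
callback-based parser that tolerates unclosed tags. The sample markup, the
class names in it and ListingParser are all made up.

```python
from html.parser import HTMLParser

# Made-up listing with unclosed <td> and <tr> tags.
page = """
<table class="listing">
<tr><td class="time">18:00<td class="title">News
<tr><td class="time">18:30<td class="title">Weather
</table>
"""


class ListingParser(HTMLParser):
    """Collect (time, title) pairs from the time and title cells."""

    def __init__(self):
        super().__init__()
        self.field = None      # cell we are inside: "time", "title" or None
        self.current = {}      # cells collected for the row in progress
        self.programmes = []   # finished (time, title) pairs

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, whether or not it is ever closed.
        if tag == "td":
            self.field = dict(attrs).get("class")
        else:
            self.field = None

    def handle_data(self, data):
        # Called for every run of text between tags.
        text = data.strip()
        if self.field in ("time", "title") and text:
            self.current[self.field] = text
            if "time" in self.current and "title" in self.current:
                self.programmes.append(
                    (self.current["time"], self.current["title"]))
                self.current = {}


parser = ListingParser()
parser.feed(page)
parser.close()
```

After feeding the page, parser.programmes holds the pairs. It takes a
whole class with state tracking to do what a one-line pattern does in the
regular expression approach, which is the trade-off I mean.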


