Parsing HTML

Thu Sep 23 06:41:35 EDT 2004

On Thu, 23 Sep 2004 08:42:08 +0200, Anders Eriksson wrote:

> Hello!
> 
> I want to extract some info from some specific HTML pages, Microsoft's
> International Word List (e.g.
> http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm). I
> want to take all the words, both English and the other language, and create
> a dictionary, so that I can look up About and get Om as the answer.
> 
> What is the best way to do this?

Hi,

If you only want to parse one page, I would use the re module.
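
A rough sketch of the re approach, in case it helps. I have not checked
the markup of the real page, so the sample HTML, the two-cells-per-row
layout and the names ROW_RE, TAG_RE and build_dictionary are all
assumptions made for illustration:

```python
import re

# Hypothetical sample; the real page's markup may well differ.
SAMPLE_HTML = """
<table>
<tr><td>About</td><td>Om</td></tr>
<tr><td>Cancel</td><td>Avbryt</td></tr>
</table>
"""

# Assumes each table row holds exactly two cells: English, then translation.
ROW_RE = re.compile(
    r"<tr[^>]*>\s*<td[^>]*>(.*?)</td>\s*<td[^>]*>(.*?)</td>\s*</tr>",
    re.IGNORECASE | re.DOTALL,
)
# Strips any inline tags (e.g. <b>) left inside a cell.
TAG_RE = re.compile(r"<[^>]+>")


def build_dictionary(html):
    """Map each English word to its translation."""
    words = {}
    for english, translated in ROW_RE.findall(html):
        key = TAG_RE.sub("", english).strip()
        value = TAG_RE.sub("", translated).strip()
        if key:
            words[key] = value
    return words


word_list = build_dictionary(SAMPLE_HTML)
print(word_list["About"])
```

The pattern will need adjusting once you look at the actual page source.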

If you want to parse many HTML pages, you can run them through tidy to
produce XML and then use an XML parser. There are too many ways HTML can
be broken for regular expressions to handle reliably.
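
The tidy route might look roughly like this. It is only a sketch: it
assumes the pages were already converted to well-formed XHTML (for
example with "tidy -asxml"), that each row has two cells, and the sample
text and the name words_from_xhtml are made up for illustration. The
parser here is the standard library's xml.etree.ElementTree, but any XML
parser would do:

```python
import xml.etree.ElementTree as ET

# tidy's XHTML output puts every element in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical tidied output; the real page's structure may differ.
TIDIED = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Word list</title></head>
<body>
<table>
<tr><td>About</td><td>Om</td></tr>
<tr><td><b>Cancel</b></td><td>Avbryt</td></tr>
</table>
</body>
</html>"""


def words_from_xhtml(xhtml_text):
    """Map each English word to its translation, one table row per pair."""
    root = ET.fromstring(xhtml_text)
    words = {}
    for row in root.iter(XHTML_NS + "tr"):
        cells = row.findall(XHTML_NS + "td")
        if len(cells) != 2:
            continue  # skip header rows or rows with another layout
        # itertext() collects text from nested inline elements as well.
        english, translated = ("".join(c.itertext()).strip() for c in cells)
        if english:
            words[english] = translated
    return words


print(words_from_xhtml(TIDIED)["Cancel"])
```

Named entities such as &nbsp; may still trip the XML parser, so check
tidy's output before relying on this.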

HTH,
 Thomas