Parsing HTML

Thu Sep 23 15:27:23 EDT 2004

Richie Hindle wrote:

> [Richie]
> 
>>BeautifulSoup is perfect for this job:
> 
> 
> Um, BeautifulSoup may be perfect, but my script isn't.  It fails with the
> Swedish page because it doesn't cope with "<b></b>" appearing in the HTML.
> And I don't know whether you'd consider it correct to extract only the bold
> text from the entries that have bold text.  But it gives you a place to start.
> 8-)
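
For the empty "<b></b>" problem, here is a minimal sketch of one way to collect only the non-empty bold runs. It uses the standard library's html.parser (Python 3) rather than BeautifulSoup. The names BoldTextExtractor and extract_bold are made up for illustration and are not taken from Richie's script.

```python
from html.parser import HTMLParser


class BoldTextExtractor(HTMLParser):
    """Collect the text found inside <b>...</b> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of <b> elements currently open
        self.parts = []    # text fragments of the bold run being read
        self.results = []  # one entry per non-empty bold run

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self.parts).strip()
                if text:  # skip empty "<b></b>" pairs
                    self.results.append(text)
                self.parts = []

    def handle_data(self, data):
        # Text inside nested tags such as <i> is kept as long as a <b> is open.
        if self.depth > 0:
            self.parts.append(data)


def extract_bold(html):
    """Return the non-empty bold text runs of an HTML string, in document order."""
    parser = BoldTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.results
```

Calling extract_bold() on a page's HTML string returns a list of strings. Whether skipping the entries without bold text is the right behaviour is a separate question, as noted above.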

Another option might be the HTML parser from libxml2 (www.xmlsoft.org), which reports errors in malformed markup but still builds a document:

 >>> import libxml2
 >>> doc = libxml2.htmlParseFile("http://www.python.org", None)
http://www.python.org:3: HTML parser error : htmlParseStartTag: invalid element name
<?xml-stylesheet href="./css/ht2html.css" type="text/css"?>
  ^
 >>> doc.serialize()
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "ht...

Bye,
    Walter Dörwald