Parsing HTML
Walter Dörwald
walter at livinglogic.de
Thu Sep 23 15:27:23 EDT 2004
Richie Hindle wrote:
> [Richie]
>
>>BeautifulSoup is perfect for this job:
>
>
> Um, BeautifulSoup may be perfect, but my script isn't. It fails with the
> Swedish page because it doesn't cope with "<b></b>" appearing in the HTML.
> And I don't know whether you'd consider it correct to extract only the bold
> text from the entries that have bold text. But it gives you a place to start.
> 8-)
Another option might be the HTML parser from libxml2 (www.xmlsoft.org). It
complains about malformed markup but still recovers and returns a document:
>>> import libxml2
>>> doc = libxml2.htmlParseFile("http://www.python.org", None)
http://www.python.org:3: HTML parser error : htmlParseStartTag: invalid element name
<?xml-stylesheet href="./css/ht2html.css" type="text/css"?>
^
>>> doc.serialize()
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "ht...
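
If the libxml2 bindings aren't installed, the same kind of tolerant extraction can be sketched with the standard library's html.parser. This is only a rough illustration of pulling out the bold text while coping with an empty "<b></b>", the case mentioned in the quoted message. The class name, the function name and the sample markup are all made up for the example; nothing here comes from Richie's script.

```python
from html.parser import HTMLParser


class BoldTextExtractor(HTMLParser):
    """Collect the text inside <b>...</b>, skipping empty pairs.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <b> elements currently open
        self.parts = []     # text pieces of the <b> being read
        self.results = []   # finished, non-empty bold strings

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.depth += 1

    def handle_endtag(self, tag):
        # Ignore a stray </b> that has no matching <b>.
        if tag == "b" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self.parts).strip()
                if text:    # "<b></b>" contributes nothing
                    self.results.append(text)
                self.parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <b>.
        if self.depth > 0:
            self.parts.append(data)


def extract_bold(html):
    """Return the non-empty bold strings found in an HTML string."""
    parser = BoldTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up sample input with an empty <b></b> in the middle.
print(extract_bold("<p><b>one</b> x <b></b> <b>two</b></p>"))
```

Whether entries without any bold text should be skipped or reported is a separate decision; this sketch simply skips them.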
Bye,
Walter Dörwald