Output of HTML parsing

Tue Jun 19 11:27:28 EDT 2007

Jackie wrote:
> On Jun 15, 2:01, Stefan Behnel <stefan.behnel-n05... at web.de> wrote:
>> Jackie wrote:
> 
>> import lxml.etree as et
>> url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
>> tree = et.parse(url)
>>
> 
>> Stefan
> 
> Thank you. But when I tried to run the above part, the following
> message showed up:
> 
> Traceback (most recent call last):
>   File "D:\TS\Python\workspace\eco_department\lxml_ver.py", line 3, in <module>
>     tree = et.parse(url)
>   File "etree.pyx", line 1845, in etree.parse
>   File "parser.pxi", line 928, in etree._parseDocument
>   File "parser.pxi", line 932, in etree._parseDocumentFromURL
>   File "parser.pxi", line 849, in etree._parseDocFromFile
>   File "parser.pxi", line 557, in etree._BaseParser._parseDocFromFile
>   File "parser.pxi", line 631, in etree._handleParseResult
>   File "parser.pxi", line 602, in etree._raiseParseError
> etree.XMLSyntaxError: line 2845: Premature end of data in tag html line 8
> 
> Could you please tell me what went wrong?

Ah, OK, then the page is not actually XHTML but broken HTML, which the
default XML parser rejects. Use this idiom instead:

    parser = et.HTMLParser()
    tree = et.parse(url, parser)
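
If you want to see the difference without touching the network, here is a
minimal sketch. The markup string is made up for illustration (it is not the
Toronto page), and it uses et.fromstring() on an in-memory string rather than
et.parse() on a URL:

```python
import lxml.etree as et

# Hypothetical broken markup: <p>, <body> and <html> are never closed.
broken = "<html><body><h1>Faculty</h1><p>Jackie"

# The default XML parser refuses unclosed tags.
try:
    et.fromstring(broken)
    xml_ok = True
except et.XMLSyntaxError:
    xml_ok = False

# HTMLParser recovers from the missing end tags and still builds a tree.
parser = et.HTMLParser()
root = et.fromstring(broken, parser)
heading = root.findtext(".//h1")

print(xml_ok, root.tag, heading)
```

The same parser object can be passed to et.parse() as shown above, so the
only change to the original script should be the extra argument.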

Stefan