Extract Title from HTML documents

Walter Dörwald walter at livinglogic.de
Fri Nov 5 04:59:43 EST 2004


Nickolay Kolev wrote:

> Hi all,
> 
> I am looking for a way to extract the titles of HTML documents. I have 
> made an honest attempt at doing it, and it even works. Is there an 
> easier (faster / more efficient / clearer) way?

You might try XIST (http://www.livinglogic.de/Python/xist):
---
from ll.xist import parsers, xfind
from ll.xist.ns import html

e = parsers.parseFile("test.html", tidy=True)
print unicode(xfind.first(e//html.title))
---
(This uses libxml2's HTML parser internally.)
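
If installing an extra package is not an option, roughly the same result can be had with the standard library alone. The following is only a minimal sketch, assuming Python 3 and its html.parser module. The names TitleParser and extract_title are made up for illustration and are not part of XIST or any other library. It collects the text of the first title element and collapses runs of whitespace:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element (illustrative helper)."""

    def __init__(self):
        # convert_charrefs=True turns references such as &amp; into text
        super().__init__(convert_charrefs=True)
        self._in_title = False   # currently inside <title>...</title>?
        self._done = False       # has the first title already been closed?
        self._chunks = []        # pieces of text seen inside the title

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        # None means no title element was seen at all
        if not self._done and not self._chunks:
            return None
        # Join the pieces and collapse runs of whitespace into single spaces
        return " ".join("".join(self._chunks).split())


def extract_title(markup):
    """Return the text of the first <title> in an HTML string, or None."""
    parser = TitleParser()
    parser.feed(markup)
    parser.close()
    return parser.title
```

To use it on a file, read the file into a string first, for example extract_title(open("test.html", encoding="utf-8").read()). The encoding here is an assumption and has to match the document. Unlike the libxml2-based approach, this does no repair of broken markup, so it is only a reasonable choice when the title element itself is well formed.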

Bye,
    Walter Dörwald



