Extract Title from HTML documents

Nickolay Kolev nmkolev at uni-bonn.de
Thu Nov 4 15:44:28 EST 2004


Hi all,

I am looking for a way to extract the titles of HTML documents. I have 
made an honest attempt at doing it, and it even works. Is there an 
easier (faster / more efficient / clearer) way?

------------ START SCRIPT --------------------

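#!/usr/bin/python

import sgmllib

class MyParser(sgmllib.SGMLParser):

    # True while the parser is between <title> and </title>
    inside_title = False
    # Text collected from inside the title element
    title = ''

    def start_title(self, attrs):
        # Called by SGMLParser when an opening <title> tag is seen
        self.inside_title = True

    def end_title(self):
        # Called by SGMLParser when the closing </title> tag is seen
        self.inside_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks; join them with spaces
        if self.inside_title and data:
            self.title = self.title + data + ' '

p = MyParser()
f = open('test.html')
p.feed(f.read())
f.close()
p.close()
# Drop the trailing space added by handle_data
print p.title.strip()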

---------------- END SCRIPT -------------------------
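
For comparison, here is a minimal sketch of the same flag-based idea written against html.parser.HTMLParser from the Python 3 standard library, where sgmllib is no longer available. The names TitleParser and extract_title are made up for illustration and are not part of any library. The sketch assumes the document has already been decoded to a str, and it keeps only the first title element it finds; treat it as a starting point rather than a tested solution for malformed HTML.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.inside_title = False  # True between <title> and </title>
        self.done = False          # True once the first title has been closed
        self.parts = []            # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lower case by HTMLParser
        if tag == 'title' and not self.done:
            self.inside_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.inside_title:
            self.inside_title = False
            self.done = True

    def handle_data(self, data):
        if self.inside_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the title text with runs of whitespace collapsed (hypothetical helper)."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return ' '.join(''.join(parser.parts).split())


print(extract_title('<html><head><title> Test\n page </title></head></html>'))
# prints: Test page
```

To use it on a file, the contents could be read with open('test.html', encoding='utf-8') and passed to extract_title; the encoding given here is an assumption and depends on the document.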


Many thanks in advance!

Best regards,
Nickolay Kolev


