Extract Title from HTML documents

Mike Meyer mwm at mired.org
Fri Nov 5 01:49:29 EST 2004


Nickolay Kolev <nmkolev at uni-bonn.de> writes:

> Hi all,
>
> I am looking for a way to extract the titles of HTML documents. I have
> made an honest attempt at doing it, and it even works. Is there an
> easier (faster / more efficient / clearer) way?
>
> ------------ START SCRIPT --------------------
>
> #!/usr/bin/python
>
> import sgmllib
>
> class MyParser(sgmllib.SGMLParser):
>
>      inside_title = False
>      title = ''
>
>      def start_title(self, attrs):
>          self.inside_title = True
>
>      def end_title(self):
>          self.inside_title = False
>
>      def handle_data(self, data):
>          if self.inside_title and data:
>              self.title = self.title + data + ' '

I'm pretty sure the trailing "+ ' '" is wrong. At least I never needed
it when I was using sgmllib for this kind of thing.

    <mike
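
A minimal sketch of the same approach without the trailing "+ ' '" follows. It rests on assumptions that are not from this thread: sgmllib was removed in Python 3, so the sketch uses html.parser.HTMLParser from the Python 3 standard library as a stand-in, and the names TitleParser and extract_title are hypothetical. One likely reason the extra space is unwanted is that a parser can deliver the text of a single element in several handle_data calls, so padding each chunk with a space could split a word. Here the chunks are collected unchanged, joined once, and whitespace is normalized at the end.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self.inside_title = False
        self.chunks = []  # raw text pieces, stored exactly as delivered

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.inside_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.inside_title = False

    def handle_data(self, data):
        # May be called more than once for one title, so append
        # the piece as-is instead of padding it with a space.
        if self.inside_title:
            self.chunks.append(data)


def extract_title(html_text):
    """Return the title text with runs of whitespace collapsed."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    sample = "<html><head><title>Hello World</title></head><body></body></html>"
    print(extract_title(sample))
```

Joining the pieces with an empty string keeps a word intact if the parser happens to split it across calls, and the final split/join handles leading, trailing, and repeated whitespace, so no strip() is needed at the call site.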

> p = MyParser()
> p.feed(file('test.html').read())
> p.close()
> print p.title.strip()
>
> ---------------- END SCRIPT -------------------------
>
>
> Many thanks in advance!
>
> Best regards,
> Nickolay Kolev

-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.


