[Tutor] HTML --> TXT?

Corran Webster cwebster@nevada.edu
Wed, 29 Mar 2000 11:10:42 -0800


At 11:47 AM -0500 29/3/00, Justin Sheehy wrote:
> "Curtis Larsen" <curtis.larsen@Covance.Com> writes:
>
> > Is there a fairly simple Python-ish way to convert an HTML file to text?
>
> Check out the htmllib and formatter modules.  The HTMLParser and
> DumbWriter classes in those respective modules should do what you need.

In particular, the following should do the trick for a basic text-dump to
standard output:

----
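from htmllib import HTMLParser
from formatter import AbstractFormatter, DumbWriter

source = open("myfile.html")

# DumbWriter() with no arguments writes to standard output; the formatter
# turns the parser's events into calls on the writer.
parser = HTMLParser(AbstractFormatter(DumbWriter()))
parser.feed(source.read())
parser.close()    # process any input the parser is still buffering
source.close()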
----

'source' can be any file-like object, such as the one returned by
urllib.urlopen.  For example:

----
from htmllib import HTMLParser
from formatter import AbstractFormatter, DumbWriter
from urllib import urlopen

source = urlopen('http://www.yahoo.com/')

parser = HTMLParser(AbstractFormatter(DumbWriter()))
parser.feed(source.read())
parser.close()
----

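You can also pass DumbWriter an output file and a maximum column (its file
and maxcol arguments, which default to standard output and 72 columns) to
control where the text goes and where lines wrap.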
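
For anyone reading this on a Python where htmllib and formatter are not
available (both were later removed from the standard library), the same idea
of "an output file plus a wrap width" can be sketched roughly with
html.parser and textwrap.  This is only an approximation under that
assumption, not how DumbWriter itself works; TextDumper, BLOCK_TAGS, out and
width are names made up for the illustration.

```python
import io
import textwrap
from html.parser import HTMLParser

# Hypothetical stand-in for HTMLParser(AbstractFormatter(DumbWriter(file, maxcol))).
# Tags that are treated as paragraph boundaries in this sketch.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
              "li", "tr", "title", "blockquote"}


class TextDumper(HTMLParser):
    def __init__(self, out, width=72):
        HTMLParser.__init__(self)
        self.out = out        # any object with a write() method
        self.width = width    # maximum line length for wrapped text
        self.words = []       # words of the paragraph being collected
        self.skip = 0         # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        # Collapse whitespace by collecting individual words.
        if not self.skip:
            self.words.extend(data.split())

    def flush(self):
        # Write the collected paragraph, wrapped to self.width.
        if self.words:
            paragraph = " ".join(self.words)
            self.out.write(textwrap.fill(paragraph, self.width) + "\n\n")
            self.words = []

    def close(self):
        HTMLParser.close(self)
        self.flush()


# Demo: dump a small document into an in-memory file, wrapped at 30 columns.
demo_out = io.StringIO()
dumper = TextDumper(demo_out, width=30)
dumper.feed("<h1>Hello</h1><p>Some text that is long enough to wrap.</p>")
dumper.close()
print(demo_out.getvalue())
```

Passing an open file instead of the StringIO object sends the text to disk,
in the same spirit as giving DumbWriter a file argument.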

More sophisticated behaviour can be achieved by subclassing the writer
and/or formatter classes from the formatter module, or the HTMLParser class
(usually by overriding the start_<tag>, end_<tag> or do_<tag> method for a
specific tag, e.g. start_a and end_a for anchors, or do_br for <br>).  See
the documentation for details of the interfaces.
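
To give a feel for the per-tag method style, here is a rough sketch.  It
assumes a Python where htmllib is not available and html.parser is, so it
rebuilds the start_<tag>/end_<tag> lookup by hand on top of html.parser; it
is not htmllib's own code.  TagDispatchParser and LinkNotingParser are
hypothetical names used only for this example.

```python
from html.parser import HTMLParser


class TagDispatchParser(HTMLParser):
    """Call start_<tag>/end_<tag> methods when a subclass defines them."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []    # text fragments in document order

    def handle_starttag(self, tag, attrs):
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            method(dict(attrs))

    def handle_endtag(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()

    def handle_data(self, data):
        self.pieces.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace.
        return " ".join("".join(self.pieces).split())


class LinkNotingParser(TagDispatchParser):
    """Mark each link in the text with [n] and remember its URL."""

    def __init__(self):
        TagDispatchParser.__init__(self)
        self.links = []
        self.in_link = False

    def start_a(self, attrs):
        href = attrs.get("href")
        if href:
            self.links.append(href)
            self.in_link = True

    def end_a(self):
        if self.in_link:
            self.pieces.append("[%d]" % len(self.links))
            self.in_link = False


parser = LinkNotingParser()
parser.feed('<p>See <a href="http://www.python.org/">the Python site</a> '
            'for details.</p>')
parser.close()
print(parser.text())
for number, url in enumerate(parser.links, 1):
    print("[%d] %s" % (number, url))
```

Only the tags you care about need a method; everything else falls through
to the default behaviour of collecting the text.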

Regards,
Corran