HTML to formatted text conversion function

Wed Jul 25 11:23:42 EDT 2001

Rupert Scammell wrote:

> Recently I've been using a call like os.system("/usr/bin/lynx -dump
> http://www.sample.com > /tmp/site-text.txt") to grab formatted text
> versions of pages (without HTML) for subsequent processing. 
> However, I don't like the fact that this technique introduces an
> additional dependency into my code (lynx). I was wondering if
> anyone could recommend an equivalent Python function or module that
> lets me do this without introducing a platform specific dependency?
> 
> urllib.urlretrieve() gets back the raw HTML page, so it's not
> really helpful to me, except as a starting point for processing.

Use the HTML or SGML parser module. Mind you, I couldn't get the HTML
parser to work [*], but the SGML parser works like a charm. Keep in
mind, though, that it's not a full SGML parser, not by a long shot: it
implements just enough to do HTML processing, and that suits me fine.

Robert Amesz
-- 
[*] It could be just me, but the documentation is very unhelpful, and 
there is no example code to dissect.
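
Since the footnote mentions the lack of example code, here is a minimal
sketch of the parser-based approach. It is not the code used in this
thread. It assumes a current Python 3, where the old sgmllib and htmllib
modules no longer exist and html.parser.HTMLParser is the standard
library's replacement. The names TextExtractor and html_to_text, and the
choice of which tags start a new line, are made up for illustration, and
the output is far plainer than what "lynx -dump" produces.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, breaking lines at block-level tags.

    Hypothetical helper: the tag sets below are an assumption, not a
    complete model of HTML layout.
    """

    BLOCK_TAGS = {"p", "br", "div", "li", "tr", "pre",
                  "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Ignore the contents of script and style elements.
        if not self.skip_depth:
            self.parts.append(data)

    def get_text(self):
        # Collapse runs of whitespace and drop empty lines.
        raw_lines = "".join(self.parts).splitlines()
        lines = (" ".join(line.split()) for line in raw_lines)
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return a plain-text rendering of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

To process a live page, the HTML string could come from
urllib.request.urlopen(url).read() decoded with the page's character
set, then be passed to html_to_text(). Subclassing the parser keeps
everything inside the standard library, which is the point of dropping
the lynx dependency, at the cost of handling tables, links, and
indentation yourself.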