Converting HTML to ASCII

Jorgen Grahn jgrahn-nntq at algonet.se
Sat Feb 26 20:46:05 EST 2005


On 26 Feb 2005 02:36:31 -0800, Paul Rubin wrote:
> Jorgen Grahn <jgrahn-nntq at algonet.se> writes:
>> You should probably do what some other poster suggested -- download
>> lynx or some other text-only browser and make your code execute it
>> in -dump mode to get the text-formatted html. You'll get that
>> working in an hour or so, and then you can see if you need something
>> more complicated.
> 
> Lynx is pathetically slow for large files.  It seems to use a
> quadratic algorithm for remembering where the links point, or
> something.  I wrote a very crude but very fast renderer in C that I
> can post if someone wants it, which is what I use for this purpose.

That may be so, but it's fast enough for all the people who use it as a
general html->plaintext tool, so it's probably good enough for the OP.

w3m and links are other options. They provide better formatting than lynx,
and at least w3m has the -dump option.
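
For what it's worth, a minimal sketch of the "execute it in -dump mode"
approach from Python could look like the one below. It assumes w3m or lynx
is installed and on PATH, and that the HTML is fed on stdin ("lynx -dump
-stdin", "w3m -dump -T text/html"). The names DUMP_COMMANDS, find_dumper
and html_to_text are made up for illustration, and the UTF-8 handling is a
simplification; treat it as a starting point, not a finished tool.

```python
import shutil
import subprocess

# Command lines that read HTML on stdin and write rendered text to stdout.
# Assumption: these flags match the installed versions of the browsers.
DUMP_COMMANDS = {
    "w3m": ["w3m", "-dump", "-T", "text/html"],
    "lynx": ["lynx", "-dump", "-stdin"],
}


def find_dumper(preferred=("w3m", "lynx")):
    """Return the name of the first known browser found on PATH, or None."""
    for name in preferred:
        if name in DUMP_COMMANDS and shutil.which(name):
            return name
    return None


def html_to_text(html, browser=None):
    """Render an HTML string to plain text with an external browser."""
    browser = browser or find_dumper()
    if browser is None:
        raise RuntimeError("neither w3m nor lynx found on PATH")
    result = subprocess.run(
        DUMP_COMMANDS[browser],
        input=html.encode("utf-8"),   # simplification: assume UTF-8 input
        stdout=subprocess.PIPE,
        check=True,                   # raise if the browser exits non-zero
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would be something like print(html_to_text(page_source)). Starting
one process per document is crude, but it fits the "get it working in an
hour" idea, and the browser is easy to swap if one turns out too slow.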

I wouldn't mind if there were a reusable library for rendering HTML to text,
usable from various languages. I'd also like to see one (CSS-aware) for
rendering to troff or PostScript.

/Jorgen

-- 
  // Jorgen Grahn <jgrahn@       Ph'nglui mglw'nafh Cthulhu
\X/                algonet.se>   R'lyeh wgah'nagl fhtagn!


