Python equivalent of lynx -dump?
QuestionExchange
USENET at questionexchange.com
Fri Apr 14 09:42:28 EDT 2000
The server only sends the raw HTML. If you want it formatted,
you need to format it yourself --
sort of. To retrieve the data from the server, you can use
urlopen from urllib. You could
alternatively use httplib, but that's generally only necessary
if you're doing something really
fancy and HTTP specific.
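For readers on Python 3, where urlopen lives in urllib.request, the retrieval step might look roughly like the sketch below. The fetch_html name, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration and are not part of the original answer.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the body at `url` as text (hypothetical helper, Python 3)."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in the Content-Type header if there is
        # one; falling back to UTF-8 is an assumption, not a guarantee.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling fetch_html on a page URL returns the raw markup as a str, which still has to be formatted separately.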
Once you've got the HTML, you can use htmllib to do the
parsing. It needs a "formatter",
which in turn needs a "writer" (see the formatter module at
http://www.python.org/doc/current/lib/module-formatter.html).
The
formatter module has an AbstractFormatter and a DumbWriter,
which are both pretty basic,
but reasonably close to what "lynx -dump" does. If you want
better formatting, you can
write your own formatter and/or writer.
Here's some sample code that does basically what you want. Note
that I use a StringIO,
since DumbWriter wants to write to a file, but you want the
value in a string:
from urllib import urlopen
from StringIO import StringIO
import htmllib
import formatter
# the page to fetch (example value only; substitute your own URL)
url = "http://www.python.org/"
# first, retrieve the HTML...
html = urlopen(url).read()
# create a "string file"...
outfile = StringIO()
# create a writer and formatter...
myWriter = formatter.DumbWriter(outfile)
myFormatter = formatter.AbstractFormatter(myWriter)
# now parse and format the HTML...
parser = htmllib.HTMLParser(myFormatter)
parser.feed(html)
parser.close()
# get the formatted output (the method name is all lowercase)
data = outfile.getvalue()
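The code above targets Python 2. htmllib was removed in Python 3, and the formatter module was removed in Python 3.10, so on a current interpreter a similar plain-text dump has to be built another way. Below is a minimal sketch using the standard html.parser module instead. The TextDumper class, the html_to_text function, and the choice of which tags start a new line are all assumptions for illustration. It is much cruder than "lynx -dump": it does no word wrapping and no link numbering.

```python
from html.parser import HTMLParser
from io import StringIO


class TextDumper(HTMLParser):
    """Collect visible text, starting a new line at block-level tags."""

    # Which tags force a line break is a guess, not an exhaustive list.
    BLOCK_TAGS = {"p", "br", "div", "li", "tr",
                  "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = StringIO()   # plays the role of the "string file"
        self.skip_depth = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.out.write("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.out.write("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.write(data)


def html_to_text(html):
    """Return the text content of an HTML string, one block per line."""
    parser = TextDumper()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace and drop lines that end up empty.
    lines = [" ".join(line.split())
             for line in parser.out.getvalue().splitlines()]
    return "\n".join(line for line in lines if line)
```

For example, html_to_text("<h1>Title</h1><p>Hello <b>world</b></p>") yields two lines of text, "Title" and "Hello world".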
--
This answer is courtesy of QuestionExchange.com
http://www.questionexchange.com/servlet3/qx.usenetGuest.showUsenetGuest?ans_id=13247&cus_id=USENET