Python equivalent of lynx -dump?

QuestionExchange USENET at questionexchange.com
Fri Apr 14 09:42:28 EDT 2000


The server only sends the raw HTML. If you want it formatted, you need to
format it yourself, though the standard library does most of the work.
To retrieve the data from the server, you can use urlopen from urllib.
You could alternatively use httplib, but that's generally only necessary
if you're doing something really fancy and HTTP-specific.

Once you've got the HTML, you can use htmllib to do the parsing. It needs
a "formatter", which in turn needs a "writer" (see the formatter module at
http://www.python.org/doc/current/lib/module-formatter.html). The
formatter module has an AbstractFormatter and a DumbWriter, which are both
pretty basic, but reasonably close to what "lynx -dump" does. If you want
better formatting, you can write your own formatter and/or writer.

Here's some sample code that does basically what you want. Note that I use
a StringIO, since DumbWriter wants to write to a file, but you want the
value in a string:

    from urllib import urlopen
    from StringIO import StringIO
    import htmllib
    import formatter
    # first, retrieve the HTML (url is the address of the page you want)...
    html = urlopen(url).read()
    # create a "string file"...
    outfile = StringIO()
    # create a writer and formatter...
    myWriter = formatter.DumbWriter(outfile)
    myFormatter = formatter.AbstractFormatter(myWriter)
    # now parse and format the HTML...
    parser = htmllib.HTMLParser(myFormatter)
    parser.feed(html)
    parser.close()
    # get the formatted output as a string
    data = outfile.getvalue()
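
The sample above relies on htmllib and formatter, which exist only in
older Pythons (htmllib was removed in Python 3.0 and formatter in 3.10).
On a current Python, roughly the same effect can be sketched with
html.parser and textwrap. This is only a minimal sketch, not a
reimplementation of "lynx -dump": the names TextDumper, dump_html,
BLOCK_TAGS and SKIP_TAGS are made up for illustration, the tag lists are
an arbitrary subset, and links, lists and tables get no special treatment.

```python
from html.parser import HTMLParser
import textwrap

# Tags that end a block of text (an illustrative subset, not a complete list).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose contents should not appear in the dump.
SKIP_TAGS = {"script", "style"}


class TextDumper(HTMLParser):
    """Hypothetical helper: collects visible text, one string per block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # finished blocks of text
        self.current = []     # text pieces of the block being built
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def _end_block(self):
        # Collapse runs of whitespace, as a browser would.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._end_block()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._end_block()

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.append(data)

    def close(self):
        super().close()
        # Flush any text that was not followed by a block tag.
        self._end_block()


def dump_html(html, width=72):
    """Return the text of an HTML string, wrapped to `width` columns."""
    parser = TextDumper()
    parser.feed(html)
    parser.close()
    return "\n\n".join(textwrap.fill(block, width) for block in parser.blocks)
```

To dump a live page you would pass dump_html the decoded page text, for
example the result of urlopen(url).read().decode(...) with urlopen taken
from urllib.request; the right encoding depends on the page.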

-- 
  This answer is courtesy of QuestionExchange.com
  http://www.questionexchange.com/servlet3/qx.usenetGuest.showUsenetGuest?ans_id=13247&cus_id=USENET


