HTML page into a string

Tue Feb 7 23:20:20 EST 2006

Tempo wrote:
> In my last post I received some advice to use urllib.read() to get a
> whole html page as a string, which will then allow me to use
> BeautifulSoup to do what I want with the string. But when I was
> researching the 'urllib' module I couldn't find anything about its
> sub-section '.read()' ? Is that the right module to get a html page
> into a string? Or am I completely missing something here? I'll take
> this as the more likely of the two cases. Thanks for any and all help.
> 
I think you've misunderstood: read() isn't a function in the urllib 
module, it's a method of the object that urllib.urlopen() returns. You 
call urllib.urlopen() with a URL as its argument, and the object it 
returns is file-like (insofar as you can read it to get the content of 
the web page):

  >>> import urllib
  >>> page = urllib.urlopen("http://www.holdenweb.com/")
  >>> data = page.read()
  >>> print data
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
   <meta http-equiv="content-type" content="text/html;charset=ISO-8859-1">
   <meta name="generator" content="Adobe GoLive 6">
   <meta http-equiv="DESCRIPTION" content="Holden Web provides 
architectural design of databases and information systems, with 
full-service implementation and support">
  ...
     </tr>
   </tbody>
</table>
</div>
</body>
</html>
  >>>
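
The session above is Python 2. On Python 3 the same function lives in 
urllib.request, and read() returns bytes rather than a str, so a decode 
step is needed. Here is a minimal sketch of that; the helper name 
fetch_html and the UTF-8 fallback are illustrative assumptions, not part 
of the library:

```python
from urllib.request import urlopen


def fetch_html(url, fallback_encoding="utf-8"):
    """Return the document at `url` as a str (hypothetical helper)."""
    with urlopen(url) as page:      # file-like response object
        raw = page.read()           # bytes on Python 3
        # Use the charset from the Content-Type header if one was sent;
        # the fallback encoding is an assumption, not a library default.
        charset = page.headers.get_content_charset() or fallback_encoding
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("http://www.holdenweb.com/") should then give you the 
whole page as one string, ready to hand to a parser.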

You will find there are lots of other things you can do with that 
file-like object too, but reading it is the important one as far as 
using BeautifulSoup goes.
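
As a rough illustration of those "other things", here is a sketch using 
the Python 3 spelling (urllib.request.urlopen). The helper name 
inspect_page and the choice of which details to collect are assumptions 
made for the example:

```python
from urllib.request import urlopen


def inspect_page(url):
    """Collect a few details from the response object (hypothetical helper)."""
    with urlopen(url) as page:
        return {
            # URL that was actually retrieved (after any redirects)
            "url": page.url,
            # media type from the Content-Type header, e.g. "text/html"
            "content_type": page.headers.get_content_type(),
            # the object can also be read a line at a time; this is bytes
            "first_line": page.readline(),
        }
```

None of this is needed just to get the page into a string; read() alone 
covers that.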

regards
  Steve
-- 
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC                     www.holdenweb.com
PyCon TX 2006                  www.python.org/pycon/