Slurping Web Pages

Tony Dunn tdunn at lynxxsolutions.com
Sun Jan 26 07:27:17 EST 2003


Thanks for all the responses!

I've gone the COM route for the first phase of this project, but I will
investigate the *urllib* option for the long term.

The *basic* code I ended up with is:

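import time
import win32com.client

# This is the main loop
# Issues: (1) Only runs in PythonWin
#         (2) str(html) fails if characters with codes > 127 are present,
#             so the text is encoded explicitly (UTF-8 is an assumed choice)

f_out = open(r"G:\temp\Projects\slurpHtml\slurp.txt", "wb")
x = win32com.client.Dispatch('InternetExplorer.Application.1')
x.Visible = 1
x.Navigate('http://www.google.com')

# Navigate() returns before the page has loaded, so wait for IE to finish
while x.Busy or x.ReadyState != 4:    # 4 == READYSTATE_COMPLETE
    time.sleep(0.1)

html = x.Document.documentElement.innerHTML
f_out.write(html.encode('utf-8'))

f_out.close()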

There are a few wrinkles I need to work out, but at least I know now I'm
headed in the right direction...
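
For the longer-term *urllib* route, a cookie-aware fetch might be sketched
roughly as follows. This is only an illustration under stated assumptions: it
uses Python 3's urllib.request and http.cookiejar rather than the modules
available when this was posted, and the helper names make_cookie_opener and
save_page, as well as the URLs and form field names in the usage comment, are
hypothetical. A real site's login form will differ.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def save_page(opener, url, path, data=None):
    """Fetch url through opener and write the raw bytes to path.

    If data (a dict of form fields) is given, it is sent as a POST body.
    Returns the number of bytes written.
    """
    body = None
    if data is not None:
        body = urllib.parse.urlencode(data).encode("ascii")
    with opener.open(url, body) as response:
        content = response.read()
    # Write the bytes untouched, so non-ASCII characters cannot raise an error.
    with open(path, "wb") as f_out:
        f_out.write(content)
    return len(content)


# Hypothetical usage (URLs and field names are placeholders):
#   opener, jar = make_cookie_opener()
#   save_page(opener, "http://www.example.com/login", "login.html",
#             data={"user": "name", "password": "secret"})
#   # Cookies set by the login response are now sent automatically.
#   save_page(opener, "http://www.example.com/members/page1.html", "page1.html")
```

Because the caller passes the destination path for every page, this also
covers choosing a file location and file name per download, without going
through the browser cache.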

-Tony

"Tony Dunn" <tdunn at lynxxsolutions.com> wrote in message
news:J3BY9.2087$Ec.128 at nwrddc02.gnilink.net...
> I've started a new project where I need to slurp web pages from a site that
> uses cookies to authenticate access.  I've used *urllib* in the past to grab
> *public* web pages, but I'm not sure the best way to go about dealing with
> the cookie issue.
>
> I found some code to drive IE via COM, but I can't find a method to save the
> current web page to a file so I can *slurp* it later.  I've wandered through
> the file generated by makepy.py for the *Internet Control* COM object, but I
> don't see what I'm looking for.  I know I can grab the files from the local
> *Internet* cache, but I'd like the option to specify a file location and
> file name for each page downloaded.
>
> Has anyone done this with IE?
>
> -Tony