Slurping Web Pages

R.Marquez ny_r_marquez at yahoo.com
Mon Jan 27 12:28:21 EST 2003


I have recently begun looking at going this route for a different
reason.  I am trying to use IE's authentication mechanism to get
through our proprietary Windows authentication method.  I started a
thread about this a while back; if you are interested, here is the
link:

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF8&th=c29034c939dda8f7&rnum=1

Fortunately, back then I was just interested in getting at the HTML
code of the body, so I was able to get a working solution with what
others have shown you so far.  (I would use html =
x.Document.documentElement.outerHTML instead, however, so that you get
the body tag as well.)
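
For anyone following along, here is a minimal sketch of that approach.
It is written in Python 3 syntax and assumes the IE automation object
comes from pywin32 on Windows.  The helper name fetch_outer_html, the
timeout values and the example URL are my own placeholders, not
something from earlier in the thread.

```python
import time

READYSTATE_COMPLETE = 4  # value IE reports once a page has finished loading


def fetch_outer_html(ie, url, timeout=30.0, poll=0.1):
    """Navigate an IE automation object to `url` and return the page HTML.

    `ie` is passed in (rather than created here) so that this function
    itself does not depend on pywin32.
    """
    ie.Navigate(url)
    deadline = time.time() + timeout
    # Poll until IE says the document is fully loaded, or give up.
    while ie.Busy or ie.ReadyState != READYSTATE_COMPLETE:
        if time.time() > deadline:
            raise RuntimeError("timed out loading %s" % url)
        time.sleep(poll)
    # documentElement is the whole <HTML> element, head and body included.
    return ie.Document.documentElement.outerHTML


# Usage on Windows (hypothetical URL):
#   import win32com.client
#   ie = win32com.client.Dispatch("InternetExplorer.Application")
#   html = fetch_outer_html(ie, "http://intranet.example/page.html")
```

Because IE itself does the navigating, whatever authentication it
normally negotiates should still apply, which is the whole point of
going this route.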

Now, I am again looking at this same issue because none of the
standard programs for downloading multiple pages from a site work on
our setup.  They can't get through the authentication.

So, here is what I've found so far in trying to accomplish it through
IE.  (I'm afraid it is going to be an unsightly hack, but if it works
I'll be happy :)

You are going to find web pages that do not display correctly if you
simply download the body; you'll need the style sheet as well.  I
can find the link to the style sheet like this (from an interactive
session):

>>> print ie.Document.documentelement.childnodes[0].nodeName
HEAD
>>> print ie.Document.documentelement.childnodes[0].childnodes[1].nodeName
LINK
>>> print ie.Document.documentelement.childnodes[0].childnodes[1].attributes.getNamedItem("HREF").nodeValue
../../stylesheets/style.css
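
Since the HREF comes back relative, it has to be resolved against the
page's own URL before it is of any use.  The following is only a
sketch (Python 3 syntax, standard library only) that works on the HTML
string instead of walking childnodes by index.  It assumes the LINK
tag carries REL="stylesheet"; the names StylesheetLinkFinder and
stylesheet_urls and the example page URL are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StylesheetLinkFinder(HTMLParser):
    """Collect the HREF of every <LINK REL="stylesheet"> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "link":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if "stylesheet" in rel and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def stylesheet_urls(html, page_url):
    """Return absolute URLs for the style sheets linked from `html`."""
    finder = StylesheetLinkFinder()
    finder.feed(html)
    # urljoin resolves "../.." segments relative to the page's URL.
    return [urljoin(page_url, href) for href in finder.hrefs]
```

With a page at the hypothetical address
http://intranet.example/docs/guide/page.html, the relative link above
would resolve to http://intranet.example/stylesheets/style.css.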

Unfortunately, navigating to that file with IE opens the file in a
separate session of Notepad (and returns None to Python).  (I wonder
if it would do any good to associate .txt files with IE.)

However, I think it may be possible to recreate the stylesheet by
recursively analyzing (parsing) the stylesheet element.  It won't be
easy, but at least it seems doable.  For example:

>>> print ie.Document.stylesheets[0].rules[0].style.color
#000000
(That stands for 'black')
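
As a rough sketch of what that recreation might look like (Python 3
syntax), the function below walks a rules collection and writes one
CSS block per rule.  It assumes each rule exposes selectorText and
style.cssText, as IE's rule objects do, and that the collection has a
length and can be indexed the way rules[0] is above.  The name
rules_to_css is hypothetical, and I have not tried this against a real
page.

```python
def rules_to_css(rules):
    """Rebuild CSS text from an IE-style rules collection.

    `rules` needs a `length` attribute and integer indexing; each rule
    needs `selectorText` and `style.cssText`.
    """
    blocks = []
    for i in range(rules.length):
        rule = rules[i]
        # One "selector { declarations }" block per rule.
        blocks.append("%s { %s }" % (rule.selectorText, rule.style.cssText))
    return "\n".join(blocks)


# Usage against the live document (untested assumption):
#   css = rules_to_css(ie.Document.stylesheets[0].rules)
#   with open("style.css", "w") as f:
#       f.write(css)
```

This only covers plain rules; anything pulled in through @import would
presumably need separate handling.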

The other problem is the images.  I have not found a way to save them.
Python, and even the OS's command prompt, don't have access to the
cache folder.  It is not a regular folder, so you can't just copy
the images from it...

So, as usual, going the Microsoft way seems like a long road ahead,
but again as usual, it seems like the only way for some of us.
Anyway, that's where I'm at.  If you or anyone else has some pointers,
they will be much appreciated.

-Ruben



