Help - mining secure web pages - for legitimate use

Paul Boddie paul at boddie.net
Tue Jan 22 09:27:38 EST 2002


"Patrick F Harris" <patrickharris5 at home.com> wrote in message news:<SZ238.7981$Qc6.2659433 at news1.rdc2.pa.home.com>...
> All
> 
> I have several web sites with financial data which I have to log in to in
> order to gain access. I want to log in to these sites and access the data
> with a Java program.

Sadly, I can't help you in Java, but I have been working on Python
classes to achieve the same thing. They aren't ready yet, and I'm not
entirely sure that I should release all of them either... convenient
scripting access to certain sites might cause higher loads on those
sites, invite systematic cracking attempts, and infuriate those who
run those sites.

> I know how to use Java to grab a page by URL, but have no idea how to
> automatically log in to a secure site and download pages.

Unfortunately, for sites where the user identifier and password are
submitted as information in a form, you need to know information about
the fields used, as well as the form action. The action may vary,
because the service may be tracking you in some way or other, and it
may be encoding the state of your interactions in the action URL;
alternatively, cookies may be employed, and you might need to store
them when they are issued and present them in subsequent requests.
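As a rough sketch of the first step, the login page itself can be parsed to discover the form action and field names. The page content, the host bank.example and the field names below are all made up for illustration; a real site will differ, and this only looks at the first form on the page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoginFormParser(HTMLParser):
    """Collect the action, method and named inputs of the first form on a page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Hidden inputs often carry session state, so keep their values.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


# Hypothetical login page; the action carries state in its query string.
SAMPLE_PAGE = """
<html><body>
<form action="/login?session=ABC123" method="POST">
  <input type="text" name="userid">
  <input type="password" name="passwd">
  <input type="hidden" name="token" value="xyz">
  <input type="submit" value="Log in">
</form>
</body></html>
"""

parser = LoginFormParser()
parser.feed(SAMPLE_PAGE)
# The action is usually relative, so resolve it against the page's own URL.
action_url = urljoin("https://bank.example/start", parser.action)
```

Because the action may change from visit to visit, it seems safer to read it from a freshly fetched page each time than to hard-code it.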

All of this means that you can't always just invent a URL and request
the contents of it. Moreover, you shouldn't present sensitive
information in a "GET method" request (I believe), since the values
become part of the URL and can end up in server logs and browser
histories; doing a "POST method" request is fairly straightforward
with the Python library these days.
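A minimal sketch of such a POST request, using the standard library modules as they exist in current Python 3 (urllib.request and http.cookiejar), might look like this. The URL and the field names are hypothetical, and the actual network call is left commented out.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores cookies and presents them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(action_url, fields):
    """Put the form fields in the request body; supplying data makes it a POST."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(action_url, data=body)


opener, jar = make_session()
request = build_login_request(
    "https://bank.example/login",             # hypothetical form action
    {"userid": "alice", "passwd": "secret"},  # hypothetical field names
)

# Nothing is sent until the opener is used, for example:
#     with opener.open(request) as response:
#         page = response.read()
# Any cookies issued in the response would then be kept in `jar` and sent
# automatically with later opener.open() calls.
```

Using one opener for the whole session is what keeps the cookies together; a fresh opener for each request would start with an empty jar.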

Another problem that may be encountered is that of redirection. Some
services do lots of redirects to bounce you over to less loaded
servers, or to particular services. Whilst this may be a predictable
process, you may wish to handle redirects in your program.
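The standard library follows redirects by itself, but one way to keep an eye on them is to subclass the redirect handler and record each hop. This is only a sketch: the host names are invented, and the redirect is simulated by calling the handler directly rather than by contacting a server.

```python
import urllib.request


class RecordingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Follow redirects as usual, but remember each (status code, new URL) hop."""

    def __init__(self):
        super().__init__()
        self.hops = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.hops.append((code, newurl))
        # Defer to the default policy for building the follow-up request.
        return super().redirect_request(req, fp, code, msg, headers, newurl)


handler = RecordingRedirectHandler()
# An opener built this way uses the recording handler in place of the default.
opener = urllib.request.build_opener(handler)

# Simulate a single redirect without touching the network.
first = urllib.request.Request("https://www.example/account")
follow_up = handler.redirect_request(
    first, None, 302, "Found", {}, "https://www2.example/account"
)
```

After a real opener.open() call, handler.hops would show where the service bounced the request, which may help in deciding whether a redirect is expected or a sign that the login failed.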

Finally, you need to interpret the HTML in the pages served by the
service, unless you know how to get data of different content types
(and that such data exists). Again, it may not be enough to try and
guess the URLs which yield data, since the service may be "making them
up" in order to maintain state across requests. In addition, the
amount of data given on one particular page may not represent the
entire data set, and you may need to have plans for navigating around
a site, collecting all the data; this depends on the service you're
using, of course.
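To illustrate the navigation problem, here is a small sketch that walks a sequence of pages by following a "next" link found in each one, rather than guessing URLs. It assumes, purely for the example, that the site marks the link with rel="next"; many sites will not, and the test for the right link would have to be adapted. The fetch function is passed in so the example can run against a fake in-memory site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Find the href of the first <a rel="next"> link on a page."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("rel") == "next" and self.next_href is None:
            self.next_href = attrs.get("href")


def collect_pages(start_url, fetch, max_pages=20):
    """Fetch start_url and follow "next" links, returning [(url, html), ...]."""
    pages, url, seen = [], start_url, set()
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = fetch(url)
        pages.append((url, html))
        parser = NextLinkParser()
        parser.feed(html)
        # Resolve relative links against the page they came from.
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return pages


# Hypothetical two-page statement, standing in for real HTTP responses.
FAKE_SITE = {
    "https://bank.example/statement?page=1":
        '<p>rows 1-50</p><a rel="next" href="statement?page=2">Next</a>',
    "https://bank.example/statement?page=2":
        "<p>rows 51-80</p>",
}

pages = collect_pages("https://bank.example/statement?page=1", FAKE_SITE.__getitem__)
```

The max_pages limit and the set of visited URLs are there so that a site which links in a circle, or which generates endless pages, does not keep the script running (and loading the server) forever.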

> I do not think this is illegal as I have legitimate accounts to access these
> sites; I just don't want to log in manually. I want to spool the data for
> offline analysis.

The only legal issues in accessing your own account using your own
program (or script) are likely to arise because you have used a
dedicated program, as opposed to a general Web browser application -
it may be part of the terms and conditions of using a given service
that you restrict yourself to accessing the service only with a Web
browser. See above for a few reasons as to why that might be...

Paul


