not able to read HTTPS page from python

Larry Bates larry.bates at websafe.com
Wed Nov 9 15:52:09 EST 2005


It is possible that the links have been obscured (something
I do on my own web pages) by inserting Javascript that creates
the links on the fly using document.write().  That way web
spiders can't go through the web pages and easily pick up email
addresses to send spam to all my employees.  Just a thought
since you have spent days on this.
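That kind of obscuring defeats any parser that only sees the raw HTML, since a link built client-side by document.write() never appears as a tag in the source. A minimal sketch of the effect (using Python 3's html.parser as a stand-in for sgmllib; the sample markup and addresses are invented for illustration):

```python
from html.parser import HTMLParser  # modern stand-in for sgmllib

class MailtoFinder(HTMLParser):
    """Collect addresses from mailto: links present in the raw HTML."""
    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr, value in attrs:
                if attr == 'href' and value.startswith('mailto:'):
                    self.addresses.append(value[7:])

# A literal link is found ...
plain = '<a href="mailto:bob@example.com">Bob</a>'
# ... but the same link generated by document.write() is just script
# text to the parser: handle_starttag never fires for the inner <a>.
obscured = ('<script>document.write(\'<a href="mailto:bob@'
            'example.com">Bob</a>\')</script>')

for markup in (plain, obscured):
    finder = MailtoFinder()
    finder.feed(markup)
    finder.close()
    print(finder.addresses)
```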

-Larry Bates


muttu2244 at yahoo.com wrote:
> Hi all,
> 
> I am trying to read email addresses that appear as links (clicking
> one opens Outlook with the corresponding address).
> 
> These links are on an HTTPS page, i.e. a secured HTTP page.
> 
> The point is that I am able to read the links from an HTTP page,
> but not from an HTTPS page.
> 
> Using the following sgmllib-based parser I am able to read the links:
> 
> 
> import sgmllib
> 
> DEBUG = False
> mailIdList = []
> 
> class MyParser(sgmllib.SGMLParser):
> 
>     def __init__(self):
>         sgmllib.SGMLParser.__init__(self)
>         self.inside_a = False
>         self.address = ''
>         self.nickname = ''  # initialize so end_a() is safe for links with no text
> 
>     def start_a(self, attrs):
>         if DEBUG:
>             print "start_a"
>             print attrs
>         for attr, value in attrs:
>             if attr == 'href' and value.startswith('mailto:'):
>                 self.address = value[7:]
>         self.inside_a = True
> 
>     def end_a(self):
>         if DEBUG:
>             print "end_a"
>         if self.address:
>             print '"%s" <%s>' % (self.nickname, self.address)
>             mailIdList.append(self.address)
>         self.inside_a = False
>         self.address = self.nickname = ''
> 
>     def handle_data(self, data):
>         if self.inside_a:
>             self.nickname = data
> 
> 
> 
> 
> 
> For proxy authentication and the HTTPS handler I am using the
> following lines of code:
> 
> 
> import urllib2
> 
> authinfo = urllib2.HTTPBasicAuthHandler()
> 
> # Map the proxy for BOTH schemes: with only an "http" entry,
> # urllib2 sends https:// requests directly, bypassing the proxy.
> proxy_support = urllib2.ProxyHandler({
>     "http":  "http://user:password@proxyname:port",
>     "https": "http://user:password@proxyname:port"})
> 
> opener = urllib2.build_opener(proxy_support, authinfo,
>                               urllib2.HTTPSHandler)
> 
> urllib2.install_opener(opener)
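For reference, urllib2 became urllib.request in Python 3, and the same opener would be built as sketched below. The proxy URL here is a placeholder copied from the question, not a real endpoint; the key point is that the proxy must be mapped for both schemes, or https:// requests go direct:

```python
import urllib.request  # urllib2's successor in Python 3

# Placeholder proxy URL; substitute the real user, password, host, port.
proxy_url = "http://user:password@proxyname:port"

proxy_support = urllib.request.ProxyHandler({
    "http":  proxy_url,
    "https": proxy_url,  # without this entry, https:// requests bypass the proxy
})
authinfo = urllib.request.HTTPBasicAuthHandler()

opener = urllib.request.build_opener(proxy_support, authinfo,
                                     urllib.request.HTTPSHandler())
urllib.request.install_opener(opener)
```

Building and installing the opener touches no network, so this can be run as-is; the proxy is only consulted once a URL is actually opened.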
> 
> 
> 
> 
> 
> Then I call the parser on a particular HTTPS page, given as a
> command-line argument, to read all the links on that page:
> 
> 
> 
> p = MyParser()
> for ln in urllib2.urlopen(sys.argv[1]):
>     p.feed(ln)
> p.close()
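The whole pipeline translates to modern Python 3 as sketched below (sgmllib and urllib2 were removed from the standard library; html.parser and urllib.request replace them). The parsing is factored into a function so it can be exercised on a string without network access; the driver at the bottom mirrors the original and works for both http:// and https:// URLs:

```python
import sys
import urllib.request
from html.parser import HTMLParser

class MailtoParser(HTMLParser):
    """Python 3 rewrite of the sgmllib-based MyParser."""
    def __init__(self):
        super().__init__()
        self.inside_a = False
        self.address = ''
        self.nickname = ''
        self.results = []          # list of (nickname, address) pairs

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr, value in attrs:
                if attr == 'href' and value.startswith('mailto:'):
                    self.address = value[7:]
            self.inside_a = True

    def handle_endtag(self, tag):
        if tag == 'a':
            if self.address:
                self.results.append((self.nickname, self.address))
            self.inside_a = False
            self.address = self.nickname = ''

    def handle_data(self, data):
        if self.inside_a:
            self.nickname = data

def extract_mailtos(html_text):
    """Return (nickname, address) pairs for every mailto: link."""
    p = MailtoParser()
    p.feed(html_text)
    p.close()
    return p.results

if __name__ == '__main__':
    # urlopen handles https:// transparently when SSL support is built in.
    with urllib.request.urlopen(sys.argv[1]) as resp:
        charset = resp.headers.get_content_charset() or 'utf-8'
        print(extract_mailtos(resp.read().decode(charset)))
```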
> 
> 
> 
> 
> 
> NOTE: I have installed Python with _ssl support as well.
> 
> 
> 
> 
> 
> So with this code I am able to read the links from an HTTP page, but
> not from the HTTPS page.
> 
> I am not getting any errors either; it simply does not read the links
> present in the given HTTPS page.
> 
> Could you please tell me whether I am doing something wrong in the
> above code with any of the handlers?
> 
> 
> 
> 
> 
> I have been stuck on this for many days; please help me find a
> solution.
> 
>  
> 
> Thanks and regards
> 
> YOGI
> 



More information about the Python-list mailing list