SUBMIT "ACCEPT" button - Python - BeautifulSoup

lazar.michael22 at gmail.com lazar.michael22 at gmail.com
Wed Jan 28 15:19:55 EST 2015


On Wednesday, January 28, 2015 at 8:36:59 AM UTC-8, peter.n... at gmail.com wrote:
> I am totally new to Python and please accept my apologies upfront for potential newbie errors. I am trying to parse a 'simple' web page: http://flow.gassco.no/
> 
> When opening the page for the first time in my browser, I need to confirm the T&C with an accept button. After accepting the T&C I would like to scrape some data from the follow-up page. It appears that when I open http://flow.gassco.no/acceptDisclaimer directly in a browser, I get around the T&C.
> But not when I open the URL via BeautifulSoup.
> 
> My parsing/scraping tool is implemented with BeautifulSoup (bs4), but I fail to parse the content because I am not getting past the T&C. When I print response.text, I get the code below. How do I get around this form for accepting terms & conditions so that I can parse/scrape data from that page?
> 
> Here is what I am doing:
> 
> #!/usr/bin/env python
> import requests
> import bs4
>
> index_url = 'http://flow.gassco.no/acceptDisclaimer'
>
> def get_video_page_urls():
>     # Fetch the page and parse the returned HTML
>     response = requests.get(index_url)
>     soup = bs4.BeautifulSoup(response.text, 'html.parser')
>     return soup
>
> print(get_video_page_urls())
> 
> ++++
> 
> PRINTOUT from response.text:
> 
>    <form action="acceptDisclaimer" method="get">
>      <input class="accept" type="submit" value="Accept"/>
>      <input class="decline" name="decline" onclick="window.location ='http://www.gassco.no'" type="button" value="Decline"/>
>      </form></div></div></div></div></div>
> 
>     <script type="text/javascript">
>     var _gaq = _gaq || [];
>     _gaq.push(['_setAccount', 'UA-30727768-1']);
>     _gaq.push(['_trackPageview']);
> 
>     (function() {
>         var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
>         ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
>         var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
>     })();
> 
> </script>

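Try clearing your browser cookies and then reopening the page; it should send you back to the T&C screen. That suggests the site remembers your acceptance in a session cookie, which a standalone requests.get() call does not send.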

You can use the Session class to keep track of your cookies between requests:

with requests.Session() as s:

    # Request sessionid cookie and store it in the current session
    response = s.get('http://flow.gassco.no')
    
    # Subsequent gets will now include the session cookie 
    response = s.get('http://flow.gassco.no/acceptDisclaimer')
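
Putting the session approach together with the parsing step, a minimal sketch might look like the following. The helper names is_disclaimer_page and fetch_accepted_page, the BASE_URL constant, and the 10-second timeout are my own choices for illustration, and I am assuming the T&C page can be recognised by the form with action="acceptDisclaimer" shown in the printout above. I have not verified how the site behaves beyond that.

```python
import bs4
import requests

BASE_URL = 'http://flow.gassco.no'


def is_disclaimer_page(html):
    """Return True if the HTML still contains the T&C accept form.

    Assumption: the disclaimer page is identified by a <form> whose
    action attribute is "acceptDisclaimer", as in the quoted printout.
    """
    soup = bs4.BeautifulSoup(html, 'html.parser')
    return soup.find('form', attrs={'action': 'acceptDisclaimer'}) is not None


def fetch_accepted_page():
    """Accept the disclaimer inside one session and return the parsed page."""
    with requests.Session() as s:
        # First request: lets the server set its session cookie
        s.get(BASE_URL, timeout=10)
        # Second request: sent with the same cookie jar
        response = s.get(BASE_URL + '/acceptDisclaimer', timeout=10)
        response.raise_for_status()
    if is_disclaimer_page(response.text):
        raise RuntimeError('Still on the T&C page; acceptance did not stick')
    return bs4.BeautifulSoup(response.text, 'html.parser')
```

Calling fetch_accepted_page() should hand back a soup object for the data page, or raise if the accept form is still being served.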

A good place to start when debugging something like this is to open the developer tools in your browser (F12 in Chrome/Firefox) and watch which GET requests are sent as you click the different buttons.
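
You can do a rough version of the same inspection from Python. The sketch below prints each redirect hop and the cookies the session has collected so far; describe_chain and debug_session are hypothetical helper names, and the timeout value is arbitrary.

```python
import requests


def describe_chain(response):
    """Return one 'status url' string per hop: redirects first, final response last."""
    hops = list(response.history) + [response]
    return ['{} {}'.format(r.status_code, r.url) for r in hops]


def debug_session(urls):
    """Fetch each URL in one session, printing the redirect chain and cookies."""
    with requests.Session() as s:
        for url in urls:
            response = s.get(url, timeout=10)
            for line in describe_chain(response):
                print(line)
            # Cookies accumulated by the session up to this point
            print('cookies:', s.cookies.get_dict())
```

For example, debug_session(['http://flow.gassco.no', 'http://flow.gassco.no/acceptDisclaimer']) would show whether a cookie is set on the first request and whether the second one redirects.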


