Querying a complex website

7stud bbxx789_05ss at yahoo.com
Wed Feb 20 20:06:46 EST 2008


7stud wrote:
> schweet1 wrote:
> > On Feb 19, 4:04 pm, 7stud <bbxx789_0... at yahoo.com> wrote:
> > > schweet1 wrote:
> > > > Greetings,
> > >
> > > > I am attempting to use python to submit a query to the following URL:
> > >
> > > >https://ramps.uspto.gov/eram/patentMaintFees.do
> > >
> > > > The page looks simple enough - it requires submitting a number into 2
> > > > form boxes and then selecting from the pull down.
> > >
> > > > However, my test scripts have been hung up, apparently due to the
> > > > several buttons on the page having the same name.  Ideally, I would
> > > > have the script use the "Get Bibliographic Data" link.
> > >
> > > > Any assistance would be appreciated.
> > >
> > > > ~Jon
> > >
> > > This is the section you are interested in:
> > >
> > > -------------
> > > <tr>
> > > <td colspan=3><input type="submit" name="maintFeeAction"
> > > value="Retrieve Fees to Pay"> </td>
> > > </tr>
> > >
> > > <tr>
> > > <td colspan=3><input type="submit" name="maintFeeAction"
> > > value="Get Bibliographic Data"> </td>
> > > </tr>
> > >
> > > <tr>
> > > <td colspan=3><input type="submit" name="maintFeeAction"
> > > value="View Payment Windows"> </td>
> > > </tr>
> > > <tr>
> > > ------------
> > >
> > > 1) When you click on a submit button on a web page, a request is sent
> > > out for the web page listed in the action attribute of the <form> tag,
> > > which in this case is:
> > >
> > > <form name="mfInputForm" method="post"
> > > action="/eram/getMaintFeesInfo.do;jsessionid=0000-MCoYNbJsaUCr2VfzZhKILX:11g0uepfb">
> > >
> > > The url specified in the action attribute is a relative url.  The
> > > current url in the address bar of your browser window is:
> > >
> > > https://ramps.uspto.gov/eram/patentMaintFees.do
> > >
> > > and if you compare that to the url in the action attribute of the
> > > <form> tag:
> > >
> > > ---------
> > > https://ramps.uspto.gov/eram/patentMaintFees.do
> > >
> > > /eram/getMaintFeesInfo.do;jsessionid=0000-MCoYNbJsaUCr2VfzZhKILX:11g0uepfb
> > > ---------
> > >
> > > you can piece them together and get the absolute url:
> > >
> > > https://ramps.uspto.gov/eram/getMaintFeesInfo.do;jsessionid=0000-MCoY...
> > >
> > > 2) When you click on a submit button, a request is sent to that url.
> > > The request will contain all the information you entered into the form
> > > as name/value pairs.  The name is whatever is specified in the name
> > > attribute of a tag and the value is whatever is entered into the form.
> > >
> > > Because the submit buttons in the form have name attributes, the name
> > > and value of the particular submit button that you click will be added
> > > to the request.
> > >
> > > 3) To programmatically mimic what happens in your browser when you
> > > click on the submit button of a form, you need to send a request
> > > directly to the url listed in the action attribute of the <form>.
> > > Your request will contain the name/value pairs that would have been
> > > sent to the server if you had actually filled out the form and clicked
> > > on the 'Get Bibliographic Data' submit button.  The form contains
> > > these input elements:
> > >
> > > ----
> > > <input type="text" name="patentNum" maxlength="7" size="7" value="">
> > >
> > > <input type="text" name="applicationNum" maxlength="8" size="8"
> > > value="">
> > > ----
> > >
> > > and the submit button you want to click on is this one:
> > >
> > > <input type="submit" name="maintFeeAction"
> > > value="Get Bibliographic Data">
> > >
> > > So the name value pairs you need to include in your request are:
> > >
> > > data = {
> > >     'patentNum':'1234567',
> > >     'applicationNum':'08123456',
> > >     'maintFeeAction':'Get Bibliographic Data'
> > > }
> > >
> > > Therefore, try something like this:
> > >
> > > import urllib
> > >
> > > data = {
> > >     'patentNum':'1234567',
> > >     'applicationNum':'08123456',
> > >     'maintFeeAction':'Get Bibliographic Data'
> > > }
> > >
> > > enc_data = urllib.urlencode(data)
> > > url = 'https://ramps.uspto.gov/eram/getMaintFeesInfo.do;jsessionid=0000-MCoYNbJsaUCr2VfzZhKILX:11g0uepfb'
> > >
> > > f = urllib.urlopen(url, enc_data)
> > >
> > > print f.read()
> > > f.close()
> > >
> > > If that doesn't work, you may need to deal with cookies that the
> > > server requires in order to keep track of you as you navigate from
> > > page to page.  In that case, please post a valid patent number and
> > > application number, so that I can do some further tests.
> >
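
For what it's worth, steps 1) and 3) above can be checked without touching the server. This is only a sketch: it uses the Python 3 names (urllib.parse) rather than the Python 2 urllib module in the quoted code, and the jsessionid is the stale one copied from the quoted form, so the resulting url is illustrative only.

```python
from urllib.parse import urljoin, urlencode

page_url = 'https://ramps.uspto.gov/eram/patentMaintFees.do'
# Stale jsessionid copied from the quoted <form> tag; a live page will differ.
action = '/eram/getMaintFeesInfo.do;jsessionid=0000-MCoYNbJsaUCr2VfzZhKILX:11g0uepfb'

# Step 1: resolve the relative action url against the page url.
absolute_url = urljoin(page_url, action)

# Step 3: the name/value pairs, encoded as a form submission body.
data = {
    'patentNum': '1234567',
    'applicationNum': '08123456',
    'maintFeeAction': 'Get Bibliographic Data',
}
enc_data = urlencode(data)

print(absolute_url)
print(enc_data)
```

Note that the spaces in the submit button's value come out as '+' in the encoded data, which is how form data is normally encoded.
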
> > Thanks all - I think there are cookie issues - here's an example data
> > pair to play with: 6,725,879 (10/102,919).  I'll post some of the code
> > I've tried asap.
>
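
If cookies do turn out to matter, one way to have them carried from the form page to the next request is an opener with a cookie jar. This is a sketch using the Python 3 module names (http.cookiejar and urllib.request; in Python 2 they are cookielib and urllib2). Nothing in it is specific to the USPTO site, no request is actually sent, and the names in the commented-out usage lines are placeholders.

```python
import http.cookiejar
import urllib.request

# The jar stores any cookies the server sets in its responses.
cookie_jar = http.cookiejar.CookieJar()

# An opener built with HTTPCookieProcessor sends the stored cookies
# back on every later request made through the same opener.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

# Usage (not executed here; the variable names are placeholders):
#   response1 = opener.open(form_page_url)
#   response2 = opener.open(next_page_url, enc_data_as_bytes)

print(len(cookie_jar))  # number of cookies held so far
```
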
>
>
> Ok.  Here is what your form looks like without all the <tr> and <td>
> tags:
>
> -------------
> <form name="mfInputForm" method="post"
> action="/eram/getMaintFeesInfo.do;jsessionid=0000U8dQaywwUaYMMuwsl8h4WsX:11g0uehq7">
>
> <input type="text" name="patentNum" maxlength="7" size="7" value="">
> <input type="text" name="applicationNum" maxlength="8" size="8"
> value="">
>
> <input type="hidden" name="signature"
> value="52371786cafc8b58d140bb03ae5a1210">
> <input type="hidden" name="loadTime" value="1203546696130">
> <input type="hidden" name="sessionId" value="U8dQaywwUaYMMuwsl8h4WsX">
>
> <input type="submit" name="maintFeeAction" value="Retrieve Fees to Pay">
> <input type="submit" name="maintFeeAction" value="Get Bibliographic Data">
> <input type="submit" name="maintFeeAction" value="View Payment Windows">
> <input type="submit" name="maintFeeAction" value="View Statement">
>
> for Payment Window:
> <select name="maintFeeYear">
>      <option value="04" selected="selected">04</option>
>      <option value="08">08</option>
>      <option value="12">12</option>
> </select>
>
> </form>
> ----------------
>
>
> First notice that there is a <select> tag at the bottom that contains
> some information that would be included in the request if you filled
> out the form by hand and clicked on the submit button.  As a result,
> the name/value pair of that <select> tag needs to be included in your
> request.  That requires that you add the following data to your
> request:
>
> 'maintFeeYear':'04'       #...or whatever you want the value to be
>
>
> Also notice that there are 'hidden' form fields in the form.  They
> look like this:
>
> <input type='hidden' ....>
>
> A hidden form field is not visible on a web page, but just the same
> its name/value pair gets sent to the server when the user submits the
> form.  As a result, you need to include the name/value pairs of the
> hidden form fields in your request.  It so happens that one of the
> hidden form fields is named 'sessionId'.  That id identifies you as
> you move from page to page.  If you click on a link or a button on a
> page, a request is sent out for another page, and if the request does
> not contain that sessionId, then the request is rejected.
>
> What that means is: you cannot submit a request directly for the page
> you want.  First, you have to send out a request for the page with the
> form on it and then extract some information from it.  What you need
> to do is:
>
> 1) Request the form page.
>
> 2) Extract the name/value pairs in the hidden form fields on the form
> page. BeautifulSoup is good for doing things like that.  You need to
> add those name/value pairs to the dictionary containing the patent
> number and the application number.
>
> 3) The url in the action attribute of the form looks like this:
>
> action="/eram/getMaintFeesInfo.do;jsessionid=0000U8dQaywwUaYMMuwsl8h4WsX:11g0uehq7"
>
> Note how there is a 'jsessionid' on the end.  What that means is: the
> url in the action attribute changes every time you go to the form
> page.  As a consequence, you cannot know that url beforehand.  Because
> the information you want is at the url listed in the action attribute,
> you have to extract that url from the form page as well.  Once again,
> BeautifulSoup makes that easy to do.
>
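
Steps 2) and 3) can be tried offline against a saved copy of the form. The sketch below uses only the Python 3 standard library (html.parser and urllib) rather than BeautifulSoup, so it runs with nothing extra installed. The FormScraper class is a made-up name for this example, the HTML is a trimmed copy of the form quoted above, and the request is built but never sent.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlencode
from urllib.request import Request

class FormScraper(HTMLParser):
    """Record the form's action url and its hidden name/value pairs."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and attrs.get('name') == 'mfInputForm':
            self.action = attrs.get('action')
        elif tag == 'input' and attrs.get('type') == 'hidden':
            if attrs.get('name'):
                self.hidden[attrs['name']] = attrs.get('value', '')

# Trimmed copy of the form quoted above; a live page will carry
# a different jsessionid, signature, loadTime and sessionId.
page_url = 'https://ramps.uspto.gov/eram/patentMaintFees.do'
html = '''
<form name="mfInputForm" method="post"
 action="/eram/getMaintFeesInfo.do;jsessionid=0000U8dQaywwUaYMMuwsl8h4WsX:11g0uehq7">
<input type="hidden" name="signature" value="52371786cafc8b58d140bb03ae5a1210">
<input type="hidden" name="loadTime" value="1203546696130">
<input type="hidden" name="sessionId" value="U8dQaywwUaYMMuwsl8h4WsX">
</form>
'''

scraper = FormScraper()
scraper.feed(html)

# Step 3: turn the relative action url into an absolute one.
next_page_url = urljoin(page_url, scraper.action)

# Step 2: add the hidden pairs to the data typed into the form.
form_data = {
    'patentNum': '6725879',
    'applicationNum': '10102919',
    'maintFeeYear': '04',
    'maintFeeAction': 'Get Bibliographic Data',
}
form_data.update(scraper.hidden)

# Python 3 wants the POST body as bytes; supplying data makes it a POST.
request = Request(next_page_url, data=urlencode(form_data).encode('ascii'))
print(request.get_method(), next_page_url)
```

Resolving the url with urljoin instead of gluing 'https://ramps.uspto.gov' onto the front also copes with an action attribute that is not rooted at '/'.
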
>
> Once you have 1) all the data that is required, and 2) the proper url
> to send your request to, then you can send out your request.  Here is
> an example:
>
> import urllib
> import BeautifulSoup as bs
>
> #get the form page:
>
> response1 = urllib.urlopen('https://ramps.uspto.gov/eram/patentMaintFees.do')
>
> #extract the url from the action attribute:
>
> html_doc = bs.BeautifulSoup(response1.read())
> form = html_doc.find('form', attrs={'name':'mfInputForm'})
> action_attr_url = form['action']
> next_page_url = 'https://ramps.uspto.gov' + action_attr_url
>
> #create a dictionary for the data:
>
> form_data = {
>     'patentNum':'6725879',
>     'applicationNum':'10102919',
>     'maintFeeYear': '04',  #<select> name/value
>     'maintFeeAction':'Get Bibliographic Data',  #submit button name/value
> }
>
> #extract the data contained in the hidden form fields
> #hidden form fields look like this: <input type='hidden' ...>
>
> hidden_tags = form.findAll('input', type='hidden')
> for tag in hidden_tags:
>     name = tag['name']
>     value = tag['value']
>     print name, value   #if you want to see what's going on
>
>     form_data[name] = value  #add the data to our dictionary
>
> #format the data and send out the request:
>
> enc_data = urllib.urlencode(form_data)
> response2 = urllib.urlopen(next_page_url, enc_data)
>
> print response2.read()
> response2.close()

Throw in a response1.close() here:

> response1 = urllib.urlopen('https://ramps.uspto.gov/eram/patentMaintFees.do')
>
> #extract the url from the action attribute:
>
> html_doc = bs.BeautifulSoup(response1.read())
> response1.close()
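
Another way to make sure the response gets closed, even if the parsing step raises an exception, is contextlib.closing. In this sketch io.BytesIO stands in for the object that urlopen returns, so that the example runs without a network connection, and the HTML bytes are made up.

```python
import io
from contextlib import closing

# Stand-in for urllib's response object (anything with read() and close()).
fake_response = io.BytesIO(b'<form name="mfInputForm"></form>')

# closing() calls fake_response.close() when the block exits,
# whether it exits normally or through an exception.
with closing(fake_response) as response1:
    html = response1.read()

print(fake_response.closed)
```
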


