Querying a complex website

7stud bbxx789_05ss at yahoo.com
Wed Feb 20 19:56:47 EST 2008



schweet1 wrote:
> On Feb 19, 4:04 pm, 7stud <bbxx789_0... at yahoo.com> wrote:
> > schweet1 wrote:
> > > Greetings,
> >
> > > I am attempting to use python to submit a query to the following URL:
> >
> > >https://ramps.uspto.gov/eram/patentMaintFees.do
> >
> > > The page looks simple enough - it requires submitting a number into 2
> > > form boxes and then selecting from the pull down.
> >
> > > However, my test scripts have been hung up, apparently due to the
> > > several buttons on the page having the same name.  Ideally, I would
> > > have the script use the "Get Bibliographic Data" link.
> >
> > > Any assistance would be appreciated.
> >
> > > ~Jon
> >
> > This is the section you are interested in:
> >
> > -------------
> > <tr>
> > <td colspan=3><input type="submit" name="maintFeeAction"
> > value="Retrieve Fees to Pay"> </td>
> > </tr>
> >
> > <tr>
> > <td colspan=3><input type="submit" name="maintFeeAction"
> > value="Get Bibliographic Data"> </td>
> > </tr>
> >
> > <tr>
> > <td colspan=3><input type="submit" name="maintFeeAction"
> > value="View Payment Windows"> </td>
> > </tr>
> > <tr>
> > ------------
> >
> > 1) When you click on a submit button on a web page, a request is sent
> > out for the web page listed in the action attribute of the <form> tag,
> > which in this case is:
> >
> > <form name="mfInputForm" method="post"
> > action="/eram/getMaintFeesInfo.do;jsessionid=0000-MCoYNbJsaUCr2VfzZhKILX:11g0uepfb">
> >
> > The url specified in the action attribute is a relative url.  The
> > current url in the address bar of your browser window is:
> >
> > https://ramps.uspto.gov/eram/patentMaintFees.do
> >
> > and if you compare that to the url in the action attribute of the
> > <form> tag:
> >
> > ---------
> > https://ramps.uspto.gov/eram/patentMaintFees.do
> >
> > /eram/getMaintFeesInfo.do;jsessionid=0000-MCoYNbJsaUCr2VfzZhKILX:11g0uepfb
> > ---------
> >
> > you can piece them together and get the absolute url:
> >
> > https://ramps.uspto.gov/eram/getMaintFeesInfo.do;jsessionid=0000-MCoY...
> >
> > 2) When you click on a submit button, a request is sent to that url.
> > The request will contain all the information you entered into the form
> > as name/value pairs.  The name is whatever is specified in the name
> > attribute of a tag and the value is whatever is entered into the form.
> >
> > Because the submit buttons in the form have name attributes, the name
> > and value of the particular submit button that you click will be added
> > to the request.
> >
> > 3) To programmatically mimic what happens in your browser when you
> > click on the submit button of a form, you need to send a request
> > directly to the url listed in the action attribute of the <form>.
> > Your request will contain the name/value pairs that would have been
> > sent to the server if you had actually filled out the form and clicked
> > on the 'Get Bibliographic Data' submit button.  The form contains
> > these input elements:
> >
> > ----
> > <input type="text" name="patentNum" maxlength="7" size="7" value="">
> >
> > <input type="text" name="applicationNum" maxlength="8" size="8"
> > value="">
> > ----
> >
> > and the submit button you want to click on is this one:
> >
> > <input type="submit" name="maintFeeAction"
> > value="Get Bibliographic Data">
> >
> > So the name value pairs you need to include in your request are:
> >
> > data = {
> >     'patentNum':'1234567',
> >     'applicationNum':'08123456',
> >     'maintFeeAction':'Get Bibliographic Data'
> > }
> >
> > Therefore, try something like this:
> >
> > import urllib
> >
> > data = {
> >     'patentNum':'1234567',
> >     'applicationNum':'08123456',
> >     'maintFeeAction':'Get Bibliographic Data'
> > }
> >
> > enc_data = urllib.urlencode(data)
> > url = 'https://ramps.uspto.gov/eram/getMaintFeesInfo.do;jsessionid=0000-MCoYNbJsaUCr2VfzZhKILX:11g0uepfb'
> >
> > f = urllib.urlopen(url, enc_data)
> >
> > print f.read()
> > f.close()
> >
> > If that doesn't work, you may need to deal with cookies that the
> > server requires in order to keep track of you as you navigate from
> > page to page.  In that case, please post a valid patent number and
> > application number, so that I can do some further tests.
>
> Thanks all - I think there are cookie issues - here's an example data
> pair to play with: 6,725,879 (10/102,919).  I'll post some of the code
> I've tried asap.



Ok.  Here is what your form looks like without all the <tr> and <td>
tags:

-------------
<form name="mfInputForm" method="post"
action="/eram/getMaintFeesInfo.do;jsessionid=0000U8dQaywwUaYMMuwsl8h4WsX:11g0uehq7">

<input type="text" name="patentNum" maxlength="7" size="7" value="">
<input type="text" name="applicationNum" maxlength="8" size="8"
value="">

<input type="hidden" name="signature"
value="52371786cafc8b58d140bb03ae5a1210">
<input type="hidden" name="loadTime" value="1203546696130">
<input type="hidden" name="sessionId" value="U8dQaywwUaYMMuwsl8h4WsX">

<input type="submit" name="maintFeeAction" value="Retrieve Fees to Pay">
<input type="submit" name="maintFeeAction" value="Get Bibliographic Data">
<input type="submit" name="maintFeeAction" value="View Payment Windows">
<input type="submit" name="maintFeeAction" value="View Statement">

for Payment Window:
<select name="maintFeeYear">
     <option value="04" selected="selected">04</option>
     <option value="08">08</option>
     <option value="12">12</option>
</select>

</form>
----------------


First notice that there is a <select> tag at the bottom that contains
some information that would be included in the request if you filled
out the form by hand and clicked on the submit button.  As a result,
the name/value pair of that <select> tag needs to be included in your
request.  That requires that you add the following data to your
request:

'maintFeeYear':'04'       #...or whatever you want the value to be
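
For illustration, here is a minimal sketch of how the full set of
name/value pairs, including the <select> value, gets encoded into a
request body.  It assumes Python 3, where urlencode lives in
urllib.parse rather than in the urllib module used elsewhere in this
post.  The patent and application numbers are the ones posted in this
thread, with the punctuation removed.

```python
from urllib.parse import urlencode, parse_qs

# Name/value pairs for the visible fields, the <select>, and the
# submit button we want to "click".
form_data = {
    'patentNum': '6725879',
    'applicationNum': '10102919',
    'maintFeeAction': 'Get Bibliographic Data',
    'maintFeeYear': '04',   # the <select> name/value pair
}

# urlencode() turns the dictionary into the string that is sent as
# the body of the POST request.
body = urlencode(form_data)
print(body)

# Decoding the body again shows that every pair survived the trip.
decoded = parse_qs(body)
```

Spaces in the button value are encoded as '+', which is the usual
encoding for a form post.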


Also notice that there are 'hidden' form fields in the form.  They
look like this:

<input type='hidden' ....>

A hidden form field is not visible on a web page, but its name/value
pair still gets sent to the server when the user submits the form.  As
a result, you need to include the name/value pairs of the hidden form
fields in your request.  It so happens that one of the hidden form
fields is named 'sessionId'.  That id identifies you as you move from
page to page.  When you click on a link or a button on a page, a
request is sent out for another page, and if that request does not
contain the sessionId, then the request is rejected.

What that means is: you cannot submit a request directly for the page
you want.  First, you have to send out a request for the page with the
form on it and then extract some information from it.  What you need
to do is:

1) Request the form page.

2) Extract the name/value pairs in the hidden form fields on the form
page. BeautifulSoup is good for doing things like that.  You need to
add those name/value pairs to the dictionary containing the patent
number and the application number.

3) The url in the action attribute of the form looks like this:

action="/eram/getMaintFeesInfo.do;jsessionid=0000U8dQaywwUaYMMuwsl8h4WsX:11g0uehq7"

Note how there is a 'jsessionid' on the end.  What that means is: the
url in the action attribute changes every time you go to the form
page.  As a consequence, you cannot know that url beforehand.  Because
the information you want is at the url listed in the action attribute,
you have to extract that url from the form page as well.  Once again,
BeautifulSoup makes that easy to do.
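
The two extraction steps can also be sketched without BeautifulSoup.
This is only an illustration, assuming Python 3 and the standard
library's html.parser and urllib.parse.urljoin.  FormScraper is a
made-up helper name, and SAMPLE_HTML is a trimmed copy of the form
shown above rather than a live download.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FORM_PAGE_URL = 'https://ramps.uspto.gov/eram/patentMaintFees.do'

# Trimmed copy of the form from this post (not fetched from the site).
SAMPLE_HTML = '''
<form name="mfInputForm" method="post"
 action="/eram/getMaintFeesInfo.do;jsessionid=0000U8dQaywwUaYMMuwsl8h4WsX:11g0uehq7">
<input type="text" name="patentNum" maxlength="7" size="7" value="">
<input type="hidden" name="signature" value="52371786cafc8b58d140bb03ae5a1210">
<input type="hidden" name="loadTime" value="1203546696130">
<input type="hidden" name="sessionId" value="U8dQaywwUaYMMuwsl8h4WsX">
</form>
'''

class FormScraper(HTMLParser):
    """Hypothetical helper: records a form's action url and hidden inputs."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.in_form = False
        self.action = None
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and attrs.get('name') == self.form_name:
            self.in_form = True
            self.action = attrs.get('action')
        elif tag == 'input' and self.in_form and attrs.get('type') == 'hidden':
            self.hidden[attrs['name']] = attrs.get('value', '')

    def handle_endtag(self, tag):
        if tag == 'form':
            self.in_form = False

scraper = FormScraper('mfInputForm')
scraper.feed(SAMPLE_HTML)

# urljoin() resolves the relative action url against the url of the
# page the form came from, giving the absolute url to post to.
next_page_url = urljoin(FORM_PAGE_URL, scraper.action)

print(next_page_url)
print(scraper.hidden)
```

Using urljoin instead of gluing strings together also copes with action
attributes that do not start with a slash.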


Once you have 1) all the data that is required, and 2) the proper url
to send your request to, then you can send out your request.  Here is
an example:

import urllib
import BeautifulSoup as bs

#get the form page:

response1 = urllib.urlopen('https://ramps.uspto.gov/eram/patentMaintFees.do')

#extract the url from the action attribute:

html_doc = bs.BeautifulSoup(response1.read())
form = html_doc.find('form', attrs={'name':'mfInputForm'})
action_attr_url = form['action']
next_page_url = 'https://ramps.uspto.gov' + action_attr_url

#create a dictionary for the data:

form_data = {
    'patentNum':'6725879',
    'applicationNum':'10102919',
    'maintFeeYear': '04',  #<select> name/value
    'maintFeeAction':'Get Bibliographic Data',  #submit button name/value
}

#extract the data contained in the hidden form fields
#hidden form fields look like this: <input type='hidden' ...>

hidden_tags = form.findAll('input', type='hidden')
for tag in hidden_tags:
    name = tag['name']
    value = tag['value']
    print name, value   #if you want to see what's going on

    form_data[name] = value  #add the data to our dictionary

#format the data and send out the request:

enc_data = urllib.urlencode(form_data)
response2 = urllib.urlopen(next_page_url, enc_data)

print response2.read()
response2.close()
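
If the server turns out to rely on cookies as well, as suspected
earlier in the thread, one option is to send both requests through an
opener that remembers cookies.  A minimal sketch, assuming Python 3
(http.cookiejar and urllib.request); fetch is a hypothetical helper,
and whether this site actually needs cookies is untested.

```python
import http.cookiejar
import urllib.request

# The jar stores any cookies the server sets; the opener sends them
# back automatically on later requests made through it.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar))

def fetch(url, enc_data=None):
    """Hypothetical helper: GET when enc_data is None, otherwise POST."""
    # In Python 3 the request body must be bytes, not str.
    body = enc_data.encode('ascii') if enc_data is not None else None
    with opener.open(url, body) as response:
        return response.read()
```

With this in place, fetch(form_page_url) would stand in for the first
urlopen call and fetch(next_page_url, enc_data) for the second.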







