Querying a complex website

schweet1 jon.kappes at gmail.com
Fri Feb 22 11:46:28 EST 2008


On Feb 20, 6:06 pm, 7stud <bbxx789_0... at yahoo.com> wrote:
> 7stud wrote:
> > schweet1 wrote:
> > > On Feb 19, 4:04 pm, 7stud <bbxx789_0... at yahoo.com> wrote:
> > > > schweet1 wrote:
> > > > > Greetings,
>
> > > > > I am attempting to use python to submit a query to the following URL:
>
> > > > > https://ramps.uspto.gov/eram/patentMaintFees.do
>
> > > > > The page looks simple enough - it requires submitting a number into 2
> > > > > form boxes and then selecting from the pull down.
>
> > > > > However, my test scripts have been hung up, apparently due to the
> > > > > several buttons on the page having the same name.  Ideally, I would
> > > > > have the script use the "Get Bibliographic Data" button.
>
> > > > > Any assistance would be appreciated.
>
> > > > > ~Jon
>
> > > > This is the section you are interested in:
>
> > > > -------------
> > > > <tr>
> > > > <td colspan=3><input type="submit" name="maintFeeAction"
> > > > value="Retrieve Fees to Pay"> </td>
> > > > </tr>
>
> > > > <tr>
> > > > <td colspan=3><input type="submit" name="maintFeeAction"
> > > > value="Get Bibliographic Data"> </td>
> > > > </tr>
>
> > > > <tr>
> > > > <td colspan=3><input type="submit" name="maintFeeAction"
> > > > value="View Payment Windows"> </td>
> > > > </tr>
> > > > <tr>
> > > > ------------
>
> > > > 1) When you click on a submit button on a web page, a request is sent
> > > > out for the web page listed in the action attribute of the <form> tag,
> > > > which in this case is:
>
> > > > <form name="mfInputForm" method="post"
> > > > action="/eram/getMaintFeesInfo.do;jsessionid=0000-MCoYNbJsaUCr2VfzZhKILX:11g0uepfb">
>
> > > > The url specified in the action attribute is a relative url.  The
> > > > current url in the address bar of your browser window is:
>
> > > > https://ramps.uspto.gov/eram/patentMaintFees.do
>
> > > > and if you compare that to the url in the action attribute of the
> > > > <form> tag:
>
> > > > ---------
> > > > https://ramps.uspto.gov/eram/patentMaintFees.do
>
> > > > /eram/getMaintFeesInfo.do;jsessionid=0000-MCoYNbJsaUCr2VfzZhKILX:11g0uepfb
> > > > ---------
>
> > > > you can piece them together and get the absolute url:
>
> > > > https://ramps.uspto.gov/eram/getMaintFeesInfo.do;jsessionid=0000-MCoY...
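
The piecing-together described above is what urljoin does.  A minimal
sketch, assuming Python 3 (where the function lives in urllib.parse; the
code in this thread is Python 2) and reusing the session id quoted above,
which is only an example value:

```python
from urllib.parse import urljoin

# The page that contains the form (the address-bar url)
page_url = "https://ramps.uspto.gov/eram/patentMaintFees.do"

# The relative url from the form's action attribute, as quoted above
action = "/eram/getMaintFeesInfo.do;jsessionid=0000-MCoYNbJsaUCr2VfzZhKILX:11g0uepfb"

# Because the action starts with '/', it replaces the whole path of page_url
absolute_url = urljoin(page_url, action)
print(absolute_url)
```

urljoin also copes with action attributes that are already absolute or
that are relative to the current directory, which plain string
concatenation does not.
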
>
> > > > 2) When you click on a submit button, a request is sent to that url.
> > > > The request will contain all the information you entered into the form
> > > > as name/value pairs.  The name is whatever is specified in the name
> > > > attribute of a tag and the value is whatever is entered into the form.
>
> > > > Because the submit buttons in the form have name attributes, the name
> > > > and value of the particular submit button that you click will be added
> > > > to the request.
>
> > > > 3) To programmatically mimic what happens in your browser when you
> > > > click on the submit button of a form, you need to send a request
> > > > directly to the url listed in the action attribute of the <form>.
> > > > Your request will contain the name/value pairs that would have been
> > > > sent to the server if you had actually filled out the form and clicked
> > > > on the 'Get Bibliographic Data' submit button.  The form contains
> > > > these input elements:
>
> > > > ----
> > > > <input type="text" name="patentNum" maxlength="7" size="7" value="">
>
> > > > <input type="text" name="applicationNum" maxlength="8" size="8"
> > > > value="">
> > > > ----
>
> > > > and the submit button you want to click on is this one:
>
> > > > <input type="submit" name="maintFeeAction"
> > > > value="Get Bibliographic Data">
>
> > > > So the name value pairs you need to include in your request are:
>
> > > > data = {
> > > >     'patentNum': '1234567',
> > > >     'applicationNum': '08123456',
> > > >     'maintFeeAction': 'Get Bibliographic Data',
> > > > }
>
> > > > Therefore, try something like this:
>
> > > > import urllib
>
> > > > data = {
> > > >     'patentNum': '1234567',
> > > >     'applicationNum': '08123456',
> > > >     'maintFeeAction': 'Get Bibliographic Data',
> > > > }
>
> > > > enc_data = urllib.urlencode(data)
> > > > url = ('https://ramps.uspto.gov/eram/'
> > > >        'getMaintFeesInfo.do;jsessionid=0000-MCoYNbJsaUCr2VfzZhKILX:11g0uepfb')
>
> > > > # passing data as the second argument makes urlopen send a POST
> > > > f = urllib.urlopen(url, enc_data)
>
> > > > print f.read()
> > > > f.close()
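
For readers on Python 3, a hedged sketch of the same request: urlencode
moved to urllib.parse and urlopen to urllib.request, and the POST body has
to be bytes.  The request is only built here, not sent, and the session id
in the url is the example value quoted above, not a live one.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Name/value pairs for the two text boxes and the chosen submit button
data = {
    "patentNum": "1234567",
    "applicationNum": "08123456",
    "maintFeeAction": "Get Bibliographic Data",
}

# urlencode joins the pairs with '&' and turns spaces into '+';
# Python 3 requires the body as bytes
body = urlencode(data).encode("ascii")

url = ("https://ramps.uspto.gov/eram/"
       "getMaintFeesInfo.do;jsessionid=0000-MCoYNbJsaUCr2VfzZhKILX:11g0uepfb")

# Supplying data turns the request into a POST
req = Request(url, data=body)
print(req.get_method())
print(body.decode("ascii"))

# urllib.request.urlopen(req) would actually send it
```
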
>
> > > > If that doesn't work, you may need to deal with cookies that the
> > > > server requires in order to keep track of you as you navigate from
> > > > page to page.  In that case, please post a valid patent number and
> > > > application number, so that I can do some further tests.
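
If cookies do turn out to be required, one way to handle them is an opener
with a cookie jar, so that cookies set by the form page are sent back with
the next request.  This is a sketch under the assumption of Python 3's
standard library, not something used in this thread, and nothing is
fetched here.

```python
import http.cookiejar
import urllib.request

# The jar collects cookies from every response the opener receives
jar = http.cookiejar.CookieJar()

# HTTPCookieProcessor stores Set-Cookie headers in the jar and adds
# matching Cookie headers to later requests made through this opener
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Intended use (not executed here):
#   page = opener.open('https://ramps.uspto.gov/eram/patentMaintFees.do')
#   result = opener.open(next_page_url, post_body_bytes)
print(len(jar))  # the jar stays empty until a response sets a cookie
```
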
>
> > > Thanks all - I think there are cookie issues - here's an example data
> > > pair to play with: 6,725,879 (10/102,919).  I'll post some of the code
> > > I've tried asap.
>
> > Ok.  Here is what your form looks like without all the <tr> and <td>
> > tags:
>
> > -------------
> > <form name="mfInputForm" method="post"
> > action="/eram/getMaintFeesInfo.do;jsessionid=0000U8dQaywwUaYMMuwsl8h4WsX:11g0uehq7">
>
> > <input type="text" name="patentNum" maxlength="7" size="7" value="">
> > <input type="text" name="applicationNum" maxlength="8" size="8"
> > value="">
>
> > <input type="hidden" name="signature"
> > value="52371786cafc8b58d140bb03ae5a1210">
> > <input type="hidden" name="loadTime" value="1203546696130">
> > <input type="hidden" name="sessionId" value="U8dQaywwUaYMMuwsl8h4WsX">
>
> > <input type="submit" name="maintFeeAction"
> > value="Retrieve Fees to Pay">
> > <input type="submit" name="maintFeeAction"
> > value="Get Bibliographic Data">
> > <input type="submit" name="maintFeeAction"
> > value="View Payment Windows">
> > <input type="submit" name="maintFeeAction" value="View Statement">
>
> > for Payment Window:
> > <select name="maintFeeYear">
> >      <option value="04" selected="selected">04</option>
> >      <option value="08">08</option>
> >      <option value="12">12</option>
> > </select>
>
> > </form>
> > ----------------
>
> > First notice that there is a <select> tag at the bottom that contains
> > some information that would be included in the request if you filled
> > out the form by hand and clicked on the submit button.  As a result,
> > the name/value pair of that <select> tag needs to be included in your
> > request.  That requires that you add the following data to your
> > request:
>
> > 'maintFeeYear':'04'       #...or whatever you want the value to be
>
> > Also notice that there are 'hidden' form fields in the form.  They
> > look like this:
>
> > <input type='hidden' ....>
>
> > A hidden form field is not visible on a web page, but just the same
> > its name/value pair gets sent to the server when the user submits the
> > form.  As a result, you need to include the name/value pairs of the
> > hidden form fields in your request.  It so happens that one of the
> > hidden form fields is named 'sessionId'.  That id identifies you as
> > you move from page to page.  If you click on a link or a button on a
> > page, a request is sent out for another page, and if the request does
> > not contain that sessionId, then the request is rejected.
>
> > What that means is: you cannot submit a request directly for the page
> > you want.  First, you have to send out a request for the page with the
> > form on it and then extract some information from it.  What you need
> > to do is:
>
> > 1) Request the form page.
>
> > 2) Extract the name/value pairs in the hidden form fields on the form
> > page. BeautifulSoup is good for doing things like that.  You need to
> > add those name/value pairs to the dictionary containing the patent
> > number and the application number.
>
> > 3) The url in the action attribute of the form looks like this:
>
> > action="/eram/getMaintFeesInfo.do;jsessionid=0000U8dQaywwUaYMMuwsl8h4WsX:11g0uehq7"
>
> > Note how there is a 'jsessionid' on the end.  What that means is: the
> > url in the action attribute changes every time you go to the form
> > page.  As a consequence, you cannot know that url beforehand.  Because
> > the information you want is at the url listed in the action attribute,
> > you have to extract that url from the form page as well.  Once again,
> > BeautifulSoup makes that easy to do.
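
Where BeautifulSoup is not available, steps 2) and 3) can also be sketched
with the standard library.  The following assumes Python 3; FormScraper is
a hypothetical helper name, and the HTML is a shortened copy of the form
quoted above rather than a live page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormScraper(HTMLParser):
    """Hypothetical helper: collects the action url and the hidden
    fields of the form with the given name."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.in_form = False
        self.action = None
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("name") == self.form_name:
            self.in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self.in_form and attrs.get("type") == "hidden":
            name = attrs.get("name")
            if name is not None:
                self.hidden[name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


# Shortened copy of the form quoted earlier in the thread
SAMPLE_HTML = """
<form name="mfInputForm" method="post" action="/eram/getMaintFeesInfo.do;jsessionid=0000U8dQaywwUaYMMuwsl8h4WsX:11g0uehq7">
<input type="text" name="patentNum" maxlength="7" size="7" value="">
<input type="hidden" name="signature" value="52371786cafc8b58d140bb03ae5a1210">
<input type="hidden" name="loadTime" value="1203546696130">
<input type="hidden" name="sessionId" value="U8dQaywwUaYMMuwsl8h4WsX">
<input type="submit" name="maintFeeAction" value="Get Bibliographic Data">
</form>
"""

scraper = FormScraper("mfInputForm")
scraper.feed(SAMPLE_HTML)

# Start with the fields a user would fill in, then add the hidden ones
form_data = {
    "patentNum": "6725879",
    "applicationNum": "10102919",
    "maintFeeYear": "04",
    "maintFeeAction": "Get Bibliographic Data",
}
form_data.update(scraper.hidden)

# Resolve the relative action url against the form page's url
next_page_url = urljoin("https://ramps.uspto.gov/eram/patentMaintFees.do",
                        scraper.action)
print(next_page_url)
print(sorted(scraper.hidden))
```

In real use the HTML would come from the response to the first request,
so the session values are fresh each time.
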
>
> > Once you have 1) all the data that is required, and 2) the proper url
> > to send your request to, then you can send out your request.  Here is
> > an example:
>
> > import urllib
> > import BeautifulSoup as bs
>
> > #get the form page:
>
> > response1 = urllib.urlopen(
> >     'https://ramps.uspto.gov/eram/patentMaintFees.do')
>
> > #extract the url from the action attribute:
>
> > html_doc = bs.BeautifulSoup(response1.read())
> > form = html_doc.find('form', attrs={'name':'mfInputForm'})
> > action_attr_url = form['action']
> > next_page_url = 'https://ramps.uspto.gov'+ action_attr_url
>
> > #create a dictionary for the data:
>
> > form_data = {
> >     'patentNum':'6725879',
> >     'applicationNum':'10102919',
> >     'maintFeeYear': '04',  #<select> name/value
> >     'maintFeeAction':'Get Bibliographic Data',  #submit button name/value
> > }
>
> > #extract the data contained in the hidden form fields
> > #hidden form fields look like this: <input type='hidden' ...>
>
> > hidden_tags = form.findAll('input', type='hidden')
> > for tag in hidden_tags:
> >     name = tag['name']
> >     value = tag['value']
> >     print name, value   #if you want to see what's going on
>
> >     form_data[name] = value  #add the data to our dictionary
>
> > #format the data and send out the request:
>
> > enc_data = urllib.urlencode(form_data)
> > response2 = urllib.urlopen(next_page_url, enc_data)
>
> > print response2.read()
> > response2.close()
>
> Throw in a response1.close() here:
>
> > response1 = urllib.urlopen(
> >     'https://ramps.uspto.gov/eram/patentMaintFees.do')
>
> > #extract the url from the action attribute:
>
> > html_doc = bs.BeautifulSoup(response1.read())
> > response1.close()

Thanks a million.  This worked for me.


