Difficulties POSTing to RDP Hierarchy Browse Page

Chris Lasher chris.lasher at gmail.com
Fri Dec 3 12:05:15 EST 2004


Hello,
  I'm trying to write a tool to scrape some of the Ribosomal Database
Project II's (http://rdp.cme.msu.edu/) pages, specifically the
Hierarchy Browser (http://rdp.cme.msu.edu/hierarchy/).
  The Hierarchy Browser is first accessed through a page with a form.
The form has four groups of radio buttons (Strain, Source, Size, and
Taxonomy) and a submit button labeled "Browse". The HTML of the form
is as follows (note that I am also including the Javascript code, as
it is called by the submit button):

--------excerpted HTML----------------
<script language="Javascript">

function resetHiddenVar(){
    var f_form = document.forms['hierarchyForm'];     
    f_form.action= "HierarchyControllerServlet/start";    
    return ;
}

</script>

<form name="hierarchyForm" method="POST"
action="HierarchyControllerServlet/start/">
<input type='hidden' name='printParams' value='no' /> 

<h1>Hierarchy Browser - Start</h1><div class="cart" style="float:
right">[ <a href="hb_help.jsp">help</a> ]</div>

<p> </p>
<div id="options">
 
    
  
  <table summary="options area" cellpadding="0" cellspacing="0"
border="0"><tr><td align="left" valign="middle">
<table  border="0" cellspacing="0" cellpadding="0" summary="Options"
align="left" class="borderup">
<tr>
<th align="right" valign="middle" class="bottom greenbg"
nowrap="nowrap">Strain:</th>
<td class="bottom formtext" nowrap="nowrap"><input id="type"
name="strain" type="radio" value="type">
 <label for="type">Type</label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="nontype"
name="strain" type="radio" value="nontype">
<label for="nontype">Non Type</label> </td>
<td class="bottom formtext" nowrap="nowrap"><input name="strain"
type="radio" id="strainboth" value="both" checked>
<label for="strainboth">Both</label> </td>

</tr>
<tr>
<th align="right" valign="middle" class="bottom greenbg">Source:</th>
<td class="bottom formtext" nowrap="nowrap"><input id="environmental"
name="source" type="radio" value="environ">
 <label for="environmental">Uncultured </label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="isolates"
name="source" type="radio" value="isolates">
 <label for="isolates">Isolates</label></td>
<td class="bottom formtext" nowrap="nowrap"><input name="source"
type="radio" id="sourceboth" value="both" checked >
 <label for="sourceboth">Both</label></td>
</tr>

<tr>
<th align="right" valign="middle" class="bottom greenbg">Size:</th>
<td class="bottom formtext" nowrap="nowrap"><input
id="greaterthan1200" name="size" type="radio" value="gt1200" checked>
 <label for="greaterthan1200"><u>></u>1200</label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="lessthan1200"
name="size" type="radio" value="lt1200">
 <label for="lessthan1200"><1200</label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="sizeboth"
name="size" type="radio" value="both">
 <label for="sizeboth">Both</label></td>
</tr>
<tr>

<th align="right" valign="middle" class="bottom
greenbg">Taxonomy:</th>
<td class="bottom formtext" nowrap="nowrap"><input id="bergeys"
name="taxonomy" type="radio" value="rdpHome" checked>
 <label for="bergeys">Bergey's</label></td>
<td colspan="2" class="bottom formtext" nowrap="nowrap"><input
id="ncbi" name="taxonomy" type="radio" value="ncbiHome">
 <label for="ncbi">NCBI</label></td>
</tr>
  </table>
</td>
<td align="left" valign="middle">   
  

        <input name="browse" type="submit" id="browse"
onclick="resetHiddenVar(); return true;" value="Browse">



</td></tr></table></p>
</div>
<!-- end options -->
</form>
----------end excerpted HTML--------------
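
To double-check which name/value pairs a browser would send for this
form, the inputs can be listed with the standard-library HTML parser.
This is only a minimal sketch: SAMPLE is a trimmed copy of a few inputs
from the excerpt above, and FormFieldLister and defaults are names made
up for illustration. The import fallback is there so the same lines run
on either Python 2 or Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class FormFieldLister(HTMLParser):
    """Collect (name, value, type, checked) for every named <input> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> also end up here.
        if tag != "input":
            return
        attrs = dict(attrs)
        if "name" in attrs:
            self.fields.append((attrs["name"], attrs.get("value"),
                                attrs.get("type"), "checked" in attrs))


# Trimmed copy of the form above (hypothetical sample, not the full page).
SAMPLE = """
<form name="hierarchyForm" method="POST"
action="HierarchyControllerServlet/start/">
<input type='hidden' name='printParams' value='no' />
<input id="type" name="strain" type="radio" value="type">
<input name="strain" type="radio" id="strainboth" value="both" checked>
<input id="bergeys" name="taxonomy" type="radio" value="rdpHome" checked>
<input id="ncbi" name="taxonomy" type="radio" value="ncbiHome">
<input name="browse" type="submit" id="browse" value="Browse">
</form>
"""

lister = FormFieldLister()
lister.feed(SAMPLE)

# Radio buttons are only submitted when checked; hidden and submit
# inputs are kept as they are.
defaults = [(name, value) for name, value, kind, checked in lister.fields
            if kind != "radio" or checked]
```

Note that what ends up in the list is each input's value attribute,
not its id, and that the hidden printParams field is included.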


The options I would like to simulate are browsing by strain = type,
source = both, size = gt1200, and taxonomy = bergeys. I see that the
form method is POST, and from the urllib documentation the syntax for
POSTing is urllib.urlopen(url, data). Since the submit button's
onclick handler sets the form action to
HierarchyControllerServlet/start (see the Javascript), I figure that
the URL I should be contacting is
http://rdp.cme.msu.edu/hierarchy/HierarchyControllerServlet/start
Thus, I came up with the following test code:

--------Python test code---------------
#!/usr/bin/python

import urllib

options = [("strain", "type"), ("source", "both"),
           ("size", "gt1200"), ("taxonomy", "bergeys"),
           ("browse", "Browse")]

params = urllib.urlencode(options)

rdpbrowsepage = urllib.urlopen(
    "http://rdp.cme.msu.edu/hierarchy/HierarchyControllerServlet/start",
    params)

pagehtml = rdpbrowsepage.read()

print pagehtml
---------end Python test code----------
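
One thing I am not sure about is whether the server expects the value
attributes from the form rather than the input ids (the Bergey's radio
button has id "bergeys" but value "rdpHome"), and whether the hidden
printParams field has to be sent as well. A sketch of the URL and POST
body under those assumptions, built without contacting the server, is
below. The names form_page, action, form_values, url, and body are only
for illustration.

```python
try:  # Python 2
    from urllib import urlencode
    from urlparse import urljoin
except ImportError:  # Python 3
    from urllib.parse import urlencode, urljoin

# Page that holds the form, and the action set by resetHiddenVar().
form_page = "http://rdp.cme.msu.edu/hierarchy/"
action = "HierarchyControllerServlet/start"

# Resolve the relative action against the form page, as a browser would.
url = urljoin(form_page, action)

# Assumption: the server wants the value attributes from the HTML,
# including the hidden field, rather than the input ids.
form_values = [
    ("printParams", "no"),
    ("strain", "type"),
    ("source", "both"),
    ("size", "gt1200"),
    ("taxonomy", "rdpHome"),
    ("browse", "Browse"),
]

body = urlencode(form_values)
```

The resolved url matches the one used in the test code above, so if
anything is wrong here it would be in the body rather than the address.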


However, the page that is returned is an error page that says the
request could not be completed. The correct page should show various
bacterial taxonomies, which are clickable to reveal greater detail of
that particular taxon.

I'm a bit stumped, and admittedly I am in over my head on the subject
of networking and web clients. Perhaps I should be using the httplib
module to connect to the RDP instead, but I am unsure which methods I
would need. This is complicated by the fact that these are
JSP-generated pages, and I'm unsure what exactly the server requires
before giving up the desired page. For instance, a jsessionid is
given, and I'm unsure whether it is required to access pages and, if
it is, how to include it in POST requests.
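
If the jsessionid turns out to be delivered as a cookie (an assumption;
it may instead only be embedded in URLs, in which case this would not
help), one possible approach is to fetch the form page first with a
cookie-aware opener and then POST with the same opener. The sketch
below uses cookielib and urllib2 on Python 2.4+ (http.cookiejar and
urllib.request on Python 3). make_session, build_post, and browse are
hypothetical helper names, and nothing is sent until browse() is called.

```python
try:  # Python 3
    from http.cookiejar import CookieJar
    from urllib.parse import urlencode
    from urllib.request import HTTPCookieProcessor, Request, build_opener
except ImportError:  # Python 2.4+
    from cookielib import CookieJar
    from urllib import urlencode
    from urllib2 import HTTPCookieProcessor, Request, build_opener

FORM_PAGE = "http://rdp.cme.msu.edu/hierarchy/"
START_URL = FORM_PAGE + "HierarchyControllerServlet/start"


def make_session():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def build_post(url, fields):
    """Build a request whose body is the URL-encoded form fields.

    Supplying data makes the request a POST instead of a GET.
    """
    data = urlencode(fields).encode("ascii")
    return Request(url, data)


def browse(fields):
    """GET the form page (so the server can set any session cookie),
    then POST the form with the same opener and return the HTML."""
    opener, jar = make_session()
    opener.open(FORM_PAGE).read()
    return opener.open(build_post(START_URL, fields)).read()
```

Calling browse() with the list of (name, value) pairs for the form
would then perform both requests in one session.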

If anyone has suggestions, I would greatly appreciate them. If any
more information is needed that I haven't provided, please let me know
and I'll be happy to give what I am able. Thanks very, very much in
advance.

Chris


