Accessing a Web server --- how?

Dave Angel davea at ieee.org
Tue Nov 17 10:02:13 EST 2009



Virgil Stokes wrote:
> If one goes to the following URL:
> http://www.nordea.se/Privat/Spara%2boch%2bplacera/Strukturerade%2bprodukter/Aktieobligation%2bNr%2b99%2bEuropa%2bAlfa/973822.html 
>
>
> it contains a link (click on "Current courses NBD AT99 3113A") to:
> http://service.nordea.com/nordea-openpages/six.action?target=/nordea.public/bond/nordeabond.page&magic=%28cc+%28detail+%28tsid+310746%29%29%29& 
>
>
> and if you now click on the tab labeled "history and compare" this 
> will take you to:
> http://service.nordea.com/nordea-openpages/six.action?target=/nordea.public/bond/nordeabond.page&magic=%28cc+%28detail+%28tsid+310746%29+%28view+hist%29%29%29& 
>
>
> Finally...This is where I would like to "connect to" the data on a 
> daily basis or to gather data over different time intervals. I believe 
> that if I can get some help on this, then I will be able to customize 
> the code as needed for my own purposes.
>
> It should be clear that this is financial data on a fund managed by 
> Nordea Bank AB. Nordea is one of the largest banks in Scandinavia.
>
> Note, that I do have some experience with Python (2.6 mainly), and 
> find it a very useful and powerful language. However, I have no 
> experience with it in the area of Web services. Any 
> suggestions/comments on how to set up this financial data service 
> project would be greatly appreciated, and I would be glad to share 
> this project with any interested parties.
>
> Note, I posted a similar message to the list pywebsvcs; but, received 
> no responses.
>
> -- V. Stokes
>
>
I still say you should contact the bank and see if they have any API or 
interface defined, so you don't have to do web-scraping.

The following text is on the XHTML page for that last link:

<table class="tableb3" summary="Kurser för en obligation.">
<caption class="hide">Nordea Bank Finland Abp utfärdad av Nordea Bank Finland Abp</caption>
<thead>
<tr>

<th  class="alignleft" scope="col">Börskod</th>
<th  class="alignright" scope="col">Köp</th>
<th  class="alignright" scope="col">Sälj</th>
<th  class="alignright" scope="col">Senast</th>
<th  class="alignright" scope="col">Förfallodag</th>
<th  class="alignright" scope="col">Tid</th>
</tr>
</thead>
<tbody>
<tr>
<td class="alignleft"> NBF AT99 3113A</td>

<td class="nowrap alignright"> 95,69</td>
<td class="nowrap alignright"> 97,69</td>
<td class="nowrap alignright"> 95,69</td>
<td class="alignright"> 2011-06-03</td>
<td class="alignright"> 12:33</td>
</tr>
</tbody>
</table>



I didn't try it, but you could presumably use urllib2 to download that 
URL (probably to a file, so you can repeat the test often without loading 
the server).  One caution: the page did ask to store a cookie, and I know 
nothing about cookie handling in Python.
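
Here is a minimal sketch of that download step. It is written for Python 3, where urllib2's functionality lives in urllib.request and cookie storage in http.cookiejar. The function names and the idea of returning the byte count are my own choices for illustration, and I have not run this against the bank's server, so treat the cookie handling as an assumption about what the site needs.

```python
import http.cookiejar
import urllib.request


def build_opener_with_cookies():
    """Return (opener, jar); the opener keeps cookies in jar and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def download_page(url, path, opener=None):
    """Fetch url once and save the raw bytes to path, for repeated offline tests."""
    if opener is None:
        opener, _ = build_opener_with_cookies()
    with opener.open(url, timeout=30) as response:
        data = response.read()
    with open(path, "wb") as f:
        f.write(data)
    return len(data)
```

Reusing one opener across several requests means a cookie set by an earlier page is sent along with the later ones, which may matter if the site expects you to arrive via the first link.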

Several cautions:  I don't know how target= and magic= were derived, or 
whether they'll remain stable for more than a day or so.  So you can 
download this file and figure how to parse it, but you'll probably need 
to also parse the earlier pages, and that could be easier or harder.
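
One thing that can be checked without touching the server is what those two parameters look like once decoded. The sketch below just URL-decodes the query string from the last link, using Python 3's urllib.parse. It says nothing about how the values are generated or how long they stay valid; that still has to be found out by watching the site.

```python
from urllib.parse import parse_qs, urlsplit

URL = ("http://service.nordea.com/nordea-openpages/six.action"
       "?target=/nordea.public/bond/nordeabond.page"
       "&magic=%28cc+%28detail+%28tsid+310746%29+%28view+hist%29%29%29&")

# parse_qs maps each parameter name to a list of decoded values
params = parse_qs(urlsplit(URL).query)
target = params["target"][0]
magic = params["magic"][0]

print(target)  # /nordea.public/bond/nordeabond.page
print(magic)   # (cc (detail (tsid 310746) (view hist)))
```

So magic is a small parenthesised expression, and the only difference between the last two links is the added (view hist) part. My guess is that the tsid number identifies the bond, but that is only a guess.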

This page format is very straightforward.  If you know you're looking 
for NBF AT99, you could look for that particular line, then just parse 
all the td's till the next /tr.   No XML logic needed.  If you don't 
know the NBF string, you could look for Börskod instead.
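
A rough sketch of that line-by-line approach, for the case where you know the code you're looking for. The function name cells_for_row is made up, and the code assumes exactly the layout shown above (one td per line), which is the fragile part. Searching for Börskod would need an extra step to skip ahead to the following row, which isn't shown here.

```python
def cells_for_row(lines, marker):
    """Return the td texts from the line containing marker up to the next </tr>."""
    cells = []
    collecting = False
    for line in lines:
        if not collecting:
            if marker not in line:
                continue
            collecting = True  # the marker line itself is also examined below
        if "</tr>" in line:
            break
        start = line.find("<td")
        if start != -1:
            open_end = line.find(">", start)
            close = line.find("</td>", open_end)
            if open_end != -1 and close != -1:
                cells.append(line[open_end + 1:close].strip())
    return cells


SAMPLE = """\
<tr>
<td class="alignleft"> NBF AT99 3113A</td>

<td class="nowrap alignright"> 95,69</td>
<td class="nowrap alignright"> 97,69</td>
<td class="nowrap alignright"> 95,69</td>
<td class="alignright"> 2011-06-03</td>
<td class="alignright"> 12:33</td>
</tr>
""".splitlines()

row = cells_for_row(SAMPLE, "NBF AT99")
# The prices use a decimal comma, so swap it before converting
bid = float(row[1].replace(",", "."))
```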

But the big risk you run is the bank could easily change this format quite drastically, at any time.  Those td
elements don't have to be on separate lines, the browser doesn't care.  And the class attribute could change
if the CSS also changes correspondingly.  Or they could come up with an entirely different way to display the
data.  All they care about is whether it's readable by the human looking at the browser page.

Using xml.etree.ElementTree would be a good start; you could parse the page into a tree, look for the table of 
class 'tableb3', and go in from there.
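
As a sketch of the ElementTree route, here is the table from above parsed as a standalone fragment. The function name table_rows is mine. Two assumptions to watch: the real page has to be well-formed XML for this parser to accept it at all, and a full XHTML page normally declares a namespace, in which case the tag names would need the {http://www.w3.org/1999/xhtml} prefix.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<table class="tableb3" summary="Kurser för en obligation.">
<thead><tr>
<th class="alignleft" scope="col">Börskod</th>
<th class="alignright" scope="col">Köp</th>
<th class="alignright" scope="col">Sälj</th>
<th class="alignright" scope="col">Senast</th>
<th class="alignright" scope="col">Förfallodag</th>
<th class="alignright" scope="col">Tid</th>
</tr></thead>
<tbody><tr>
<td class="alignleft"> NBF AT99 3113A</td>
<td class="nowrap alignright"> 95,69</td>
<td class="nowrap alignright"> 97,69</td>
<td class="nowrap alignright"> 95,69</td>
<td class="alignright"> 2011-06-03</td>
<td class="alignright"> 12:33</td>
</tr></tbody>
</table>"""


def table_rows(root, css_class="tableb3"):
    """Return one dict (header text -> cell text) per body row of the matching table."""
    # iter() also yields root itself, so a bare fragment works as well as a page
    for table in root.iter("table"):
        if table.get("class") == css_class:
            headers = [(th.text or "").strip() for th in table.iter("th")]
            rows = []
            for tr in table.findall("./tbody/tr"):
                cells = [(td.text or "").strip() for td in tr.findall("td")]
                rows.append(dict(zip(headers, cells)))
            return rows
    return []


rows = table_rows(ET.fromstring(SNIPPET))
```

Unlike the line-by-line scan, this doesn't care whether the td elements sit on separate lines, but it does still depend on the class name.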

But you still run the risk of them changing things.  The class name, for example, is just a link to the CSS page which
describes how that class object should be displayed.  If the name is changed at both ends, no change occurs, 
except to your script.


At this point, you need to experiment.  But build a sloppy skeleton 
first, so you don't invest too much time in any one aspect of the 
problem.  Make sure you can cover the corner cases, then fill in the 
tough parts.

I'd say roughly this order:
1.  Write code that downloads the page to a file, given an exact URL.  
For now, keep that code separate, as it'll probably end up
being much more complex, walking through other pages.

2.  parse that page, using a simple for loop that looks for some of the 
key strings mentioned above.

3. Repeat that for a few different URLs, presumably one per bond fund.

4. Make sure the URLs don't go stale over a few days.  If they do, 
you'll have to back up to an earlier link (URL), and parse forward from 
there.
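
For steps 3 and 4, a small sketch of a staleness check that can be run every day. Everything here is hypothetical scaffolding: the marker strings are simply the two things the parsing ideas above rely on, and fetch is whatever download function you end up with. It is passed in as a parameter so the check can be tried on saved pages first.

```python
EXPECTED_MARKERS = ('class="tableb3"', "Börskod")


def looks_stale(html):
    """True if the page lacks any of the strings the parser depends on."""
    return not all(marker in html for marker in EXPECTED_MARKERS)


def check_urls(urls, fetch):
    """Map each label in urls to 'ok', 'stale', or an error message.

    fetch is any callable that takes a URL and returns the page text.
    """
    report = {}
    for label, url in urls.items():
        try:
            html = fetch(url)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            report[label] = "error: %s" % exc
            continue
        report[label] = "stale" if looks_stale(html) else "ok"
    return report


# Stand-in pages keyed by fake URLs, so the check runs without the network
fake_pages = {
    "u1": '<table class="tableb3"><th>Börskod</th></table>',
    "u2": "<html>Session expired</html>",
}
report = check_urls({"AT99": "u1", "other": "u2"}, fake_pages.__getitem__)
```

A "stale" result doesn't say why the page changed, only that it's time to look at it by hand.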


Keep the various pieces in different modules, so that when an assumption 
breaks, you can recode that assumption pretty much independent of the 
others.


HTH
DaveA




More information about the Python-list mailing list