[Tutor] Retrieving data from a web site

Peter Otten __peter__ at web.de
Mon May 20 09:55:14 CEST 2013


Phil wrote:

> On 19/05/13 18:05, Peter Otten wrote:
>>
>> The original Python 2 code:
>>
>>   $ cat parse.py
>> import urllib2
>> import json
>>
>> url = "http://*********/goldencasket"
>> s = urllib2.urlopen(url).read()
>>
>> s = s.partition("latestResults_productResults")[2].lstrip(" =")
>> s = s.partition(";")[0]
>> data = json.loads(s)
>> lotto = data["GoldLottoSaturday"]
>> print lotto["drawDayDateNumber"]
>> print map(int, lotto["primaryNumbers"])
>> print map(int, lotto["secondaryNumbers"])
>> $ python parse.py
>> Sat 18/May/13, Draw 3321
>> [14, 31, 16, 25, 6, 3]
>> [9, 35]
>>
> 
> It turns out that urllib2 and 3 are both built into python so I didn't

To make it crystal clear: despite the suggestive name, urllib3 is an external
library, separate from urllib2 and not part of the standard library.

urllib2 is in the standard library of Python 2. In Python 3 the standard lib 
has been reorganized, and the functionality of urllib2 needed for the above 
script is now in urllib.request.
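
For illustration, here is a rough sketch of how the download step might look
on Python 3. The helper name fetch_page and the UTF-8 default are assumptions
of mine, not something the site guarantees.

```python
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


def fetch_page(url, encoding="utf-8"):
    """Download url and return its body as text.

    fetch_page is a hypothetical helper; the encoding is an assumption.
    """
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3, not str
    # Decode so that str methods like partition() work on the result.
    return raw.decode(encoding, errors="replace")
```

With that, s = fetch_page(url) would take the place of the
urllib2.urlopen(url).read() line.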

> have to stress over the dependency error. However, I do have an error
> and I'm not completely certain that I understand how the code provided
> by Peter works. The following is the error message:
> 
> Traceback (most recent call last):
>    File "/home/phil/Python/lotto.py", line 10, in <module>
>      data = json.loads(s)
>    File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
>      return _default_decoder.decode(s)
>    File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
>      obj, end = self.raw_decode(s, idx=_w(s, 0).end())
>    File "/usr/lib/python2.7/json/decoder.py", line 383, in raw_decode
>      raise ValueError("No JSON object could be decoded")
> ValueError: No JSON object could be decoded

I've rerun the script and it still works over here. I'm in Germany, though, 
and therefore there's a small chance that I'm being served different data.
What does

import urllib2

url = "http://tatts.com/goldencasket"
s = urllib2.urlopen(url).read()

s = s.partition("latestResults_productResults")[2].lstrip(" =")
print s[:100]
s = s.partition(";")[0]
print s[-100:]

print?
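
One possible explanation, and I can't tell from here whether it applies: if
the name is not in the page at all, partition() returns empty strings, and
json.loads("") fails with exactly that ValueError on Python 2.7. A small
sketch of a check follows; text_after_marker is a hypothetical helper, not
part of the original script.

```python
def text_after_marker(page, marker="latestResults_productResults"):
    """Return the text following marker, or raise if marker is missing.

    Hypothetical helper, not part of the original script.
    """
    before, found, after = page.partition(marker)
    if not found:
        # partition() returns (page, "", "") when the separator is absent,
        # so the original script would hand "" to json.loads().
        raise LookupError("marker %r not found in page" % marker)
    return after.lstrip(" =")


sample = 'var latestResults_productResults = {"a": 1};'
print(text_after_marker(sample))  # {"a": 1};
```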

> I'm not completely certain that I understand how the code provided
> by Peter works.

I've looked into the page source and found something like

<html>
...
<script type="text/javascript">
...
var latestResults_productResults = {"GoldLottoSaturday":
{"drawDayDateNumber":"Sat 18/May/13, Draw 3321","primaryNumbers":
["14","31","16","25","6","3"],"secondaryNumbers":
["9","35"],"hasDividends":true,"jurisdictionPath":"goldencasket","watchdrawUrl":"//media.tatts.com/Lotto/tattersalls/Draw 
videos/TattsLottodraw3321180513.flv"},"OZ7Lotto":{"drawDayDateNumber":"Tue 
14/May/13, Draw 1004","primaryNumbers":
["13","9","1","12","26","22","41"],"secondaryNumbers":
["34","39"],"hasDividends":true,"jurisdictionPath":"goldencasket","watchdrawUrl":"//media.tatts.com/Lotto/tattersalls/Draw 
videos/Super7sOzLottodraw1004140513.flv"},"QLDPowerball":
{"drawDayDateNumber":"Thu 16/May/13, Draw 887","primaryNumbers":
["24","26","31","23","40","4"],"secondaryNumbers":
["13"],"hasDividends":true,"jurisdictionPath":"goldencasket","watchdrawUrl":"//media.tatts.com/Lotto/tattersalls/Draw 
videos/Powerballdraw887160513.flv"},"WednesdayLotto":
{"drawDayDateNumber":"Wed 15/May/13, Draw 3320","primaryNumbers":
["26","3","27","24","17","33"],"secondaryNumbers":
["1","22"],"hasDividends":true,"jurisdictionPath":"goldencasket","watchdrawUrl":"//media.tatts.com/Lotto/goldencasket/Draw 
videos/WednesdayLottodraw3320150513.flv"},"QLDSuper66":
{"drawDayDateNumber":"Sat 18/May/13, Draw 3321","primaryNumbers":
["7","1","7","5","1","6"],"secondaryNumbers":
[],"hasDividends":true,"jurisdictionPath":"goldencasket","watchdrawUrl":""},"QLDPools":
{"drawDayDateNumber":"Sat 18/May/13, Draw 1451","primaryNumbers":
["1","2","24","27","28","35"],"secondaryNumbers":
["36"],"hasDividends":true,"jurisdictionPath":"goldencasket","watchdrawUrl":""}};
...


If all goes well, with 

s = s.partition("latestResults_productResults")[2]

everything before ' = {"GoldLottoSaturday":' is removed (partition() splits at
the first occurrence of the name, and [2] is the part after it). With
s = s.lstrip(" =") the leading "=" and " " characters are removed, too.
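
To see the pieces individually, here is a toy example. The page string is
made up for the demonstration and much shorter than the real one.

```python
# A made-up miniature of the real page source.
page = '<script>var latestResults_productResults = {"x": ["1", "2"]};</script>'

# partition() returns a 3-tuple: text before, the separator, text after.
head, sep, tail = page.partition("latestResults_productResults")
print(repr(head))  # '<script>var '
print(repr(sep))   # 'latestResults_productResults'
print(repr(tail))  # ' = {"x": ["1", "2"]};</script>'

# lstrip(" =") strips any leading run of spaces and "=" characters,
# in any order, not the literal prefix " =".
stripped = tail.lstrip(" =")
print(repr(stripped))  # '{"x": ["1", "2"]};</script>'
```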

s = s.partition(";")[0] keeps only the text before the first ";", which
removes everything after the nested dicts, so now we have

"""{"GoldLottoSaturday":{"drawDayDateNumber":"Sat 18/May/13, Draw 
3321","primaryNumbers":["14","31","16","25","6","3"],"secondaryNumbers":
["9","35"],"hasDividends":true,"jurisdictionPath":"goldencasket","watchdrawUrl":"//media.tatts.com/Lotto/tattersalls/Draw 
videos/TattsLottodraw3321180513.flv"},"OZ7Lotto":{"drawDayDateNumber":"Tue 
14/May/13, Draw 1004","primaryNumbers":
["13","9","1","12","26","22","41"],"secondaryNumbers":
["34","39"],"hasDividends":true,"jurisdictionPath":"goldencasket","watchdrawUrl":"//media.tatts.com/Lotto/tattersalls/Draw 
videos/Super7sOzLottodraw1004140513.flv"},"QLDPowerball":
{"drawDayDateNumber":"Thu 16/May/13, Draw 887","primaryNumbers":
["24","26","31","23","40","4"],"secondaryNumbers":
["13"],"hasDividends":true,"jurisdictionPath":"goldencasket","watchdrawUrl":"//media.tatts.com/Lotto/tattersalls/Draw 
videos/Powerballdraw887160513.flv"},"WednesdayLotto":
{"drawDayDateNumber":"Wed 15/May/13, Draw 3320","primaryNumbers":
["26","3","27","24","17","33"],"secondaryNumbers":
["1","22"],"hasDividends":true,"jurisdictionPath":"goldencasket","watchdrawUrl":"//media.tatts.com/Lotto/goldencasket/Draw 
videos/WednesdayLottodraw3320150513.flv"},"QLDSuper66":
{"drawDayDateNumber":"Sat 18/May/13, Draw 3321","primaryNumbers":
["7","1","7","5","1","6"],"secondaryNumbers":
[],"hasDividends":true,"jurisdictionPath":"goldencasket","watchdrawUrl":""},"QLDPools":
{"drawDayDateNumber":"Sat 18/May/13, Draw 1451","primaryNumbers":
["1","2","24","27","28","35"],"secondaryNumbers":
["36"],"hasDividends":true,"jurisdictionPath":"goldencasket","watchdrawUrl":""}}"""

a string describing nested JavaScript objects (dicts) and arrays (lists). With

data = json.loads(s)

this string is converted into a data structure of nested Python dicts and
lists. I've warned you before that this is brittle. For example, if one of
the strings in the data structure contains a ";", s = s.partition(";")[0]
will preserve only part of the structure, and json.loads() will fail.
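
One way to sidestep the ";" problem might be to let the json module find the
end of the object itself. This is only a sketch: extract_json_after is a
hypothetical helper, the sample page is invented, and it assumes the value
after the name starts with "{".

```python
import json


def extract_json_after(page, marker):
    """Parse the first JSON object that follows marker in page.

    Hypothetical helper; assumes the value after the marker starts with "{".
    """
    start = page.index("{", page.index(marker))
    # raw_decode() parses one JSON value beginning at start and returns it
    # together with the index where it ended; text after that is ignored.
    obj, end = json.JSONDecoder().raw_decode(page, start)
    return obj


page = 'var results = {"note": "a;b", "primaryNumbers": ["14", "31"]}; more();'
data = extract_json_after(page, "results")
print(data["note"])                              # a;b
print([int(n) for n in data["primaryNumbers"]])  # [14, 31]
```

Here the ";" inside "a;b" does no harm because the string is never split on
";". The approach still depends on the variable name staying the same.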


