Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()

Roy Smith roy at panix.com
Mon May 27 16:56:31 EDT 2013


In article <10be5c62-4c58-4b4f-b00a-82d85ee4ef8e at googlegroups.com>,
 Bryan Britten <britten.bryan at gmail.com> wrote:

> If I use the following code:
> 
> <code>
> import urllib
> 
> urlStr = "https://stream.twitter.com/1/statuses/sample.json"
> 
> fileHandle = urllib.urlopen(urlStr)
> 
> twtrText = fileHandle.readlines()
> </code>
> 
> It takes hours (upwards of 6 or 7, if not more) to finish computing the last 
> command.

I'm not surprised!  readlines() reads in the ENTIRE file in one gulp.  
That's a lot of tweets!
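
To make the "one gulp" concrete, here's a sketch of what that snippet
is doing (not tested against the live stream):

import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

# readlines() blocks until end-of-file and returns one big list, so
# every line is sitting in memory before you get to process any of it:
twtrText = urllib.urlopen(urlStr).readlines()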

> With that being said, my question is whether there is a more efficient manner 
> to do this.

In general, when reading a large file, you want to iterate over lines of 
the file and process each one.  Something like:

import json

for line in urllib.urlopen(urlStr):
    twtrDict = json.loads(line)

You still need to download and process all the data, but at least you 
don't need to store it in memory all at once.  There is an assumption 
here that there's exactly one json object per line.  If that's not the 
case, things might get a little more complicated.
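
For instance, if an object can span lines, or several can share one
line, one way to cope (a rough sketch, untested against the real
stream; the chunk size and variable names are my own) is to buffer the
raw data and peel complete objects off the front with
json.JSONDecoder.raw_decode():

import json
import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

decoder = json.JSONDecoder()
response = urllib.urlopen(urlStr)
buf = ""

while True:
    chunk = response.read(4096)   # read a modest chunk at a time
    if not chunk:                 # empty string means end-of-stream
        break
    buf += chunk
    while True:
        buf = buf.lstrip()        # skip whitespace between objects
        if not buf:
            break
        try:
            twtrDict, end = decoder.raw_decode(buf)
        except ValueError:        # incomplete object; go read more
            break
        buf = buf[end:]
        # do something with twtrDict here

raw_decode() returns the decoded object plus the index where decoding
stopped, which is what lets you pull one complete object at a time off
the front of the buffer.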


