[Tutor] Fwd: Re: feedparser in python

Alan Gauld alan.gauld at yahoo.co.uk
Tue Apr 30 11:48:40 EDT 2019


Sharing with the list, comments later. Busy right now.



-------- Forwarded Message --------
Subject: 	Re: [Tutor] feedparser in python
Date: 	Tue, 30 Apr 2019 14:14:35 +0000
From: 	nathan tech <nathan-tech at hotmail.com>
To: 	Alan Gauld <alan.gauld at yahoo.co.uk>



Hi Alan,

Thanks for your emails.

I considered what you said, and came up with a couple of possibilities,
listed below.

Before that, I wanted to clarify what I meant when I said "not working."
I kept meaning to do it, and kept forgetting.

According to the docs:

    f = feedparser.parse(url)

Will download a feed, parse the xml, and return a dict of that feed,
which it does. There are obviously some special things going on with
that, because it allows, for instance, f.entries[0].title, rather than
f["entries"][0]["title"].

Anyway.

The docs then say that feedparser will have elements of etag and
modified, which you can then pass in an update, like so:

    newfeed = feedparser.parse(url, etag=f.etag, modified=f.modified)

To that end, it would check the headers, and if the feed was not
updated, set newfeed.status to 304.


Which is great, except... My feeds never have a .etag or a .modified
anywhere.

Even f.get("etag") returns None, which, while I could pass it that way,
would mean the feed gets downloaded over and over and over again.

In an example of an rss feed of size 10 MB, checked hourly, that's 240 MB
a day, and by 5 days you're over a GIG.
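
For context, the etag/modified check described above comes down to two
HTTP request headers, If-None-Match and If-Modified-Since. A minimal
sketch of building such a request with the standard library, assuming
Python 3 (where urllib2 became urllib.request). The helper names
conditional_request and fetch_if_changed and the sample etag value are
made up for illustration:

```python
import urllib.error
import urllib.request


def conditional_request(url, etag=None, modified=None):
    """Build a GET request that lets the server answer 304 if nothing changed."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if modified:
        req.add_header("If-Modified-Since", modified)
    return req


def fetch_if_changed(req):
    """Return the body, or None when the server replies 304 Not Modified."""
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an HTTPError
            return None
        raise


# Building the request does not touch the network.
# The etag value here is a placeholder, not one the server really sent.
req = conditional_request("https://www.bigfinish.com/podcasts.rss", etag='"abc123"')
print(req.header_items())
```

This only helps if the server sends an ETag or Last-Modified header in
the first place, which is exactly the sticking point here.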


To that end, when I said not working, I meant, nothing I passed in place
of f.etag and/or f.modified seemed to work, in that it just downloaded
the entire feed again.


Now, onto some solutions:

I considered what you said and realised actually, logic says all we need
to know is: is file on local hard drive older than file on web server,
right?

Which led me briefly to, would os.path.getmtime work? Probably not,
but I am curious if there are alternatives to that.
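
One possible alternative, sketched under assumptions: os.path.getmtime
only knows about the local file, so it needs something from the server
to compare against. If the server sends a Last-Modified header (not a
given for these feeds), the two could be compared roughly like this in
Python 3. The function name local_copy_is_stale and the header values
are made up for illustration:

```python
import os
import tempfile
from email.utils import parsedate_to_datetime


def local_copy_is_stale(path, last_modified):
    """True if the local file is older than the server's Last-Modified value."""
    remote_time = parsedate_to_datetime(last_modified).timestamp()
    return os.path.getmtime(path) < remote_time


# Demo with a temporary file whose mtime is forced to 9 Sep 2001 (UTC).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"<rss/>")
os.utime(tmp.name, (1_000_000_000, 1_000_000_000))

stale = local_copy_is_stale(tmp.name, "Tue, 30 Apr 2019 14:14:35 GMT")
fresh = local_copy_is_stale(tmp.name, "Mon, 01 Jan 2001 00:00:00 GMT")
os.remove(tmp.name)
print(stale, fresh)
```

Note that the local mtime records when the file was written to disk, not
when the feed content last changed, so this is only an approximation.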


In any case, finally I thought, what about f.entries

This is a list of entries in an rss feed.

Even without an update key, which they usually have:

    date = f.entries[0].updated  # e.g. "Fri, August 20th 2009"

We could simply do:

    if downloaded_first_entry == f.entries[0]:
        pass  # feed is up to date, so quit.


This is where I got stuck.

urllib2.urlopen() from my calculations, seems to download the file, then
open it?

Is that correct, or is that wrong?

I wrote up this function below:

    import urllib2
    import time

    url = "https://www.bigfinish.com/podcasts.rss"
    start_time = time.time()
    j = urllib2.urlopen(url)
    j.close()  # let's not be messy
    print time.time() - start_time

That came out at 0.8 seconds.

Perhaps that is just network connectivity?

But if we remember back to the tests run with the tim function, the
difference in time there was around 1.1 seconds.

The similarities were.. worrying is all.

If urllib2.urlopen doesn't download the file, and merely opens a link
up, as it were, then great.
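
As far as I understand it, urlopen() returns once the response headers
have arrived, and the body is only pulled in when read() is called, so
the timing above most likely did not include the whole feed. To look at
the headers alone, a HEAD request asks the server to skip the body
entirely. A minimal sketch, assuming Python 3 (urllib.request in place
of urllib2). The helper names head_request and fetch_headers are made up
for illustration:

```python
import urllib.request


def head_request(url):
    """Build a HEAD request: the server sends headers but no body."""
    return urllib.request.Request(url, method="HEAD")


def fetch_headers(url):
    """Return the response headers as a plain dict (needs network access)."""
    with urllib.request.urlopen(head_request(url)) as resp:
        return dict(resp.headers.items())


# Building the request does not touch the network.
req = head_request("https://www.bigfinish.com/podcasts.rss")
print(req.get_method())
```

Calling fetch_headers(url) and looking for ETag, Last-Modified or
Content-Length would show what the server offers. Not every server
handles HEAD well, so that is something to verify per feed.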


My theory here is to:

open the web file,

discard any data up to "<item>"

until "</item>" is reached, save the data to a list.

Convert that list using an xml parser into a dictionary, and then
compare either updated, title, or the whole thing.

If one of them says, this isn't right, download the feed.

If they match, the feed on local drive is up to date.

To be fair, I could clean this up further, and simply have:

until </title> or </updated> is reached save to a list, but that's a
refinement for later.
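
The steps above could be sketched with xml.etree.ElementTree.iterparse,
which parses data as it is read, so the loop can stop at the first
complete <item>. This assumes Python 3 and an RSS 2.0 feed with
un-namespaced <item> elements. The function name first_item, the sample
feed and the saved entry are all made up for illustration:

```python
import io
import xml.etree.ElementTree as ET


def first_item(stream):
    """Parse incrementally and return the first <item> as a dict of tag -> text."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "item":
            return {child.tag: (child.text or "").strip() for child in elem}
    return None  # no <item> found


# Hypothetical feed content standing in for the real download.
SAMPLE = b"""<?xml version="1.0"?>
<rss version="2.0"><channel><title>Demo feed</title>
<item><title>Episode 2</title><pubDate>Thu, 20 Aug 2009 00:00:00 GMT</pubDate></item>
<item><title>Episode 1</title><pubDate>Thu, 13 Aug 2009 00:00:00 GMT</pubDate></item>
</channel></rss>"""

latest = first_item(io.BytesIO(SAMPLE))
# Hypothetical first entry remembered from the previously downloaded copy.
saved = {"title": "Episode 1", "pubDate": "Thu, 13 Aug 2009 00:00:00 GMT"}
up_to_date = (latest == saved)
print(latest["title"], up_to_date)
```

With a real feed, the object returned by urlopen() could be passed in
place of io.BytesIO. iterparse reads it in blocks, so only the start of
the feed should need to be transferred, though how much is saved depends
on buffering and on where the first item sits in the file.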


I'm looking forward to hearing your thoughts on this.

I picked up python myself over the course of a year, so am not quite
used to having a back and forth like this yet. Especially not with
someone who knows what they're talking about. :)

Thanks

Nate


On 30/04/2019 08:47, Alan Gauld via Tutor wrote:
> On 30/04/2019 00:23, nathan tech wrote:
>
>> The results were as follows:
>>
>>     tim(a url): 2.9 seconds
>>
>>     tim(the downloaded file): 1.8 seconds
>>
>>
>> That tells me that roughly 1.1 seconds is network related, fair enough.
> Or just under 40% of the time.
> Since the network element will increase as data
> size increases as will the parse time it may be
> a near linear relationship. Only more extensive
> tests would tell.
>
>> entire thing again, they all say use ETAG and Modified, but my feeds
>> never have them.
>>
>> I've tried feeds from several sources, and none have them in the http
>> header.
> Have you looked at the headers to see what they do have?
>
>> To that end, that is why I mentioned in the previous email about .date,
>> because that seemed the most likely, but even that failed.
> Again you tell us that something failed. But don't say
> how it failed. Do you mean that date did not exist?
> Why did you think it would if you had already inspected
> the headers?
>
> Can you share some actual code that you used to check
> these fields? And show us the actual headers you are
> reading?
>
>> 1. Download a feed to the computer.
>>
>> 2. Occasionally, check the website to see if the downloaded feed is out
>> of date; if it is, redownload it.
> Seems a good plan. You just need to identify when changes occur.
>
> Even better would be if the sites provided a web API to access
> the data programmatically, but of course few sites do that...
>
>
>> I did think about using threading for this, for example:
>> user sees downloaded feed data only, in the background, the program
>> checks for updates on each feed, and the user may see them gradually
>> start to update.
>>
>> This would work, in that execution would not fail at any time, but it
>> seems... clunky, to me I suppose? And rather data heavy for the end
>> user, especially if, as you suggest, a feed is 10 MB in size.
> Only data heavy if you download everything. If you only do the
> headers and you only have relatively few feeds, it's a good scheme.
>
> As an alternative is there anything in the feed body that identifies
> its creation date? Could you change your parsing mechanism to
> parse the data as it arrives and stop if the date/time has not
> changed? That minimises the download data.
>
>> Further to that, how many threads are safe?
> You have a lot of I/O going on so you could run quite a few threads
> without blocking issues. How many feeds do you watch? Logic
> would say have one thread per feed.
>
> But how real time does this really need to be? Would it be
> terrible if updates were, say 1 minute late? If that's the case
> a single threaded solution may be fine. (and much simpler)
> I'd certainly focus on a single threaded solution initially. Get it
> working first then think about performance tuning.
>
>

