[SciPy-User] OT: Literature management, pdf downloader

John washakie at gmail.com
Mon Feb 28 07:25:45 EST 2011


Sorry for an OT post, but I thought this might be a community that
would have interest in the attached script.

For those of you actively conducting research, I imagine you have a
variety of tools for managing PDFs. For anyone on a Mac, I guess it's
'Papers', which seems to be quite brilliant software. On Linux, I've
gone with Mendeley, which I'm very pleased with. For my actual
searching, I rely on Web of Science (ISI).

Here is my process:

1) ISI search for articles, add to 'marked list'
2) export marked list to bibtex
3) download pdf files to which I have access
4) dump them into a 'staging' folder for Mendeley
5) let Mendeley import them into my library (making copies)

This has worked very well, but recently I became frustrated with the
amount of time I spent downloading articles, so I decided to write a
script to do it for me. Attached you'll find a script that uses the
DOI numbers (if present) and essentially accomplishes steps 3 & 4
above. I would eventually like to add this as functionality to
Mendeley, kbibtex, or pybibliographer: you could select some
references in any of those programs and then click a 'download PDFs'
button.

Does this exist at all?!? If so, please let me know.

Okay, assuming it does not: the attached script parses a BibTeX file
to extract the DOI numbers. Articles without a DOI are skipped (SOL).
When a DOI is available, the script queries dx.doi.org to find out
where the article is hosted, then, after some screen scraping, follows
the extracted link to download the PDF into a 'LIBRARY' directory. The
major assumption, of course, is that you have access to the articles
through your network.
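The core of that flow can be sketched in a few lines of Python using
only the standard library. This is my own simplification, not the
attached script's actual code: the function names are made up, and the
regex is a crude stand-in for a real BibTeX parser like pybtex.

```python
import re
import urllib.request

def extract_dois(bibtex_text):
    """Pull DOI values out of a BibTeX string.

    Crude regex stand-in for a real parser: matches fields of the
    form  doi = {10.xxxx/yyy}  or  doi = "10.xxxx/yyy".
    """
    return re.findall(r'doi\s*=\s*[{"]([^}"]+)[}"]', bibtex_text, re.IGNORECASE)

def resolve_doi(doi, timeout=10):
    """Follow the dx.doi.org redirect chain to the publisher's landing page.

    The landing-page URL tells us which journal's screen scraper to use.
    """
    req = urllib.request.Request('https://dx.doi.org/' + doi,
                                 headers={'User-Agent': 'get_publications'})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.geturl()  # final URL after all redirects
```

From the landing page, a per-journal scraper then locates the PDF link
and saves the file under the LIBRARY directory.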

There are some outstanding issues, and in general this email is
reaching out to more experienced programmers for comments on the
following:

1) I need better error handling (at the very least, a timeout).
2) I would like to add authentication handling (perhaps via a config
file in which you provide access credentials for the various
journals).
3) I would like to get rid of the BeautifulSoup and pybtex
dependencies (or learn how to package the script so that easy_install
pulls them in automatically).
4) I need to handle cookies (so far this is a problem only for the
get_acs method).
5) Are my separate per-journal methods the best way to do this?
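Issues 1, 4, and 5 could be attacked together: a cookie-aware opener
with a timeout, plus a hostname-to-scraper dispatch table instead of
hard-wired per-journal calls. A minimal sketch — the scraper body and
the URL rewrite below are hypothetical placeholders, not the real
get_acs logic:

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

def get_acs(opener, url, timeout):
    # Placeholder: guess the PDF URL from the abstract URL. A real
    # method would fetch the page through `opener` (honoring `timeout`)
    # and scrape the PDF link out of the HTML.
    return url.replace('/doi/abs/', '/doi/pdf/')

# Map publisher hostnames to scraper functions; adding support for a
# new journal means adding one entry here.
SCRAPERS = {
    'pubs.acs.org': get_acs,
}

def pdf_link_for(url, opener=None, timeout=10):
    """Dispatch a landing-page URL to the matching journal scraper."""
    if opener is None:
        # HTTPCookieProcessor keeps session cookies across requests,
        # which is what sites like ACS require.
        opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(CookieJar()))
    scraper = SCRAPERS.get(urllib.parse.urlsplit(url).netloc)
    if scraper is None:
        return None  # unknown publisher: skip the article
    return scraper(opener, url, timeout)
```

A dict lookup also answers question 5: one scraper function per
journal, registered in a table, rather than a growing if/elif chain.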

If folks object to my posting this here, please suggest a more
appropriate venue.

If I get positive feedback, I'll post this to a public site with
version control so folks can do their own legwork and add
screen-scraper methods for other journals.

All the best,
john
-------------- next part --------------
A non-text attachment was scrubbed...
Name: get_publications.py
Type: text/x-python
Size: 7057 bytes
Desc: not available
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20110228/90398d53/attachment.py>
