RDFXML Parser For "qualified Dublin Core" Database File

Steve Holden steve at holdenweb.com
Wed May 30 02:29:45 EDT 2007


Brandon McGinty wrote:
> Hi All,
> 
> My goal is to be able to read the www.gutenberg.org 
> <http://www.gutenberg.org/> rdf catalog, parse it into a python 
> structure, and pull out data for each record.
> 
> The catalog is a Dublin core RDF/XML catalog, divided into sections for 
> each book and details for that book.
> 
> I have done a very large amount of research on this problem.
> 
> I’ve tried tools such as pyrple, sax/dom/minidom, and some others both 
> standard and nonstandard to a python installation.
> 
> None of the tools has been able to read this file successfully, and 
> those that can even see the data can take up to half an hour to load 
> with 2 gb of ram.
> 
> So you all know what I’m talking about, the file is located at:
> 
> http://www.gutenberg.org/feeds/catalog.rdf.bz2
> 
> Does anyone have suggestions for a parser or converter, so I’d be able 
> to view this file, and extract data?
> 
> Any help is appreciated.
> 
Well, have you tried xml.etree.cElementTree, a part of the standard 
library since 2.5? Well worth a go, as it seems to outperform many XML 
libraries.

The iterparse function is your best bet, allowing you to iterate over 
the events as you parse the source, thus avoiding the need to build a 
huge in-memory data structure just to get the parsing done.
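
To make "iterating over the events" concrete, here is a minimal sketch of what iterparse hands back: a stream of (event, element) pairs. The tiny sample document and the list_events helper are made up for illustration and have nothing to do with the real catalog.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in document; the tag names are made up for illustration.
SAMPLE = b"<catalog><book><title>Emma</title></book></catalog>"


def list_events(source, events=("start", "end")):
    """Return (event, tag) pairs in the order iterparse reports them."""
    return [(event, elem.tag)
            for event, elem in ET.iterparse(source, events=events)]


if __name__ == "__main__":
    # Show both start and end events for the sample document.
    for event, tag in list_events(io.BytesIO(SAMPLE)):
        print("%-5s %s" % (event, tag))
```

Passing events=("end",) gives the same stream you get when no events argument is supplied, which is usually enough, since an element's text and children are only guaranteed to be complete once its end event arrives.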

Using the pure Python version of ElementTree, the following program took 
about four minutes to run on my not-terribly up-to-date Windows laptop 
with 1.5 GB of memory:

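import xml.etree.ElementTree as ET

# iterparse yields (event, element) pairs as the file is read
events = ET.iterparse(open("catalog.rdf", "rb"))
count = 0
for event, elem in events:
    count += 1
    if count % 100000 == 0:
        print(count)  # progress report every 100,000 events
print("%d total events" % count)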

Here's some example output after I switched to the C extension module. 
(By default, only the end-element events are reported.) I think you'll be 
impressed by the timing. The only change was to the import statement, 
which now reads

import xml.etree.cElementTree as ET

sholden at bigboy ~/Projects/Python
$ time python test19.py
100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
1100000
1200000
1300000
1400000
1469971 total events

real    0m11.145s
user    0m10.124s
sys     0m0.580s
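
Counting events is only a timing test, of course. To pull out data for each record, something along the lines of the sketch below might serve as a starting point. I haven't checked it against the real catalog: the pgterms namespace URI, the "etext" record element, and the idea that dc:title and dc:creator are simple text children are all assumptions, and iter_records is just a name I picked. The sample data is invented so that the sketch runs on its own.

```python
import io
import xml.etree.ElementTree as ET

# RDF and Dublin Core namespace URIs are the standard ones.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
# ASSUMPTION: namespace and element name used for catalog records.
PGTERMS = "http://www.gutenberg.org/rdfterms/"
RECORD_TAG = "{%s}etext" % PGTERMS

# Invented sample data, shaped the way the catalog is assumed to look.
SAMPLE = b"""<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:pgterms="http://www.gutenberg.org/rdfterms/">
  <pgterms:etext rdf:ID="etext1">
    <dc:title>First Title</dc:title>
    <dc:creator>Author One</dc:creator>
  </pgterms:etext>
  <pgterms:etext rdf:ID="etext2">
    <dc:title>Second Title</dc:title>
  </pgterms:etext>
</rdf:RDF>"""


def iter_records(source, record_tag=RECORD_TAG):
    """Yield one dict per record element, emptying each record once read."""
    for event, elem in ET.iterparse(source):  # default: "end" events only
        if elem.tag != record_tag:
            continue
        yield {
            "id": elem.get("{%s}ID" % RDF),
            "title": elem.findtext("{%s}title" % DC),      # None if absent
            "creator": elem.findtext("{%s}creator" % DC),  # None if absent
        }
        # Drop the record's children and text so they can be reclaimed;
        # the emptied record element itself stays attached to the root.
        elem.clear()


if __name__ == "__main__":
    for record in iter_records(io.BytesIO(SAMPLE)):
        print(record)
```

For the real file you would pass open("catalog.rdf", "rb") instead of the in-memory sample. Calling elem.clear() on each finished record is what keeps memory use down: iterparse still builds the tree as it goes, so without the clear you end up holding the whole catalog anyway.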

Good luck!

regards
  Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC/Ltd           http://www.holdenweb.com
Skype: holdenweb      http://del.icio.us/steve.holden



