RDFXML Parser For "qualified Dublin Core" Database File
Steve Holden
steve at holdenweb.com
Wed May 30 02:29:45 EDT 2007
Brandon McGinty wrote:
> Hi All,
>
> My goal is to be able to read the www.gutenberg.org RDF catalog,
> parse it into a Python structure, and pull out data for each record.
>
> The catalog is a Dublin core RDF/XML catalog, divided into sections for
> each book and details for that book.
>
> I have done a very large amount of research on this problem.
>
> I’ve tried tools such as pyrple, sax/dom/minidom, and some others both
> standard and nonstandard to a python installation.
>
> None of the tools has been able to read this file successfully, and
> those that can even see the data can take up to half an hour to load
> with 2 gb of ram.
>
> So you all know what I’m talking about, the file is located at:
>
> http://www.gutenberg.org/feeds/catalog.rdf.bz2
>
> Does anyone have suggestions for a parser or converter, so I’d be able
> to view this file, and extract data?
>
> Any help is appreciated.
>
Well, have you tried xml.etree.cElementTree, a part of the standard
library since Python 2.5? Well worth a go, as it seems to outperform many XML
libraries.
The iterparse function is your best bet, allowing you to iterate over
the events as you parse the source, thus avoiding the need to build a
huge in-memory data structure just to get the parsing done.
The following program took about four minutes to run on my not-terribly
up-to-date Windows laptop with 1.5 GB of memory with the pure Python
version of ElementTree:
import xml.etree.ElementTree as ET

# iterparse yields (event, element) pairs while the file is being read
events = ET.iterparse(open("catalog.rdf", "rb"))
count = 0
for e in events:
    count += 1
    if count % 100000 == 0:
        print(count)
print("%d total events" % count)
Here's some example output after I changed to using the extension
module. (By default, only the end-element events are reported.) I think
you'll be impressed by the timing. The only change was to the import
statement, which now reads
import xml.etree.cElementTree as ET
sholden@bigboy ~/Projects/Python
$ time python test19.py
100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
1100000
1200000
1300000
1400000
1469971 total events
real 0m11.145s
user 0m10.124s
sys 0m0.580s
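The counting loop above can be extended to pull data out of each record as it is completed. What follows is only a sketch: the record tag (pgterms:etext), the title tag (dc:title), their namespace URIs and the helper name iter_titles are all assumptions for illustration, so check them against the real catalog. The small in-memory SAMPLE stands in for catalog.rdf.

```python
import io
import xml.etree.ElementTree as ET

# Assumed namespaces and tag names; verify them against the real catalog.
DC = "{http://purl.org/dc/elements/1.1/}"
PG = "{http://www.gutenberg.org/rdfterms/}"


def iter_titles(source, record_tag=PG + "etext", title_tag=DC + "title"):
    """Yield the title text of each record, clearing the record afterwards."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            title = elem.find(title_tag)
            yield title.text if title is not None else None
            # Discard the record's children and text once it has been used,
            # so finished records do not pile up in memory.
            elem.clear()


# Tiny stand-in for the real file (hypothetical data).
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:pgterms="http://www.gutenberg.org/rdfterms/">
  <pgterms:etext rdf:ID="etext1"><dc:title>First Book</dc:title></pgterms:etext>
  <pgterms:etext rdf:ID="etext2"><dc:title>Second Book</dc:title></pgterms:etext>
</rdf:RDF>
"""

titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)
```

Since iterparse accepts any file object, it should also be possible to pass bz2.BZ2File("catalog.rdf.bz2") as the source and skip decompressing the download first, though I have not timed that against the real file.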
Good luck!
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden