RDFXML Parser For "qualified Dublin Core" Database File

Wed May 30 22:56:02 EDT 2007

Brandon McGinty wrote:
[actually he top-posted, but I have moved his comments down
because he probably doesn't realise it's naughty]
> -----Original Message-----
> From: python-list-bounces+brandon.mcginty=gmail.com at python.org
> [mailto:python-list-bounces+brandon.mcginty=gmail.com at python.org] On Behalf
> Of Steve Holden
> Sent: Tuesday, May 29, 2007 11:30 PM
> To: python-list at python.org
> Subject: Re: RDFXML Parser For "qualified Dublin Core" Database File
> 
> Brandon McGinty wrote:
>> Hi All,
>>
>> My goal is to be able to read the www.gutenberg.org RDF catalog, parse
>> it into a Python structure, and pull out data for each record.
>>
>> The catalog is a Dublin core RDF/XML catalog, divided into sections 
>> for each book and details for that book.
>>
>> I have done a very large amount of research on this problem.
>>
>> I've tried tools such as pyrple, sax/dom/minidom, and some others both 
>> standard and nonstandard to a python installation.
>>
>> None of the tools has been able to read this file successfully, and 
>> those that can even see the data can take up to half an hour to load 
>> with 2 gb of ram.
>>
>> So you all know what I'm talking about, the file is located at:
>>
>> http://www.gutenberg.org/feeds/catalog.rdf.bz2
>>
>> Does anyone have suggestions for a parser or converter, so I'd be able 
>> to view this file, and extract data?
>>
>> Any help is appreciated.
>>
> Well, have you tried xml.etree.cElementTree, a part of the standard library
> since 2.5? Well worth a go, as it seems to outperform many XML libraries.
> 
> The iterparse function is your best bet, allowing you to iterate over the
> events as you parse the source, thus avoiding the need to build a huge
> in-memory data structure just to get the parsing done.
> 
> The following program took about four minutes to run on my not-terribly
> up-to-date Windows laptop with 1.5 GB of memory with the pure Python version
> of ElementTree:
> 
> import xml.etree.ElementTree as ET
> events = ET.iterparse(open("catalog.rdf"))
> count = 0
> for e in events:
>     count += 1
>     if count % 100000 == 0: print count
> print count, "total events"
> 
> Here's an example output after I changed to using the extension module - by
> default, only the end-element events are reported. I think you'll be
> impressed by the timing. The only change was to the import statement, which
> now reads
> 
> import xml.etree.cElementTree as ET
> 
> sholden at bigboy ~/Projects/Python
> $ time python test19.py
> 100000
> 200000
> 300000
> 400000
> 500000
> 600000
> 700000
> 800000
> 900000
> 1000000
> 1100000
> 1200000
> 1300000
> 1400000
> 1469971 total events
> 
> real    0m11.145s
> user    0m10.124s
> sys     0m0.580s
> 
> Good luck!
> ------- End Original Message --------
> Hi,
> Thanks for the info. The speed is fantastic. 58 MB in under 15 sec, just as
> shown.
> I did notice that in this RDF file, I can't do any sort of find or findall.
> I haven't been able to find any documentation on how to search. For
> instance, in Perl, one can do a search for "/rdf:RDF/pgterms:etext", and
> when done in Python, with many, many variations, find and findall return no
> results, either when searching from root or children of root.
>
You should keep the list in the loop, as I am not an ElementTree expert, 
simply someone who knew of its capabilities on the basis of some earlier 
casual use. I am glad it is working well for you. Memory usage is also 
pretty low with the extension module.
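
On the memory point, a common way to keep iterparse lean is to clear each
element once it has been handled. The following is only a rough sketch in
current Python 3 syntax (where xml.etree.cElementTree no longer exists and
plain xml.etree.ElementTree uses the C accelerator automatically). The
sample data, the "record" tag, and the count_records helper are made up for
illustration and are not taken from the Gutenberg catalog.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for catalog.rdf; the element names are invented.
SAMPLE = b"<catalog>" + b"".join(
    b"<record><title>Book %d</title></record>" % i for i in range(1000)
) + b"</catalog>"


def count_records(source, tag="record"):
    """Stream through source, counting `tag` elements and emptying each one."""
    count = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Release the finished record's children, text and attributes.
            # The emptied element itself stays attached to the root.
            elem.clear()
    return count


print(count_records(io.BytesIO(SAMPLE)))
```

For the real file you would pass the open file (or its name) instead of the
BytesIO object and do your per-record work just before the clear() call.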

You may well find that other readers can advise you about the selective 
aspects of processing. I'd have thought find() and findall() would have 
been what you want. The more you explain about your requirements the 
more help you are likely to get.
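
One possible cause worth ruling out is XML namespaces. ElementTree stores
every tag under its full "{namespace-uri}localname" form, so a bare prefix
such as "pgterms:etext" does not match unless a prefix map is supplied. The
sketch below assumes the namespace URIs shown (check them against the xmlns
declarations at the top of the catalog) and uses a tiny made-up document in
roughly the catalog's shape, not real Gutenberg data.

```python
import io
import xml.etree.ElementTree as ET

# Assumed namespace URIs; verify them against the xmlns attributes in the file.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PGTERMS = "http://www.gutenberg.org/rdfterms/"
DC = "http://purl.org/dc/elements/1.1/"

# Invented two-record sample, only loosely modelled on the catalog layout.
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:pgterms="http://www.gutenberg.org/rdfterms/">
  <pgterms:etext rdf:ID="etext1">
    <dc:title>First Sample Title</dc:title>
  </pgterms:etext>
  <pgterms:etext rdf:ID="etext2">
    <dc:title>Second Sample Title</dc:title>
  </pgterms:etext>
</rdf:RDF>
"""

root = ET.parse(io.BytesIO(SAMPLE)).getroot()  # root is the rdf:RDF element

# Search the root's children using the fully qualified tag name.
etexts = root.findall("{%s}etext" % PGTERMS)
titles = [e.findtext("{%s}title" % DC) for e in etexts]

# Attributes are namespace-qualified in the same way.
ids = [e.get("{%s}ID" % RDF) for e in etexts]

# Alternative: keep the prefixes and pass a prefix-to-URI map.
same_etexts = root.findall("pgterms:etext", {"pgterms": PGTERMS})

print(titles)
print(ids)
print(len(same_etexts))
```

Note that the search starts from the rdf:RDF root, so the path names only
the children; there is no need to repeat the root element in the pattern.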

regards
  Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC/Ltd           http://www.holdenweb.com
Skype: holdenweb      http://del.icio.us/steve.holden