Wikipedia XML Dump

alex23 wuwei23 at gmail.com
Tue Jan 28 20:39:48 EST 2014


On 28/01/2014 9:45 PM, kevingloveruk at gmail.com wrote:
> I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".
>
> I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:
>
> Is what I am trying to do actually feasible?

Rather than parsing through 40GB+ of XML every time you need to run a 
search, you should get better performance from an XML database, which 
will let you run queries directly against the XML data once it has 
been imported.

http://basex.org/ is one such db, and comes with a Python API:

http://docs.basex.org/wiki/Clients
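As a rough sketch of what a query through the Python client could 
look like (assumptions: the BaseXClient module shipped with the BaseX 
distribution, a local server on the default port with default 
credentials, the dump imported as a database I'm calling "wiki", and 
the `*:text` wildcard because MediaWiki dumps put elements in a 
namespace):

```python
def phrase_count_query(phrase):
    """Build an XQuery counting how many <text> elements (in any
    namespace) contain the phrase. Doubled quotes escape literal
    quotes inside an XQuery string."""
    escaped = phrase.replace('"', '""')
    return 'count(//*:text[contains(., "%s")])' % escaped


def run_phrase_search(phrase, db="wiki"):
    # Requires BaseXClient.py from the BaseX distribution and a
    # running BaseX server; host/port/credentials are the defaults.
    from BaseXClient import BaseXClient

    session = BaseXClient.Session('localhost', 1984, 'admin', 'admin')
    try:
        session.execute('open %s' % db)
        return session.query(phrase_count_query(phrase)).execute()
    finally:
        session.close()
```

Note this counts pages containing the phrase rather than total hits; 
counting every occurrence would need a slightly fancier query, but the 
point is that the database does the work against its index instead of 
you rescanning 40GB of text.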




More information about the Python-list mailing list