Wikipedia XML Dump

kevingloveruk at gmail.com
Tue Jan 28 06:45:32 EST 2014


Hi

I have downloaded and unzipped the XML dump of Wikipedia (40+ GB). I want to use Python and the SAX module (running under Windows 7) to carry out offline phrase searches of Wikipedia and to return the number of hits for each search. Typical search phrases might be "of the dog" and "dog's".
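
To make the idea concrete, here is a minimal sketch of the kind of program described above. It is only an illustration under several assumptions: Python 3 and the standard-library xml.sax module, article bodies stored in <text> elements (as in the MediaWiki export format), and plain case-insensitive substring matching. The names PhraseCounter and count_phrases are made up for this example. Because SAX streams the file, memory use depends on the largest single article rather than on the size of the dump.

```python
import io
import xml.sax


class PhraseCounter(xml.sax.ContentHandler):
    """Count case-insensitive substring hits inside <text> elements."""

    def __init__(self, phrases):
        super().__init__()
        # Phrases are lowercased, so the result keys are lowercase too.
        self.phrases = [p.lower() for p in phrases]
        self.counts = {p: 0 for p in self.phrases}
        self.in_text = False
        self.chunks = []

    def startElement(self, name, attrs):
        # Assumption: article bodies live in <text> elements.
        if name == "text":
            self.in_text = True
            self.chunks = []

    def characters(self, content):
        # SAX may deliver one element's text in several pieces, so buffer them.
        if self.in_text:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "text":
            body = "".join(self.chunks).lower()
            for phrase in self.phrases:
                # str.count counts non-overlapping occurrences.
                self.counts[phrase] += body.count(phrase)
            self.in_text = False
            self.chunks = []


def count_phrases(source, phrases):
    """Stream `source` (a file name or binary file object) and return hit counts."""
    handler = PhraseCounter(phrases)
    xml.sax.parse(source, handler)
    return handler.counts


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real dump, for demonstration only.
    sample = (
        b"<mediawiki><page><title>A</title><revision>"
        b"<text>The bark of the dog. Of the dog's bone.</text>"
        b"</revision></page></mediawiki>"
    )
    print(count_phrases(io.BytesIO(sample), ["of the dog", "dog's"]))
```

For the real dump, the path of the unzipped XML file would be passed to count_phrases in place of the in-memory sample. Note that this counts raw substrings, so "dog's" would also match inside "bulldog's", and wiki markup in the article text is searched along with the prose.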

I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:

Is what I am trying to do actually feasible?

Are there any example programs or code snippets that would help me?

Any advice or guidance would be gratefully received.

Best regards,
Kevin Glover



More information about the Python-list mailing list