Wikipedia XML Dump

Rustom Mody rustompmody at gmail.com
Tue Jan 28 20:52:48 EST 2014


On Wednesday, January 29, 2014 4:17:47 AM UTC+5:30, Burak Arslan wrote:
> hi,

> On 01/29/14 00:31, Kevin Glover wrote:
> > Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?

> in that case, http://lxml.de/tutorial.html#event-driven-parsing seems to
> be your only option.

Further thoughts? Just a combo of what Burak and Skip said:
I'd explore a thin veneer of event-driven lxml to get from the 40 GB monolithic XML
to something (more) digestible to nltk, e.g. one page's text at a time.
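
A minimal sketch of such a veneer, under some loud assumptions: it uses the standard library's xml.etree.ElementTree.iterparse as a stand-in for lxml (lxml.etree.iterparse takes the same source and events arguments), it assumes the dump is laid out as page elements holding a title and a revision/text, and iter_pages and local_name are hypothetical helper names, not part of lxml or nltk.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


def local_name(tag):
    """Strip a namespace prefix: '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page>, one at a time.

    Assumes <page> elements are direct children of the root element.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == "page":
            title, text = None, ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text or ""
            yield title, text
            root.clear()  # discard finished pages so memory stays bounded


# Toy input; the namespace URI and element layout are illustrative assumptions.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>A</title><revision><text>alpha text</text></revision></page>
  <page><title>B</title><revision><text>beta text</text></revision></page>
</mediawiki>"""

for title, text in iter_pages(io.BytesIO(sample)):
    print(title, len(text.split()))  # hand `text` to nltk here instead
```

For the real dump, pass the file path (or an open binary file) instead of the BytesIO object. The root.clear() call is what keeps a 40 GB input from accumulating in memory; without it the parsed tree grows with every page.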


