[Tutor] memory error

Danny Yoo dyoo at hashcollision.org
Wed Jul 1 01:27:27 CEST 2015


On Tue, Jun 30, 2015 at 8:10 AM, Joshua Valdez <jdv12 at case.edu> wrote:
> So I wrote this script to go over a large wiki XML dump and pull out the
> pages I want. However, every time I run it the kernel displays 'Killed' I'm
> assuming this is a memory issue after reading around but I'm not sure where
> the memory problem is in my script and if there were any tricks to reduce
> the virtual memory usage.

Yes.  Unfortunately, this is a common problem when a potentially
large stream of data is represented as a single XML document.  The
straightforward approach of reading the whole XML file into memory
at once stops working when files get large.

We can work around this by using a parser that knows how to
progressively read chunks of the document in a streaming or "pulling"
fashion.  I don't think Beautiful Soup can do this, but if you're
working with XML, there are other libraries with a similar feel that
can work in a streaming way.

There was a thread about this roughly a year ago with good
references, the "XML Parsing from XML" thread:

    https://mail.python.org/pipermail/tutor/2014-May/101227.html

Stefan Behnel's contribution to that thread is probably the most
helpful in seeing example code:

    https://mail.python.org/pipermail/tutor/2014-May/101270.html

I think you'll probably want to use xml.etree.cElementTree; I expect
the code for your situation will look something like this (untested,
though!):

###############################
from xml.etree.cElementTree import iterparse, tostring

## ... later in your code, something like this...

doc = iterparse(wiki)
for _, node in doc:
    if node.tag == "page":
        title = node.find("title").text
        if title in page_titles:
            print(tostring(node))
        # Drop the element's children once we're done with it,
        # so memory doesn't grow with the size of the dump.
        node.clear()
###############################
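One caveat with iterparse: even after node.clear(), each finished
<page> element stays attached to the document root, so a tree of
emptied page elements still accumulates.  One way to keep memory
flat is to also ask for "start" events, keep a handle on the root,
and clear it as you go.  Here's a rough sketch of that idea (the
tiny inline document and the page_titles set are just stand-ins for
illustration; also untested against a real dump!):

```python
import io
from xml.etree.ElementTree import iterparse, tostring

# A miniature stand-in for the real wiki dump file.
sample = io.BytesIO(b"""<wiki>
  <page><title>Alpha</title><text>first</text></page>
  <page><title>Beta</title><text>second</text></page>
</wiki>""")

page_titles = {"Beta"}
matched = []

# Asking for "start" events too lets us grab the root element;
# clearing the root discards finished pages that would otherwise
# stay attached to it.
context = iterparse(sample, events=("start", "end"))
_, root = next(context)  # first event is the start of the root

for event, node in context:
    if event == "end" and node.tag == "page":
        title = node.find("title").text
        if title in page_titles:
            matched.append(tostring(node))
        root.clear()  # drop the finished page from the tree

print(matched)
```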


Also, don't use "del" unless you know what you're doing.  It's not
particularly helpful in this scenario, and it's cluttering up the
code.
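For what it's worth, "del" only removes a name binding; the object
itself is freed only once its last reference disappears, which is
why sprinkling "del" through a script rarely reduces memory use.  A
quick illustration:

```python
data = [0] * 1000
alias = data       # a second reference to the same list
del data           # removes the *name* 'data', not the object
print(len(alias))  # the list is still alive via 'alias'
```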


Let us know if this works out or if you're having difficulty, and I'm
sure folks would be happy to help out.


Good luck!

