[Tutor] memory error

Danny Yoo dyoo at hashcollision.org
Thu Jul 2 19:38:53 CEST 2015


On Thu, Jul 2, 2015 at 9:57 AM, Joshua Valdez <jdv12 at case.edu> wrote:
>
> Hi, so I figured out my problem with this code and it's working great, but it's still taking a very long time to process... I was wondering if there was a way to do this with just regular expressions instead of parsing the text with lxml...


Be careful: there are assumptions here that may not be true.

To be clear: regular expressions are not magic.  Just because
something uses regular expressions does not make it fast.  Nor are
regular expressions appropriate for parsing tree-structured content.

For a humorous discussion of this, see:
http://blog.codinghorror.com/parsing-html-the-cthulhu-way/
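
To make this concrete, here is a tiny illustration (the XML snippet
is invented for the example): a regex that expects a tag to sit on
one line silently misses perfectly valid XML, because XML allows
whitespace, including newlines, inside a start tag.

    import re

    # Valid XML: a newline may appear inside the start tag,
    # just before the closing '>'.
    xml = "<page\n><title>Anarchism</title></page>"

    # The line-oriented pattern finds nothing.
    print(re.search(r'<page>', xml))   # prints: None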



> The idea would be to identify a <page> tag and then move to the next line of the file to see if there is a match between the title text and the pages in my pages file.

This makes another assumption about the input that isn't necessarily
true.  Just because you see tags and content on separate lines now
doesn't mean that this won't change in the future.  XML tree structure
does not depend on newlines.  Don't try parsing XML files
line-by-line.
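
Instead, use a streaming XML parser.  Here is a rough sketch with
lxml's iterparse (untested, and note that a real MediaWiki dump
declares an XML namespace, which this ignores; the file name
'wiki.xml' is a stand-in for yours).  It visits each <page> element
as the file streams by and then frees it, so memory stays bounded:

    from lxml import etree

    page_titles = {"Anarchism", "Abrahamic Mythology"}  # your titles

    for event, elem in etree.iterparse("wiki.xml", tag="page"):
        title = elem.findtext("title")
        if title in page_titles:
            print(title)
        # Discard the element (and already-processed siblings) so
        # the in-memory tree never grows with the file.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]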


> So again, my pages are just an array like [Anarchism, Abrahamic Mythology, ...].  I'm a little confused as to how to even start this.  My initial idea was something like the following, but I'm not sure how to execute it:
> wiki --> XML file
> page_titles -> array of strings corresponding to titles
>
> tag = r'(<page>)'
> wiki = wiki.readlines()
>
> for line in wiki:
>   page = re.search(tag, line)
>   if page:
>     ......(I'm not sure what to do here)
>
> Is it possible to look ahead in a loop to discover other lines and then backtrack?
> I think this may be the solution, but again I'm not sure how I would execute such a command structure...


You should probably abandon this line of thinking.  From your
initial problem description, the approach you have now should be
computationally *linear* in the size of the input, so the program
itself may be fine.  The fact that your initial attempt exhausted
your computer's entire memory suggests that your input file is
large (note, too, that wiki.readlines() pulls the entire file into
memory at once, which can do that all by itself).  But how large?
It is much more likely that your program is slow simply because
your input is honking huge.


To test this hypothesis, we need to know more about the input.
How large is your input file?  How many entries are in your
page_titles collection?
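
Both are easy to measure (the path and variable names below are
just stand-ins for yours):

    import os

    # Input size, in bytes.
    print(os.path.getsize("wiki.xml"))

    # Number of titles being matched.
    print(len(page_titles))

    # Aside: if page_titles is a long list, converting it to a set
    # once makes each 'title in page_titles' membership test
    # constant-time instead of a linear scan.
    page_titles = set(page_titles)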

