[Tutor] memory error

Joshua Valdez jdv12 at case.edu
Wed Jul 1 16:13:08 CEST 2015


Hi Danny,

So I got my code working now, and it looks like this:

NS = '{http://www.mediawiki.org/xml/export-0.10/}'
TAG = NS + 'page'
doc = etree.iterparse(wiki)

for _, node in doc:
    if node.tag == TAG:
        # Build the qualified child tag from the same namespace prefix.
        title = node.find(NS + 'title').text
        if title in page_titles:
            print(etree.tostring(node))
        # Free the memory held by this page once it has been handled.
        node.clear()

It's mostly giving me what I want.  However, it is adding extra formatting (I
believe namespace prefixes and attributes).  I was wondering if there is a way
to strip these out when I print the node with tostring?

Here is an example of the last few lines of my output:

[[Category:Asteroids| ]]
[[Category:Spaceflight]]</ns0:text>
      <ns0:sha1>h4rxxfq37qg30eqegyf4vfvkqn3r142</ns0:sha1>
    </ns0:revision>
  </ns0:page>
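
One possible way to drop the ns0: prefixes is to rewrite each tag before
serializing. The sketch below assumes the standard-library
xml.etree.ElementTree (the import is not shown in the code above, so this is a
guess) and uses a hypothetical helper, strip_namespaces, and a made-up sample
document; neither comes from the real dump.

```python
import xml.etree.ElementTree as etree

NS = "{http://www.mediawiki.org/xml/export-0.10/}"


def strip_namespaces(node):
    """Hypothetical helper: drop the '{uri}' part of every tag, in place."""
    for elem in node.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return node


# Tiny made-up stand-in for one page of the dump.
sample = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    "<page><title>Asteroid</title></page>"
    "</mediawiki>"
)

root = etree.fromstring(sample)
page = root.find(NS + "page")
result = etree.tostring(strip_namespaces(page), encoding="unicode")
print(result)
```

In the loop above, this would mean calling strip_namespaces(node) just before
etree.tostring(node). Other XML libraries may serialize namespaces
differently, so the result should be checked against the real data.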






*Joshua Valdez*
*Computational Linguist : Cognitive Scientist*

(440)-231-0479
jdv12 at case.edu <jdv2 at uw.edu> | jdv2 at uw.edu | joshv at armsandanchors.com
<http://www.linkedin.com/in/valdezjoshua/>

On Wed, Jul 1, 2015 at 1:17 AM, Danny Yoo <dyoo at hashcollision.org> wrote:

> Hi Joshua,
>
>
>
> The issue you're encountering sounds like XML namespace issues.
>
>
> >> So I tried that code snippet you pointed me to and I'm not getting any
> >> output.
>
>
> This is probably because the tag names of the XML are being prefixed
> with namespaces.  This would make the original test for node.tag to be
> too stingy: it wouldn't exactly match the string we want, because
> there's a namespace prefix in front that's making the string mismatch.
>
>
> Try relaxing the condition from:
>
>     if node.tag == "page": ...
>
> to something like:
>
>     if node.tag.endswith("page"): ...
>
>
> This isn't quite technically correct, but we want to confirm whether
> namespaces are the issue that's preventing you from seeing those
> pages.
>
>
> If namespaces are the issue, then read:
>
>     http://effbot.org/zone/element-namespaces.htm
>
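
The prefix mismatch described in the quoted reply can be reproduced with a
small sketch. It assumes the standard-library xml.etree.ElementTree and a
made-up two-element document; the frontpage element is hypothetical and is
only there to show why the endswith test is looser than an exact match.

```python
import io
import xml.etree.ElementTree as etree

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Made-up sample; 'frontpage' is a hypothetical tag that also ends in 'page'.
sample = io.BytesIO(
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>Asteroid</title></page>"
    b"<frontpage><title>Other</title></frontpage>"
    b"</mediawiki>"
)

bare, exact, loose = [], [], []
for _, node in etree.iterparse(sample):
    if node.tag == "page":          # never true: the tag carries the namespace
        bare.append(node.tag)
    if node.tag == NS + "page":     # exact match on the qualified name
        exact.append(node.tag)
    if node.tag.endswith("page"):   # also catches 'frontpage'
        loose.append(node.tag)

print(len(bare), len(exact), len(loose))
```

The exact comparison against the fully qualified name is the stricter test;
endswith is useful mainly as a quick check that namespaces are the cause.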

