[Tutor] memory error
Joshua Valdez
jdv12 at case.edu
Wed Jul 1 16:13:08 CEST 2015
Hi Danny,
So I got my code working now, and it looks like this:
    TAG = '{http://www.mediawiki.org/xml/export-0.10/}page'

    doc = etree.iterparse(wiki)
    for _, node in doc:
        if node.tag == TAG:
            title = node.find("{http://www.mediawiki.org/xml/export-0.10/}title").text
            if title in page_titles:
                print(etree.tostring(node))
            node.clear()
It's mostly giving me what I want. However, it is adding extra formatting (I
believe namespaces and attributes). I was wondering if there is a way to
strip these out when I print the node with tostring()?
Here is an example of the last few lines of my output:
[[Category:Asteroids| ]]
[[Category:Spaceflight]]</ns0:text>
<ns0:sha1>h4rxxfq37qg30eqegyf4vfvkqn3r142</ns0:sha1>
</ns0:revision>
</ns0:page>
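
One possible way to get rid of the ns0: prefixes, offered as a minimal sketch rather than a definitive fix: rewrite each tag in the page's subtree to drop the leading '{uri}' part before calling tostring(). This assumes etree is the standard library's xml.etree.ElementTree (the ns0: prefix in the output suggests so). SAMPLE and strip_namespaces are hypothetical names made up for this illustration; they stand in for the real dump and are not part of any library.

```python
import io
import xml.etree.ElementTree as etree  # assumption: the stdlib ElementTree

# Tiny stand-in for the MediaWiki dump (hypothetical sample data).
SAMPLE = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    '<page><title>Asteroid</title>'
    '<revision><text>[[Category:Spaceflight]]</text></revision>'
    '</page></mediawiki>'
)

TAG = '{http://www.mediawiki.org/xml/export-0.10/}page'


def strip_namespaces(elem):
    """Drop the '{uri}' prefix from every tag in elem's subtree, in place."""
    for child in elem.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(child.tag, str) and child.tag.startswith('{'):
            child.tag = child.tag.split('}', 1)[1]
    return elem


pages = []
for _, node in etree.iterparse(io.BytesIO(SAMPLE.encode('utf-8'))):
    if node.tag == TAG:
        strip_namespaces(node)
        # encoding='unicode' returns a str instead of bytes.
        pages.append(etree.tostring(node, encoding='unicode'))
        node.clear()  # free the memory held by the finished page

print(pages[0])
```

Because the tags no longer carry a namespace, the serializer has nothing to declare and no ns0: prefix to emit. The trade-off is that the printed XML is no longer in the MediaWiki namespace, which is fine for display but matters if another tool expects the original schema.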
On Wed, Jul 1, 2015 at 1:17 AM, Danny Yoo <dyoo at hashcollision.org> wrote:
> Hi Joshua,
>
>
>
> The issue you're encountering sounds like XML namespace issues.
>
>
> >> So I tried that code snippet you pointed me to and I'm not getting any
> >> output.
>
>
> This is probably because the tag names of the XML are being prefixed
> with namespaces. This would make the original test on node.tag too
> strict: it wouldn't exactly match the string we want, because
> there's a namespace prefix in front that's making the string mismatch.
>
>
> Try relaxing the condition from:
>
> if node.tag == "page": ...
>
> to something like:
>
> if node.tag.endswith("page"): ...
>
>
> This isn't quite technically correct, but we want to confirm whether
> namespaces are the issue that's preventing you from seeing those
> pages.
>
>
> If namespaces are the issue, then read:
>
> http://effbot.org/zone/element-namespaces.htm
>
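
The mismatch described in the quoted message can be seen in a small sketch. It assumes the standard library's xml.etree.ElementTree, which stores a namespaced tag as '{uri}localname'. The one-page document and the 'mw' prefix are made up for illustration. It also shows why endswith("page") is only a diagnostic: it would match any tag whose name happens to end in "page", whereas comparing against the fully qualified name is exact.

```python
import xml.etree.ElementTree as etree  # assumption: the stdlib ElementTree

NS = 'http://www.mediawiki.org/xml/export-0.10/'

# Hypothetical one-page document using the MediaWiki export namespace.
root = etree.fromstring(
    '<mediawiki xmlns="%s"><page><title>Asteroid</title></page></mediawiki>' % NS
)
page = root[0]

# ElementTree stores the namespace URI inside the tag itself.
print(page.tag)

is_bare_match = page.tag == 'page'              # too strict: never matches
is_suffix_match = page.tag.endswith('page')     # loose: good for diagnosis only
is_exact_match = page.tag == '{%s}page' % NS    # the precise test

# find() accepts a prefix-to-URI mapping, so the URI need not be repeated.
# The 'mw' prefix is an arbitrary local choice.
title = page.find('mw:title', {'mw': NS})
print(is_bare_match, is_suffix_match, is_exact_match, title.text)
```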