python xml DOM? pulldom? SAX?
William Park
opengeometry at yahoo.ca
Mon Aug 29 16:28:46 EDT 2005
jog <jo at johannageiss.de> wrote:
> Hi,
> I want to get text out of some nodes of a huge XML file (1.5 GB). The
> structure of the XML file is something like this:
> <parent>
>   <page>
>     <title>bla</title>
>     <id></id>
>     <revision>
>       <id></id>
>       <text>blablabla</text>
>     </revision>
>   </page>
>   <page>
>   </page>
>   ....
> </parent>
> I want to combine the text from page:title and page:revision:text for
> every single page element. One by one, I want to index these combined
> texts (so one index for each page).
> What is the most efficient API for that: SAX (I don't think so), DOM,
> or pulldom?
> Or should I just use XPath somehow?
> I don't want to do anything else with this XML file afterwards.
> I hope someone will understand me.....
> Thank you very much
> Jog
I would use the Expat interface from Python, Awk, or even the Bash
shell. Expat is a streaming parser, so the whole file never has to be
held in memory. I'm most familiar with the shell interface to Expat,
which would go something like this:
    start()    # Usage: start tag att=value ...
    {
        case $1 in
            page) unset title text ;;
        esac
    }

    data()     # Usage: data text
    {
        case ${XML_TAG_STACK[0]}.${XML_TAG_STACK[1]}.${XML_TAG_STACK[2]} in
            title.page.*) title=$1 ;;
            text.revision.page) text=$1 ;;
        esac
    }

    end()      # Usage: end tag
    {
        case $1 in
            page) echo "title=$title text=$text" ;;
        esac
    }

    expat -s start -d data -e end < file.xml
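
For comparison, roughly the same three-handler logic through Python's
standard xml.parsers.expat module might look like the sketch below. It
is only an illustration under some assumptions: the function name
iter_pages, the 64 KB block size, and the SAMPLE document are made up
for this example, and the element names are taken from the structure
quoted in the question.

```python
import io
import xml.parsers.expat

# Tiny stand-in for the real file, shaped like the structure in the question.
SAMPLE = b"""<parent>
  <page>
    <title>bla</title>
    <id></id>
    <revision>
      <id></id>
      <text>blablabla</text>
    </revision>
  </page>
  <page>
  </page>
</parent>"""


def iter_pages(stream, block_size=64 * 1024):
    """Yield (title, text) for each <page> read from a binary stream."""
    stack = []      # names of the currently open elements, innermost last
    current = {}    # character data collected for the page being parsed
    finished = []   # pages completed since the last yield

    def start(name, attrs):
        stack.append(name)
        if name == "page":
            current.clear()

    def data(chunk):
        # Expat may deliver one text node in several pieces, so append.
        if stack[-2:] == ["page", "title"]:
            current["title"] = current.get("title", "") + chunk
        elif stack[-3:] == ["page", "revision", "text"]:
            current["text"] = current.get("text", "") + chunk

    def end(name):
        stack.pop()
        if name == "page":
            finished.append((current.get("title", ""), current.get("text", "")))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = data
    parser.EndElementHandler = end

    # Feed the input block by block so memory use stays bounded.
    while True:
        block = stream.read(block_size)
        if not block:
            break
        parser.Parse(block, False)
        while finished:
            yield finished.pop(0)
    parser.Parse(b"", True)   # tell Expat the document is complete
    while finished:
        yield finished.pop(0)
```

For a real file, something like iter_pages(open("file.xml", "rb"))
would hand back one (title, text) pair per page, which can then be
combined and indexed one at a time. Unlike the shell version, this
sketch appends character data instead of overwriting it, because Expat
is free to split a long text node across several callbacks.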
--
William Park <opengeometry at yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/