python xml DOM? pulldom? SAX?

Mon Aug 29 16:28:46 EDT 2005

jog <jo at johannageiss.de> wrote:
> Hi,
> I want to get text out of some nodes of a huge XML file (1.5 GB). The
> structure of the XML file is something like this:
> <parent>
>    <page>
>     <title>bla</title>
>     <id></id>
>     <revision>
>       <id></id>
>       <text>blablabla</text>
>     </revision>
>    </page>
>    <page>
>    </page>
>     ....
> </parent>
> I want to combine the text of page:title and page:revision:text for
> every single page element. I then want to index these combined texts
> one by one (so one index per page).
> What is the most efficient API for that: SAX (I don't think so), DOM,
> or pulldom?
> Or should I just use XPath somehow?
> I don't want to do anything else with this XML file afterwards.
> I hope someone will understand me.....
> Thank you very much
> Jog

I would use the Expat interface from Python, Awk, or even the Bash
shell.  I'm most familiar with the shell interface to Expat, which
would go something like this:

    start()		# Usage: start tag att=value ...
    {
	case $1 in
	    page) unset title text ;;
	esac
    }
    data()		# Usage: data text
    {
	case ${XML_TAG_STACK[0]}.${XML_TAG_STACK[1]}.${XML_TAG_STACK[2]} in
	    # Expat may deliver one text node in several pieces, so append.
	    title.page.*) title=$title$1 ;;
	    text.revision.page) text=$text$1 ;;
	esac
    }
    end()		# Usage: end tag
    {
	case $1 in
	    page) echo "title=$title text=$text" ;;
	esac
    }
    expat -s start -d data -e end < file.xml
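
Since the question was about Python, here is a rough sketch of the same
three-callback idea using the standard xml.parsers.expat module. The
function name iter_pages and the 64 KB read size are my own choices for
illustration, and the element names are taken from the example in the
question; treat it as a starting point rather than a tuned solution.

```python
import io
import xml.parsers.expat


def iter_pages(stream, block_size=64 * 1024):
    """Yield (title, text) for each <page> read from a binary stream.

    iter_pages and block_size are illustrative names, not part of any library.
    """
    stack = []      # names of the currently open elements, outermost first
    page = {}       # text pieces collected for the page being parsed
    finished = []   # pages completed during the current Parse() call

    def start(name, attrs):
        stack.append(name)
        if name == "page":
            page["title"] = []
            page["text"] = []

    def data(chunk):
        # Expat may split one text node into several calls, so collect pieces.
        if stack[-2:] == ["page", "title"]:
            page["title"].append(chunk)
        elif stack[-3:] == ["page", "revision", "text"]:
            page["text"].append(chunk)

    def end(name):
        stack.pop()
        if name == "page":
            finished.append(("".join(page["title"]), "".join(page["text"])))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = data
    parser.EndElementHandler = end

    # Feed the file in blocks so the whole document is never held in memory.
    while True:
        block = stream.read(block_size)
        if not block:
            break
        parser.Parse(block, False)
        yield from finished
        finished.clear()
    parser.Parse(b"", True)  # tell Expat the input is complete
    yield from finished


if __name__ == "__main__":
    sample = (b"<parent><page><title>bla</title><id></id><revision><id></id>"
              b"<text>blablabla</text></revision></page></parent>")
    for title, text in iter_pages(io.BytesIO(sample)):
        print("title=%s text=%s" % (title, text))
```

For the real file you would pass open("file.xml", "rb") instead of the
BytesIO object and hand each (title, text) pair to the indexer as it
arrives. Only the pages completed within one read block are kept around
at a time, which is what makes this workable for a file of that size.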

-- 
William Park <opengeometry at yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
	   http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
	  http://freshmeat.net/projects/bashdiff/