XML Parsing

Tue Apr 1 19:44:41 EDT 2008

On Apr 1, 1:42 pm, Alok Kothari <kothari.a... at gmail.com> wrote:
> Hello,
>           I am new to XML parsing. Could you kindly tell me what's the
> problem with the following code:
>
> import xml.dom.minidom
> import xml.parsers.expat
> document = ('<token pos="nn">Letterman</token><token pos="bez">is</token>'
>             '<token pos="jjr">better</token><token pos="cs">than</token>'
>             '<token pos="np">Jay</token><token pos="np">Leno</token>')
>
> # 3 handler functions
> def start_element(name, attrs):
>     print 'Start element:', name, attrs
> def end_element(name):
>     print 'End element:', name
> def char_data(data):
>     print 'Character data:', repr(data)
>
> p = xml.parsers.expat.ParserCreate()
>
> p.StartElementHandler = start_element
> p.EndElementHandler = end_element
> p.CharacterDataHandler = char_data
> p.Parse(document, 1)
>
> OUTPUT:
>
> Start element: token {u'pos': u'nn'}
> Character data: u'Letterman'
> End element: token
>
> Traceback (most recent call last):
>   File "C:/Python25/Programs/eg.py", line 20, in <module>
>     p.Parse(document, 1)
> ExpatError: junk after document element: line 1, column 33

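expat stops with "junk after document element" because a well-formed
XML document must have exactly one root element, and your string has
six top-level <token> elements. Wrapping the string in a single
enclosing element would let expat parse it. Alternatively, I don't
know if you are aware of the BeautifulSoup module, which accepts
fragments like this as they are: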

import BeautifulSoup as bs

xml = ('<token pos="nn">Letterman</token><token pos="bez">is</token>'
       '<token pos="jjr">better</token><token pos="cs">than</token>'
       '<token pos="np">Jay</token><token pos="np">Leno</token>')

doc = bs.BeautifulStoneSoup(xml)

tokens = doc.findAll("token")
for token in tokens:
    for attr in token.attrs:
        print "%s : %s" % attr

    print token.string

--output:--
pos : nn
Letterman
pos : bez
is
pos : jjr
better
pos : cs
than
pos : np
Jay
pos : np
Leno
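
If you would rather stay with expat, here is a minimal sketch of one way
to get the same result. It assumes the only problem is the missing root
element, so it wraps the fragment in a made-up <root> element before
parsing. The wrapper name "root" and the helper parse_tokens() are my own
inventions for illustration, not part of your code or of expat, and the
sketch uses Python 3 syntax (print as a function) rather than the Python
2.5 shown above.

```python
import xml.parsers.expat

# Same six <token> elements as in the original post, with no root element.
fragment = ('<token pos="nn">Letterman</token><token pos="bez">is</token>'
            '<token pos="jjr">better</token><token pos="cs">than</token>'
            '<token pos="np">Jay</token><token pos="np">Leno</token>')


def parse_tokens(fragment):
    """Return a list of (pos, text) pairs, one per <token> element.

    Hypothetical helper: wraps the rootless fragment in <root>...</root>
    so that expat sees a single document element.
    """
    tokens = []   # finished (pos, text) pairs
    current = []  # [pos, text so far] while inside a <token>, else empty

    def start_element(name, attrs):
        if name == "token":
            current[:] = [attrs.get("pos"), ""]

    def char_data(data):
        # expat may deliver text in several chunks, so accumulate it.
        if current:
            current[1] += data

    def end_element(name):
        if name == "token" and current:
            tokens.append(tuple(current))
            del current[:]

    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = start_element
    p.EndElementHandler = end_element
    p.CharacterDataHandler = char_data
    # "root" is an arbitrary wrapper name chosen for this sketch.
    p.Parse("<root>%s</root>" % fragment, True)
    return tokens


for pos, text in parse_tokens(fragment):
    print("pos : %s" % pos)
    print(text)
```

The handlers ignore the wrapper element itself, so the printed lines
should match the BeautifulSoup output above.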