[Tutor] python xml entity problem [xml parsing and DTD handling]

smith@jeve.org smith@jeve.org
Wed, 4 Sep 2002 17:40:34 -0400


On Tue, Sep 03, 2002 at 04:21:41PM -0700, Danny Yoo wrote:
> 
> 
> On Tue, 3 Sep 2002 smith@rfa.org wrote:
> 
> > I'm interested in parsing an XML file using the Python tools in Debian
> > woody. Everything seems to be OK until I reach a "&MAN;" entity; my
> > Python script just passes over it. My guess is that I have an external
> > entity resolver problem. I've been reading the O'Reilly Python & XML
> > book and I believe I'm doing the right things, at least in terms of
> > non-external entities. Does anybody have any examples, or can someone
> > show me how to make the program recognize an external entity? I'm still
> > very new to Python and XML, so maybe it's something I don't understand.
> 
> Hi Smith,
> 
> Yikes.  The documentation on Python and XML, in my opinion, is just
> absolutely hideous; most of the documentation assumes that you want to
> write this stuff in Java, which is just, well, lazy!  Even the 'Python &
> XML' book minimally touches external references.
> 
> 
> Forgive me for the rant; I'm just so frustrated by the lack of
> documentation in this area.  It just seems that the documentation and
> examples could be greatly improved.  Perhaps we can do an xml-parsing
> thing this week and send our examples over to the PyXML folks...  Hmmmm.

I agree, the documentation is pretty weak for external references. Thanks
for your input on this. I responded to your message last night, but it
looks like you didn't get it. I've learned many things from the list, so
if I could somehow give back by sending my (our) examples to the PyXML
folks, that would be cool. The only problem is that I'm still very new to
Python and XML, so I'm not sure how much I could contribute.

> For XML processing with DTD's to work well, it appears that you need to
> give the address of a real DTD --- otherwise, I think the system will just
> ignore external references outright!  The line:
> 
> 
> <!DOCTYPE schedule SYSTEM "ftp://something.org/pub/xml_files/program.dtd">
> 
> 
> should be changed to point to a real url that your system can find ---
> xml.sax will use the 'urllib' module to grab DTD information from this, so
> it's not enough to put a fill-me-in sort of thing: the DTD url has to be
> real.

Of course. I wasn't sure if it was OK to put the real one out there.
The real one is:

ftp://techweb.rfa.org/pub/r-boss/xml_files/schedule.dtd
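
For anyone following along, here is a minimal sketch of how the "passes
over it" symptom can be made visible. It assumes Python 3's xml.sax with
the default expat driver (the quoted example further down uses Python 2
print statements), and the names SkipReporter and find_skipped are made up
for illustration. When the external DTD is not read, the parser appears to
report the undeclared entity through skippedEntity() instead of
characters(), so a handler that ignores skippedEntity() never sees it:

```python
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

# Hypothetical document: the entity is declared only in an external DTD,
# and that DTD is never read.
XML_TEXT = b"""<?xml version="1.0"?>
<!DOCTYPE schedule SYSTEM "schedule.dtd">
<schedule><service_id>&MAN;</service_id></schedule>"""


class SkipReporter(ContentHandler):
    """Collects character data and the names of entities the parser skipped."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.skipped = []

    def characters(self, content):
        self.text.append(content)

    def skippedEntity(self, name):
        # Called for a reference the parser could not expand.
        self.skipped.append(name)


def find_skipped(xml_bytes):
    """Return (skipped entity names, collected text) for xml_bytes."""
    handler = SkipReporter()
    parser = xml.sax.make_parser()
    # Explicitly do not load external entities, so the DTD is not fetched.
    parser.setFeature(feature_external_ges, False)
    parser.setContentHandler(handler)
    parser.feed(xml_bytes)
    parser.close()
    return handler.skipped, "".join(handler.text)


if __name__ == "__main__":
    skipped, text = find_skipped(XML_TEXT)
    print("skipped entities:", skipped)
    print("text seen:", repr(text))
```

If the skipped list is not empty, the entity declarations were never
loaded, which points at the DTD lookup rather than at the content handler.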

> Also, there's a bug on line 3 of your DTD:
> 
> <!---  --->
> <!ELEMENT pgm_block(id?,arch?,air_date?,air_time?, ...
> <!---  --->
> 
> You need a space between the element name and the open parenthesis
> character.  Doh!  *grin*

Yeah, when I copied those lines from the original dtd, the space between
"pgm_block" and "(id?,arch?,air_date?,air_time?, ..." got deleted.

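The effect of that missing space can be checked with a small sketch,
assuming Python 3's xml.sax with the expat driver. The two-element DTD and
the helper names below are made up for the test and are not the real
schedule.dtd:

```python
import xml.sax
from xml.sax.handler import ContentHandler


def make_document(separator):
    """Build a tiny document whose ELEMENT declaration uses `separator`
    between the element name and the opening parenthesis."""
    return (
        '<?xml version="1.0"?>\n'
        "<!DOCTYPE pgm_block [\n"
        "    <!ELEMENT pgm_block" + separator + "(id?)>\n"
        "    <!ELEMENT id (#PCDATA)>\n"
        "]>\n"
        "<pgm_block><id>1</id></pgm_block>"
    )


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        xml.sax.parseString(xml_text.encode("utf-8"), ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True


if __name__ == "__main__":
    print("with space:   ", is_well_formed(make_document(" ")))
    print("without space:", is_well_formed(make_document("")))
```

Only the version with the space should get past the parser.
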
> Once we make those fixes, then everything should be ok.  I've written an
> example that uses a DTD that's embedded in the xml, so that the system
> doesn't have to try looking for it online.

Now I'm really confused. I've gone over the documentation many times and
my program still passes over the external entity. Thanks for your sample;
it works great, and I appreciate the time you put into it. The XML files
I'm working with are generated from another program, though, so I'm not
sure if I would be able to use an embedded DTD.
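
One possible way around editing the generated files is to leave the
DOCTYPE line alone and hand the parser a local copy of the DTD through an
entity resolver. This is only a sketch under assumptions: it is written
for Python 3's xml.sax with the expat driver, the DTD text is just two of
the entity declarations from the sample (not the real schedule.dtd), and
LocalDTDResolver, TextCollector and parse_with_local_dtd are made-up names:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

DTD_URL = "ftp://techweb.rfa.org/pub/r-boss/xml_files/schedule.dtd"

# Assumed stand-in for the DTD: only entity declarations, kept in memory.
LOCAL_DTDS = {
    DTD_URL: b'<!ENTITY MAN "Mandarin">\n<!ENTITY KOR "Korean">',
}

XML_TEXT = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE schedule SYSTEM "' + DTD_URL + '">\n'
    "<schedule><service_id>&MAN;</service_id></schedule>"
).encode("utf-8")


class LocalDTDResolver(EntityResolver):
    """Serves known system identifiers from LOCAL_DTDS, never the network."""

    def resolveEntity(self, publicId, systemId):
        if systemId not in LOCAL_DTDS:
            raise xml.sax.SAXException("no local copy of %r" % systemId)
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(LOCAL_DTDS[systemId]))
        return source


class TextCollector(ContentHandler):
    """Collects all character data, including expanded entity text."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_with_local_dtd(xml_bytes):
    handler = TextCollector()
    parser = xml.sax.make_parser()
    # External entities must be enabled for the resolver to be consulted.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(LocalDTDResolver())
    parser.setContentHandler(handler)
    parser.feed(xml_bytes)
    parser.close()
    return "".join(handler.chunks)


if __name__ == "__main__":
    print(repr(parse_with_local_dtd(XML_TEXT)))
```

The resolver refuses any system identifier it does not know, so nothing is
fetched over FTP; whether that fits depends on how often the real DTD
changes.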


regards,

SA

> 
> 
> ######
> import xml.sax
> 
> xml_text = \
> """<?xml version='1.0' encoding="UTF-8" standalone="no"?>
> <!DOCTYPE schedule [
>     <!ELEMENT schedule (pgm_block,segment*)>
>     <!ELEMENT pgm_block (id?,arch?,air_date?,air_time?,service_id?,
>                          block_time?,sch_status,mc,producer)>
>     <!ELEMENT id (#PCDATA)>
>     <!ELEMENT arch (#PCDATA)>
>     <!ELEMENT air_date (#PCDATA)>
>     <!ELEMENT air_time (#PCDATA)>
>     <!ELEMENT service_id (#PCDATA)>
>     <!ENTITY  BUR "Burmese">
>     <!ENTITY  KHM "Cambodian">
>     <!ENTITY  CAN "Cantonese">
>     <!ENTITY  KOR "Korean">
>     <!ENTITY  LAO "Lao">
>     <!ENTITY  MAN "Mandarin">
>     <!ENTITY  TIB "Tibetan">
>     <!ENTITY  UYG "Uyghur">
>     <!ENTITY  VIE "Vietnamese">
> ]>
> <?xml:stylesheet
> type="text/xsl" href="ftp://something.org/pub/xml_files/program.xsl"?>
> <schedule>
> <pgm_block>
>         <id></id>
>         <arch>http://something/MAN/2000/02/test.mp3</arch>
>         <air_date>2000/02/02</air_date>
>         <air_time>16:00</air_time>
>         <service_id>&MAN;</service_id>
>         <block_time>00:70:00</block_time>
>         <sch_status>archive</sch_status>
>         <mc>AW</mc>
>         <producer>AW</producer>
>         <editor>XZ</editor>
> </pgm_block>
> </schedule>"""
> 
> 
> class MyContentHandler(xml.sax.handler.ContentHandler):
>     def __init__(self):
>         xml.sax.handler.ContentHandler.__init__(self)
>         self.indent = 0
> 
>     def startElement(self, name, attrs):
>         self.printIndent()
>         print "I see the start of", name
>         self.indent += 4
> 
>     def endElement(self, name):
>         self.indent -= 4
>         self.printIndent()
>         print "I see the end of", name
> 
> 
>     def characters(self, text):
>         if text.strip():
>             self.printIndent()
>             print "I see characters", repr(text)
> 
>     def printIndent(self):
>         print " " * self.indent,
> 
> 
> if __name__ == '__main__':
>     parser = xml.sax.make_parser()
>     handler = MyContentHandler()
>     parser.setContentHandler(handler)
>     parser.feed(xml_text)
>     parser.close()
> #######
> 
> 
> 
> I hope this helps!  If you have more questions, please feel free to ask.