From larsga@step.de Fri May 1 11:34:24 1998 From: larsga@step.de (Lars Marius Garshol) Date: Fri, 01 May 1998 12:34:24 +0200 Subject: [XML-SIG] Re: State of the world Message-ID: <3549A530.E5BE4CC4@step.de> Andrew Kuchling wrote: > > What would xmlproc buy that the other don't--validation, probably? Yes. Validation, access to the DTD and the differences in interpretation of the markup that spring from this. The non-validating parser (the validating one is built on top of this) is also faster than xmllib (at least in their current versions, neither have been tweaked for speed yet). > xmldtd.py seems fairly independent of xmlproc (at least from > my superficial look at the code), so it would probably be a good > idea. It is rather independent. Although the DTD-parser is in xmlproc.py and has some strings attached there, it can probably be extracted fairly easily. I'll see what I can do. > We could provide parser subclasses that assemble an object > representing the DTD as they parse xmldtd.py has this already, but like you say we should have the same for xmltok as well. I plan to restructure xmldtd.py and xmlval.py a bit, so maybe I can generalize them to the point where they can sit on top of xmltok as well. --Lars M. From larsga@step.de Sun May 3 10:08:11 1998 From: larsga@step.de (Lars Marius Garshol) Date: Sun, 03 May 1998 11:08:11 +0200 Subject: [XML-SIG] State of the world References: <13639.20813.303237.521754@newcnri.cnri.reston.va.us> <354896FC.EDBFE1C9@step.de> <13640.49635.141273.252166@newcnri.cnri.reston.va.us> <3548D1C5.1C8785D8@technologist.com> Message-ID: <354C33FB.31F72D74@step.de> Paul Prescod wrote: > > xmllib > came first, and was a great contribution to the Python library. But it may > or may not be the best package to take us into the future. Why can't we take several packages with us? 
xmllib does have the advantages of being simple to use and having the same interface as sgmllib and htmllib, so it fits better in the Python standard distribution than the larger and more complex xmlproc and XML-Toolkit. > I haven't > looked at its native interface since the early days because I always use > SAX (with whatever parser is around). Since native interfaces probably > aren't that interesting, we should just figure out what gives the best SAX > performance (or can be tweaked to). Well, SAX doesn't give access to all information about a document, so native interfaces are interesting where you want to go beyond what SAX offers. SAX level 2 may happen at some point, but until it does native interfaces remain interesting in some cases. --Lars M. From akuchlin@cnri.reston.va.us Tue May 5 16:49:17 1998 From: akuchlin@cnri.reston.va.us (Andrew Kuchling) Date: Tue, 5 May 1998 11:49:17 -0400 (EDT) Subject: [XML-SIG] Re: saxlib 1.0beta In-Reply-To: <354F2AC1.95F6DD67@step.de> References: <13646.5522.116243.65383@newcnri.cnri.reston.va.us> <354EBD1D.8AE2CED0@step.de> <13647.9836.230760.746014@newcnri.cnri.reston.va.us> <354F2AC1.95F6DD67@step.de> Message-ID: <13647.12442.68887.467809@newcnri.cnri.reston.va.us> Lars Marius Garshol writes (in private e-mail): >Andrew Kuchling wrote (about using the SAX implementation with xmllib): >> Another thing I noticed: how would you go about changing the >> available entity definitions? That is, I want to handle a £ >> entity, or something else. Is the accepted way of doing this going to >> be parser-specific? > >No. The accepted way to do it is this: > > >]> > >This will make conformant non-validating parsers (xmlproc, XP and some >others, but not xmllib) insert $ whenever they see $. OK. To handle this, you could override the handle_doctype() method, so saxlib's drv_xmllib could do that. 
On the other hand, if it's required for conformance, perhaps this behaviour should be added to xmllib.py, so that simple uses of xmllib are still as conformant as possible. Opinions? >[AMK] >> Right now you'd do it for xmllib by changing the >> entitydefs attribute of the XMLParser object, so it's handled outside >> of SAX. > >This is because xmllib does not really read the internal doctype, which >it (according to the letter of the standard) should. >[AMK] >> Does SAX really not define anything for processing entities? > >Only for external entities where you can remap the system identifier >(what non-SGML-ers call a URI). > >Since XML provides a way to handle this I don't think SAX should do >it. -- A.M. Kuchling http://starship.skyport.net/crew/amk/ It was a wasted life, but God forbid that one should be hard upon it, or upon anything in this world that is not deliberately and coldly wrong . . . -- Charles Dickens, in a letter to his friend John Forster. From larsga@step.de Tue May 5 17:12:27 1998 From: larsga@step.de (Lars Marius Garshol) Date: Tue, 05 May 1998 18:12:27 +0200 Subject: [XML-SIG] Re: saxlib 1.0beta Message-ID: <354F3A6B.6F33A287@step.de> Andrew Kuchling wrote: > > OK. To handle this, you could override the handle_doctype() > method, so saxlib's drv_xmllib could do that. On the other hand, > if it's required for conformance, perhaps this behaviour should be > added to xmllib.py, so that simple uses of xmllib are still as > conformant as possible. Opinions? I think this is something xmllib should handle, if Sjoerd (or someone else) wants to implement it. It is actually difficult to implement it in the driver because the entity itself must also be parsed, which can result in new element, PI, entity and what-have-you events. So in effect the entity value must be passed back to the parser to be parsed by it, and I see no easy way of implementing that with the official xmllib interface. (Maybe there are undocumented methods that make this easy.) 
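The internal-subset approach discussed above can be sketched with today's standard-library xml.sax (Expat-based), which is an assumption here and not the xmllib or saxlib code of this thread. The document, the element name "doc" and the entity name "pound" are made up for illustration; the point is only that a parser which reads the internal DTD subset expands the entity without any parser-specific setup.

```python
import xml.sax

# Hypothetical document: the internal DTD subset declares a "pound" entity.
DOC = (
    b'<!DOCTYPE doc [\n'
    b'  <!ENTITY pound "&#163;">\n'
    b']>\n'
    b'<doc>Price: &pound;5</doc>'
)


class TextCollector(xml.sax.ContentHandler):
    """Collects character data; the parser has already expanded entities."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # Character data may arrive in several pieces, so buffer it.
        self.chunks.append(content)


def document_text(data):
    """Parse the bytes in `data` and return all character data as one string."""
    handler = TextCollector()
    xml.sax.parseString(data, handler)
    return "".join(handler.chunks)


text = document_text(DOC)
print(text)
```

Nothing in the handler knows about the entity: the declaration travels with the document, which is the argument for leaving entity definitions out of the SAX interface.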
Actually, once I release the next version of xmlproc (RSN) the DTD parser there could easily be used to run through the internal DTD subset to get the entity declarations there. --Lars M. From jeff@Digicool.com Tue May 5 18:19:27 1998 From: jeff@Digicool.com (Jeffrey P Shell) Date: Tue, 05 May 1998 13:19:27 -0400 Subject: [XML-SIG] object sites and broadcast news Message-ID: <199805051706.NAA14239@gator.digicool.com> One of the issues about Python is that it's "not well known". We know this (hence this list) all too well. A good start (I believe) in broadcasting the message further is making sure some of the more major Object sites receive news about Python and projects using it. And that announcements to these sites are consistent (i.e., one month someone only submits a news item to Objectnews, next month someone submits a small paragraph about something else to Scripting news. If both are hit each time, it helps make Python appear more important). For example, Objectnews ( http://www.objectnews.com/ ) seems to have taken a liking to Python. Visit the site and search the current news for Python. There are a couple of items about JPython and also the 1.5.1 update (and I think I saw something about Fnorb on there too), along with a little bit of a nag for readers to replace their older scripting language with Python due to its OO features. It also mentions that Python is still somewhat of a silent grower, increasing in popularity but still nowhere close to a household name. I propose that some of the lists and object news sites need to be updated. Yahoo and Cetus (www.cetus-links.org) should be updated, and other link directories like Developer.com should have some Python links added. (i.e., submitting the XML-SIG page to the XML Programming directory at Developer.com and other XML lists people are aware of). Bobo should be pushed as another _free_ Object Oriented alternative to CGI (besides Java Servlets), along with mentions of the BoboPOS and DTML.
Mention of commercial successes with Bobo and its commercial sibling Principia should get out too, along with any other commercial success using Python. More white papers, especially ones on _proven_ technologies like Acquisition, should also be pushed. Links to white papers on technologies and techniques never completed should be removed. I propose that the Advocacy sig decide which directories need to be updated (Yahoo, etc.), and which news sites (e.g., Objectnews) should receive news announcements about Python and Python-oriented projects. This list should be tracked somewhere (Andrew's advocacy pages *nudge nudge, wink wink*) and everyone should take some responsibility for notifying the sites of new announcements, etc. As commercial products like Principia or free tools like Bobo get updated or released, some sort of announcement should go out to as many of these as possible, like a press release. Organizations should take responsibility for announcing their own projects/products. SIGs should also (independently) be active in submitting links and progress announcements to sites related to the SIG's objectives. Let the world see what we're doing with XML, Threading, Database integration, yada yada yada. My first contributions of sites for submissions are: Yahoo - only 13 links, compared with 115 for Perl Cetus-links - http://www.cetus-links.org/oo_python.html News sites: Objectnews - http://www.objectnews.com/ - These guys have a few Python snippets of news (primarily Python 1.5.1, JPython releases, and FNorb) and also seem to have a growing pro-Python stance. Scripting - http://www.scripting.com/ - Yes, home of Frontier. But these guys do seem to have an interest in other cross-platform scripting solutions as well. Their news items encompass a lot of different things, and there was even a mention of Principia a few weeks back. The XML-SIG knows the growing weight these guys are putting on XML. -- There are other sites as well.
We should keep the list relatively small to ease the burden on the people who have to post messages to all of them. The news sites should probably be sites that will have respect for the features of Python. Thoughts, submissions? (note : cross-posted to XML-SIG and the Bobo-list) -- "Green Tony squeeled and I'm off to Galaxy X" .jPS jeff@Digicool.com Digital Creations http://www.digicool.com/ "The unbeatable system engenders rot" From jeff@Digicool.com Tue May 5 20:19:00 1998 From: jeff@Digicool.com (Jeffrey P Shell) Date: Tue, 05 May 1998 15:19:00 -0400 Subject: [XML-SIG] Re: object sites and broadcast news Message-ID: <199805051905.PAA15519@gator.digicool.com> Markus Fleck wrote: >Jeffrey P Shell wrote: >> Scripting - http://www.scripting.com/ - Yes, home of Frontier. But these guys >> do seem to have an interest in other cross-platform scripting solutions as well. > >Not really. They sell their own scripting language. They are just >interested >in *interfacing* to Python, not really in *using* Python themselves. >This is >what their "RPC over XML" idea is all about. But they do display news about other things (I think I should have mentioned that the name of the front-page on Scripting.com is "Scripting News", which is _not_ "Frontier News"). It's usually things that relate in some fashion to Frontier, but like I mentioned, Principia got a nod on their site (and Principia could actually be viewed as competition to Frontier with its object-oriented database and ability to run dynamic sites, et al), Hypercard gets nods, etc. I think they have an interest in showing what else is being done that is in the same line as their ideas in order to give their ideas more weight. I think using this as an opportunity to evangelize Python is very good. To me, Frontier has a lot of interesting ideas, but I've never been able to wrap my brain around implementing anything with it (its interface is clumsy to me, its database concept is strange to me, etc.).
On the other hand, I was able to pick Python up right off the bat. Others might feel similar. And vice-versa. But since they have news about things being done with scripting languages in general, we may as well make use of that. Userland is trying to make a lot of noise with XML and Frontier, and that noise is starting to get heard. I think the Python community should do the same. I think the concept of XML as not only RPC but as an object serialization format holds a lot of merit. Being able to dump out an object's data from Python into XML and view it in MSXML, turn it into a (future) Principia XML-Document object, treat it with XSL, and chew it up into Perl or something is really cool. Or even the ability to store a fairly human readable XML file for years and be able to bring its data up into whatever the fad programming language of 2005 will be. I don't know how strongly Python tools like HTMLGen were pushed (especially in the hey-day). I think Bobo and DocumentTemplate could have been (and could still be) pushed a little bit more. XML has good importance potential, it would be a shame if the Python community came out with cool tools six months too late and with little or no fanfare. I don't think that Python should be re-geared around XML proclaiming XML as "The Next Big Thing" (I'm still laughing at Wired's "Everything is going to be Push, kiss your browser goodbye!" article), but we've got some cool stuff here. Let's make it known! 
-- "Green Tony squeeled and I'm off to Galaxy X" .jPS jeff@Digicool.com Digital Creations http://www.digicool.com/ "The unbeatable system engenders rot" From akuchlin@cnri.reston.va.us Wed May 6 16:49:49 1998 From: akuchlin@cnri.reston.va.us (Andrew Kuchling) Date: Wed, 6 May 1998 11:49:49 -0400 (EDT) Subject: [XML-SIG] Bits of XML HOWTO available Message-ID: <13648.33962.932652.864741@newcnri.cnri.reston.va.us> I've put the first few pieces of what will someday become the XML HOWTO at: http://www.python.org/doc/howto/xml/ Currently I've written the "introduction to XML" section, and some brief explanations of what SAX and DOM are. There's no burning need for people to look at it, though do feel free to offer comments on what's there. The reference section currently just lists the relevant modules; there are about 20 of them. This means the reference section is going to be quite large; I'm probably going to split things into separate tutorial and reference documents. Anyway, could the SAX and DOM authors please take a look at the list below, and tell me if any of these modules should *not* be documented (because they're for internal use, or because they're outdated)? Did I miss any modules? Thanks... 6.1 xml 6.2 xml.dom 6.2.1 xml.dom.builder 6.2.2 xml.dom.core 6.2.3 xml.dom.esis_builder 6.2.4 xml.dom.html_builder 6.2.5 xml.dom.sax_builder 6.2.6 xml.dom.transform 6.2.7 xml.dom.transformer 6.2.8 xml.dom.walker 6.2.9 xml.dom.writer 6.3 xml.marshal 6.4 xml.sax 6.4.1 xml.sax.drivers 6.4.2 xml.sax.drivers.drv_xmllib 6.4.3 xml.sax.drivers.drv_xmlproc 6.4.4 xml.sax.drivers.drv_xmlproc_val 6.4.5 xml.sax.drivers.drv_xmltok 6.4.6 xml.sax.drivers.drv_xmltoolkit 6.4.7 xml.sax.saxexts 6.4.8 xml.sax.saxexts.ParserFactory 6.4.9 xml.sax.saxlib 6.4.18 xml.sax.saxutils I'm not sure if the different SAX drivers should be documented; perhaps they should be treated as internal modules, used by a factory function. -- A.M. Kuchling http://starship.skyport.net/crew/amk/ ... 
lies and half-truths and deceits. Our society is built on them. Our churches and schools and politicians deal in them. Your entire lives are awash with them. -- The Truth, in ENIGMA #2: "The Truth" From larsga@step.de Wed May 6 18:18:33 1998 From: larsga@step.de (Lars Marius Garshol) Date: Wed, 06 May 1998 19:18:33 +0200 Subject: [XML-SIG] Bits of XML HOWTO available References: <13648.33962.932652.864741@newcnri.cnri.reston.va.us> Message-ID: <35509B69.FBC55552@step.de> Andrew Kuchling wrote: > > Anyway, could the SAX and DOM authors please take a look at the list > below, and tell me if any of these modules should *not* be documented > (because they're for internal use, or because they're outdated)? Did > I miss any modules? Thanks... > > 6.4.7 xml.sax.saxexts > 6.4.8 xml.sax.saxexts.ParserFactory > 6.4.9 xml.sax.saxlib > 6.4.18 xml.sax.saxutils These are quite OK to document. They are up to date and will stay in the library. Additional modules/classes will probably be added with time. > I'm not sure if the different SAX drivers should be documented; > perhaps they should be treated as internal modules, used by a factory > function. They are written to be identical in interface, but since the underlying parsers have different capabilities what could usefully be documented about them is of course what they _don't_ implement. That's obviously going to be quite a bit of work and require a lot of updating, but will probably be very valuable. If you don't want to take on all that work I think there should just be a list of the drivers with brief descriptions. --Lars M. 
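The "identical interface behind a factory" idea can be sketched with the modern standard-library descendant of this package, xml.sax.make_parser(), which is an assumption here and not the saxexts.ParserFactory of 1998. The handler class and the sample document are hypothetical; the handler only sees the SAX callbacks and never learns which driver the factory picked.

```python
import io
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Counts how often each element name occurs in a document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


# The factory chooses a driver; application code depends only on SAX.
parser = xml.sax.make_parser()
handler = ElementCounter()
parser.setContentHandler(handler)

# Hypothetical sample document, passed as a file-like object.
parser.parse(io.BytesIO(b"<list><item/><item/><note/></list>"))
print(handler.counts)
```

Because the application touches only the factory and the handler interface, what remains driver-specific is exactly what the message above suggests documenting: the features a given driver does not implement.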
From Jack.Jansen@cwi.nl Wed May 6 22:34:09 1998 From: Jack.Jansen@cwi.nl (Jack Jansen) Date: Wed, 06 May 1998 23:34:09 +0200 Subject: [XML-SIG] Bits of XML HOWTO available In-Reply-To: Message by Lars Marius Garshol , Wed, 06 May 1998 19:18:33 +0200 , <35509B69.FBC55552@step.de> Message-ID: Recently, Lars Marius Garshol said: > That's obviously going to be quite a bit of work and require a lot of > updating, but will probably be very valuable. If you don't want to take > on all that work I think there should just be a list of the drivers with > brief descriptions. Just a quick note in case you're going to update the documentation: I've renamed the xmltok module to pyexpat. James Clark's xmltok has been renamed to expat (xmltok was formerly the name of both the tokenizer and the whole parser around it), and because the two xmltok's (the original C one and the Python wrapper around it) have already caused confusion I've decided to go for pyexpat. I'll put together a new distribution shortly, but aside from the name change and Mac support there's nothing new really (not in expat either, as far as I could see). -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen@cwi.nl | ++++ if you agree copy these lines to your sig ++++ http://www.cwi.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From papresco@technologist.com Thu May 7 12:25:49 1998 From: papresco@technologist.com (Paul Prescod) Date: Thu, 07 May 1998 07:25:49 -0400 Subject: [XML-SIG] Re: saxlib 1.0beta References: <13646.5522.116243.65383@newcnri.cnri.reston.va.us> <354EBD1D.8AE2CED0@step.de> <13647.9836.230760.746014@newcnri.cnri.reston.va.us> <354F2AC1.95F6DD67@step.de> <13647.12442.68887.467809@newcnri.cnri.reston.va.us> Message-ID: <35519A3D.320176D9@technologist.com> > OK. To handle this, you could override the handle_doctype() > method, so saxlib's drv_xmllib could do that. 
On the other hand, > if it's required for conformance, perhaps this behaviour should be > added to xmllib.py, so that simple uses of xmllib are still as > conformant as possible. Opinions? Okay, I'm just going to come out and be a jerk: I only think we should continue to add stuff to xmllib if we can justify its superiority over xmlproc. I would much rather add an sgmllib-like driver for xmlproc and then optimize the hell out of a *single* XML processor rather than having two in wide usage. xmllib was a simple tokenizer to get us off of the ground, but xmlproc seems more scalable (e.g. to DTD information, entities, etc.) For some future Python upgrade, we should also consider deprecating the sgmllib interface in favour of the SAX interface for the same reasons that we would deprecate a socket interface that was too divergent from that used in other languages. Paul Prescod - http://itrc.uwaterloo.ca/~papresco "Perpetually obsolescing and thus losing all data and programs every 10 years (the current pattern) is no way to run an information economy or a civilization." - Stewart Brand, founder of the Whole Earth Catalog http://www.wired.com/news/news/culture/story/10124.html From larsga@step.de Thu May 7 12:46:27 1998 From: larsga@step.de (Lars Marius Garshol) Date: Thu, 07 May 1998 13:46:27 +0200 Subject: [XML-SIG] Re: saxlib 1.0beta References: <13646.5522.116243.65383@newcnri.cnri.reston.va.us> <354EBD1D.8AE2CED0@step.de> <13647.9836.230760.746014@newcnri.cnri.reston.va.us> <354F2AC1.95F6DD67@step.de> <13647.12442.68887.467809@newcnri.cnri.reston.va.us> <35519A3D.320176D9@technologist.com> Message-ID: <35519F13.DFF83B0B@step.de> Paul Prescod wrote: > > Okay, I'm just going to come out and be a jerk: Well, this should generate some traffic, if nothing else. > I only think we should > continue to add stuff to xmllib if we can justify its superiority over > xmlproc. [...] 
and > then optimize the hell out of a *single* XML processor rather than having > two in wide usage. How likely is it that other people than me will work on xmlproc? And how likely is it that having xmllib around will mean that less resources go into improving xmlproc? (These questions are not meant rhetorically.) > I would much rather add an sgmllib-like driver for xmlproc SAX, rather, to get access to PyExpat as well. > For some future Python upgrade, we should also consider deprecating the > sgmllib interface in favour of the SAX interface for the same reasons that > we would deprecate a socket interface that was too divergent from that > used in other languages. Here we differ: to me, a major part of the point of using Python is that I can easily do stuff that's impossible/much less convenient in ordinary languages (Java/C++ etc). I like the *mllib interfaces precisely because they are divergent. I know this hurts CORBA/JPython integration, but surely we can have the best of both worlds? --Lars M. (who dreams of designing a SAX alternative in Common Lisp, to see how nice XML processing _really_ could be) From papresco@technologist.com Thu May 7 14:00:11 1998 From: papresco@technologist.com (Paul Prescod) Date: Thu, 07 May 1998 09:00:11 -0400 Subject: [XML-SIG] Re: saxlib 1.0beta References: <13646.5522.116243.65383@newcnri.cnri.reston.va.us> <354EBD1D.8AE2CED0@step.de> <13647.9836.230760.746014@newcnri.cnri.reston.va.us> <354F2AC1.95F6DD67@step.de> <13647.12442.68887.467809@newcnri.cnri.reston.va.us> <35519A3D.320176D9@technologist.com> <35519F13.DFF83B0B@step.de> Message-ID: <3551B05A.7E7D2745@technologist.com> Lars Marius Garshol wrote: > How likely is it that other people than me will work on xmlproc? And how > likely is it that having xmllib around will mean that less resources go > into improving xmlproc? (These questions are not meant rhetorically.) If I need a new feature I am more likely to tweak code that is 95% "done" than code that is 30% done. 
But more important, I much less often have to tweak code that is almost done. > > I would much rather add an sgmllib-like driver for xmlproc > > SAX, rather, to get access to PyExpat as well. So you mean an sgmllib-like driver for SAX? I guess that would be okay as long as it was a short-term hack. Over the long term, sgmllib is slow enough. I wouldn't want to add a major layer of indirection to (e.g.) grail. > Here we differ: to me, a major part of the point of using Python is that > I can easily do stuff that's impossible/much less convenient in ordinary > languages (Java/C++ etc). I like the *mllib interfaces precisely because > they are divergent. Sure, they are divergent, but are they easier than SAX? They don't seem so to me. They sure will not be to someone coming from a SAX background. Paul Prescod - http://itrc.uwaterloo.ca/~papresco Can we afford to feed that army, while so many children are naked and hungry? Can we afford to remain passive, while that soldier-army is growing so massive? - "Gabby" Barbadian Calypsonian in "Boots" From akuchlin@cnri.reston.va.us Thu May 7 15:31:54 1998 From: akuchlin@cnri.reston.va.us (Andrew Kuchling) Date: Thu, 7 May 1998 10:31:54 -0400 (EDT) Subject: [XML-SIG] Re: saxlib 1.0beta In-Reply-To: <35519A3D.320176D9@technologist.com> References: <13646.5522.116243.65383@newcnri.cnri.reston.va.us> <354EBD1D.8AE2CED0@step.de> <13647.9836.230760.746014@newcnri.cnri.reston.va.us> <354F2AC1.95F6DD67@step.de> <13647.12442.68887.467809@newcnri.cnri.reston.va.us> <35519A3D.320176D9@technologist.com> Message-ID: <13649.48773.765303.435308@newcnri.cnri.reston.va.us> Paul Prescod writes: >Okay, I'm just going to come out and be a jerk: I only think we should >continue to add stuff to xmllib if we can justify its superiority over >xmlproc. I would much rather add an sgmllib-like driver for xmlproc and Let me get your proposal clear; you're suggesting that we drop having a driver for using xmllib, and just use xmlproc or Expat, right?
I don't really see a problem with that; we have to distribute the .py files for DOM, SAX, and whatever else, so adding the xmlproc files to the list isn't a big problem. It's sort of unsettling that Python will ship with an XML parser that then won't be used at all by the fancier XML tools, but I don't see a way around that, unless some subset of the XML package becomes a standard part of Python. >For some future Python upgrade, we should also consider deprecating the >sgmllib interface in favour of the SAX interface for the same reasons that >we would deprecate a socket interface that was too divergent from that >used in other languages. Yes; good point. An aside about DOM: I haven't been trying to focus the SIG's interest to DOM up to this point, for two reasons. First, it's still in the working draft stage. Secondly, SAX is much closer to being frozen (has Megginson officially stamped his Java implementation as 1.0 final?). Therefore Lars M. has been busier than Stefane... DOM will also present some tricky technical problems, such as thread safety. Within a few weeks, I hope to have the following done: document the Python SAX interfaces, provide a first cut at a SAX/Python tutorial, wrap SAX + Expat + current DOM + documentation into a single .tgz file, and then we'd have a first snapshot release. (It would be nice if some Java-related code could be in that release, too. Paul, what's the status of your work with JPython?) Once that's done, hopefully some brave souls will start trying to use SAX, so comments and bug reports will begin coming in. At the same time, we can worry about JPython, about DOM, and about Unicode support for Python. I'm not sure which should come first; perhaps DOM can still wait, while we try to get a good solution for Unicode. On the other hand, Unicode support won't be added as part of the 1.5 development cycle, so it would have to wait until the next major release of Python, and that's probably a long time off. -- A.M. 
Kuchling http://starship.skyport.net/crew/amk/ Destiny smells of dust and the libraries of night. He leaves no footprints. He casts no shadow. -- From SANDMAN: "Season of Mists", episode 0 From akuchlin@cnri.reston.va.us Thu May 7 15:37:14 1998 From: akuchlin@cnri.reston.va.us (Andrew Kuchling) Date: Thu, 7 May 1998 10:37:14 -0400 (EDT) Subject: [XML-SIG] Bits of XML HOWTO available In-Reply-To: References: <35509B69.FBC55552@step.de> Message-ID: <13649.50767.270686.606613@newcnri.cnri.reston.va.us> Jack Jansen writes: >Just a quick note in case you're going to update the documentation: >I've renamed the xmltok module to pyexpat. James Clark's xmltok has OK; thanks. The Expat module is another thing I was wondering about; is it essentially finished, or does it still need some work? Last night I took a look at the module and at Expat's interface, and there didn't seem to be anything significant missing from the extension, but ISTR you once said that it still needed to be completed. -- A.M. Kuchling http://starship.skyport.net/crew/amk/ We can lick gravity, but sometimes the paperwork is overwhelming. -- Wernher Von Braun From akuchlin@cnri.reston.va.us Thu May 7 16:01:22 1998 From: akuchlin@cnri.reston.va.us (Andrew Kuchling) Date: Thu, 7 May 1998 11:01:22 -0400 (EDT) Subject: [XML-SIG] Re: saxlib 1.0beta In-Reply-To: <35519F13.DFF83B0B@step.de> References: <13646.5522.116243.65383@newcnri.cnri.reston.va.us> <354EBD1D.8AE2CED0@step.de> <13647.9836.230760.746014@newcnri.cnri.reston.va.us> <354F2AC1.95F6DD67@step.de> <13647.12442.68887.467809@newcnri.cnri.reston.va.us> <35519A3D.320176D9@technologist.com> <35519F13.DFF83B0B@step.de> Message-ID: <13649.51922.308079.993134@newcnri.cnri.reston.va.us> Lars Marius Garshol writes: >> I would much rather add an sgmllib-like driver for xmlproc >SAX, rather, to get access to PyExpat as well. This reminds me of something else. DOM is going to need to use the various drv_ files as well in order to support various parsers. 
It seems redundant to write code for making SAX work with Expat, and then have to write code for DOM via Expat. (DOM can't just sit on top of SAX because level 1 SAX doesn't provide an interface for comments along, so you lose comments when you go through SAX. This is bad for DOM-using applications that modify XML documents.) Sharing drivers would also let both SAX and DOM use an ESIS driver, or anything else that gets written. Therefore, should the xml.sax.drivers package be moved up a level, to xml.drivers? -- A.M. Kuchling http://starship.skyport.net/crew/amk/ The young man's mother had died bringing him into the world; she gave him life, a small wooden finger-ring, and the name Vassily. There have been worse legacies. -- The grandfather's tale in SANDMAN #38: "The Hunt" From papresco@technologist.com Thu May 7 18:01:09 1998 From: papresco@technologist.com (Paul Prescod) Date: Thu, 07 May 1998 13:01:09 -0400 Subject: [XML-SIG] Re: saxlib 1.0beta References: <13646.5522.116243.65383@newcnri.cnri.reston.va.us> <354EBD1D.8AE2CED0@step.de> <13647.9836.230760.746014@newcnri.cnri.reston.va.us> <354F2AC1.95F6DD67@step.de> <13647.12442.68887.467809@newcnri.cnri.reston.va.us> <35519A3D.320176D9@technologist.com> <13649.48773.765303.435308@newcnri.cnri.reston.va.us> Message-ID: <3551E8D5.8ACFAC2B@technologist.com> Andrew Kuchling wrote: > > Let me get your proposal clear; you're suggesting that we drop > having a driver for using xmllib, and just use xmlproc or Expat, > right? No. I guess I'm suggesting we just drop xmllib. Turn it into a wrapper for xmlproc unless it is faster or otherwise better as it is. My point is that we should ship xmlproc for the DTD stuff, so once we do that, what's the point of xmllib anymore? We can either think of it as a regex vs. re issue (deprecate one module for the other) or we could just consider xmllib a deprecated *interface* to xmlproc (there is probably a precedent for this, too, but I don't know it). 
> I don't really see a problem with that; we have to distribute > the .py files for DOM, SAX, and whatever else, so adding the xmlproc > files to the list isn't a big problem. It's sort of unsettling that > Python will ship with an XML parser that then won't be used at all by > the fancier XML tools, but I don't see a way around that, unless some > subset of the XML package becomes a standard part of Python. Unless they are large, they might as well all come with Python. Perhaps the Python library documentation would only document the most important classes, though. > An aside about DOM: I haven't been trying to focus the SIG's > interest to DOM up to this point, for two reasons. First, it's still > in the working draft stage. Secondly, SAX is much closer to being > frozen (has Megginson officially stamped his Java implementation as > 1.0 final?). Therefore Lars M. has been busier than Stefane... DOM > will also present some tricky technical problems, such as thread > safety. Yes. The DOM may also present performance problems. Python objects are a little heavyweight for large documents. Maybe someone will/should write a DOM in C. I wouldn't expect it to be that large or complicated. It would primarily shift the property lookup from bulky hash tables to svelte C code and (relatively) bulky PyObjects to C strings and pointers. > (It would be nice if some Java-related code could be in that > release, too. Paul, what's the status of your work with JPython?) I was waiting for the next SAX release, but now I'm in the final stages of a book. Keep me updated on your progress and I'll try to squeeze my stuff in before 1.0. I could do a tutorial and an example program. > Once that's done, hopefully some brave souls will start trying > to use SAX, so comments and bug reports will begin coming in. At the > same time, we can worry about JPython, about DOM, and about Unicode > support for Python. 
I'm not sure which should come first; perhaps DOM > can still wait, while we try to get a good solution for Unicode. On > the other hand, Unicode support won't be added as part of the 1.5 > development cycle, so it would have to wait until the next major > release of Python, and that's probably a long time off. Unicode should definately be higher priority than DOM, IMO. I would go so far as to say that Unicode should *determine* the date of the next release of Python (though I don't care if it is called Python 1.5.2 or Python 1.6). Maybe I'm naive about what it means and takes to release a new version, but what's the harm in perfecting the XML and Unicode stuff and releasing it as 1.6 in say, three months? It could even be a minor publicity event. If we were the first with a C implementation, we might have the fastest DOM implementation at the time of release. Paul Prescod - http://itrc.uwaterloo.ca/~papresco Can we afford to feed that army, while so many children are naked and hungry? Can we afford to remain passive, while that soldier-army is growing so massive? - "Gabby" Barbadian Calpysonian in "Boots" From larsga@step.de Fri May 8 09:33:21 1998 From: larsga@step.de (Lars Marius Garshol) Date: Fri, 08 May 1998 10:33:21 +0200 Subject: [XML-SIG] Re: saxlib 1.0beta References: <13646.5522.116243.65383@newcnri.cnri.reston.va.us> <354EBD1D.8AE2CED0@step.de> <13647.9836.230760.746014@newcnri.cnri.reston.va.us> <354F2AC1.95F6DD67@step.de> <13647.12442.68887.467809@newcnri.cnri.reston.va.us> <35519A3D.320176D9@technologist.com> <35519F13.DFF83B0B@step.de> <13649.51922.308079.993134@newcnri.cnri.reston.va.us> Message-ID: <3552C351.5DEEDBF3@step.de> Andrew Kuchling wrote: > > This reminds me of something else. DOM is going to need to > use the various drv_ files as well in order to support various > parsers. It seems redundant to write code for making SAX work with > Expat, and then have to write code for DOM via Expat. 
I sat down to write a DOM builder for xmlproc, but discovered there was no point. Stephane has been really smart about this: the base builder class uses a SAX interface so SAX drivers are automatically usable as DOM builders by using sax_builder as the SAX document handler. (The SAX builder converts character data from being a piece of a buffer to a single string, which must be done anyway.) > (DOM can't just sit on top of SAX because level 1 SAX doesn't provide > an interface for comments along, so you lose comments when you go > through SAX. This is bad for DOM-using applications that modify XML > documents.) Well, we don't preserve entity information either, or whitespace in tags, so writing an XML editor on top of this is probably not going to work anyway. In fact, all XML parsers I know of today throws away lots of lexical information. I've thought about adding this to xmlproc, but have so far refrained from it, since it would mean slowing it down and it would be a lot of work both to implement and to get the interfaces right. So I'm not sure it's worth the extra bother just to get comments into the DOM tree. And if we do decide we should have comments, then I think having non-standard SAX drivers is the way to go. Anyone else have an opinion on this? Or a use for comments in the DOM? Personally I think we should leave them out, for exactly the same reasons they were left out of SAX, unless someone can think of a convincing argument why editor-like applications can work without sufficient lexical information. > Sharing drivers would also let both SAX and DOM use an ESIS driver, or > anything else that gets written. That we can do already, since ESIS does not contain comments. (Although one can pick out entity boundaries from an ESIS stream, although not from the one generated by nsgmls, if I remember correctly.) > Therefore, should the xml.sax.drivers package be moved up a > level, to xml.drivers? 
If we think people will use the DOM without SAX I think we should do that, yes. One other thing is that I think it should be a little easier to make SAX and the DOM work together. Ideally there should be a function that lets you say

  make_dom("mydoc.xml") # would be really cool if it worked for
                        # .sgml? and .html? as well :-)

and gave you back a DOM Document object. (Behind the scenes the SAX ParserFactory should of course be used to get a parser driver.) Another thing I've been thinking of is to add some methods like

  get_parser_name() # xmllib, xmlproc, pyexpat or XML-Toolkit
  is_validating()   # Only drv_xmlproc_val so far
  reads_dtd()       # Will have to be defined carefully
  is_fast()         # Only pyexpat returns true here

to the SAX drivers. This would make the ParserFactory much more powerful and would be nice for other things as well. (Of course, only pyexpat would answer true to the last method.) Anyone against it, or who would prefer something different? --Lars M. From fredrik@pythonware.com Fri May 8 09:52:07 1998 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 8 May 1998 10:52:07 +0200 Subject: [XML-SIG] Re: saxlib 1.0beta Message-ID: <01bd7a5e$9467ff80$f29b12c2@panik.pythonware.com> >> Let me get your proposal clear; you're suggesting that we drop >> having a driver for using xmllib, and just use xmlproc or Expat, >> right? > >No. I guess I'm suggesting we just drop xmllib. Turn it into a wrapper for >xmlproc unless it is faster or otherwise better as it is. My point is that >we should ship xmlproc for the DTD stuff, so once we do that, what's the >point of xmllib anymore? fwiw, we're in the process of releasing our sgmllib/xmllib accelerator. the sgml part is complete; there's still some work to do on the xml stuff. GvR has expressed some interest in shipping this with 1.5.x or whatever. >We can either think of it as a regex vs.
re issue (deprecate one module >for the other) or we could just consider xmllib a deprecated *interface* >to xmlproc (there is probably a precedent for this, too, but I don't know >it). (gotta look at LarsM's SAX interface again, but if it's a direct translation of the Java API, xmllib surely has a more pythonesque feel, IMHO). >Yes. The DOM may also present performance problems. Python objects are a >little heavyweight for large documents. Maybe someone will/should write a >DOM in C. I wouldn't expect it to be that large or complicated. It would >primarily shift the property lookup from bulky hash tables to svelte C >code and (relatively) bulky PyObjects to C strings and pointers. we've recently done that, and as you say, it was pretty easy. haven't decided yet if/how to release it to the public. (it's not DOM on the C interface level, but I don't think adding a DOM-like Python wrapper would be difficult). Cheers /F fredrik@pythonware.com http://www.pythonware.com From fredrik@pythonware.com Fri May 8 10:03:23 1998 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 8 May 1998 11:03:23 +0200 Subject: [XML-SIG] Yet another stupid XML question Message-ID: <01bd7a60$27cdf4e0$f29b12c2@panik.pythonware.com> Paul wrote: >Unicode should definately be higher priority than DOM, IMO. I would go so >far as to say that Unicode should *determine* the date of the next release >of Python (though I don't care if it is called Python 1.5.2 or Python >1.6). Maybe I'm naive about what it means and takes to release a new >version, but what's the harm in perfecting the XML and Unicode stuff and >releasing it as 1.6 in say, three months? It could even be a minor >publicity event. If we were the first with a C implementation, we might >have the fastest DOM implementation at the time of release. 
Which reminds me of one thing: when I first read the XML specification, I came under the impression that you can determine whether a document uses 8/16/32-bit characters by looking at the first bytes. But I've recently seen a few references that seem to claim that you can also change character sets for each new element. Could anyone sort this out for me? Confused, as usual /F From larsga@step.de Fri May 8 10:13:20 1998 From: larsga@step.de (Lars Marius Garshol) Date: Fri, 08 May 1998 11:13:20 +0200 Subject: [XML-SIG] Yet another stupid XML question References: <01bd7a60$27cdf4e0$f29b12c2@panik.pythonware.com> Message-ID: <3552CCB0.7881E957@step.de> Fredrik Lundh wrote: > > Which reminds me of one thing: when I first read the XML specification, > I came under the impression that you can determine whether a document > uses 8/16/32-bit characters by looking at the first bytes. Sort of. For entities not in UTF-8 or -16 you can do this. Distinguishing between UTF-8 and -16 should also be simple. (Appendix F of the spec explains this.) > But I've recently seen a few references that seem to claim that you > can also change character sets for each new element. That's wrong, but maybe you/they think of/mean entities? When &external_entity; refers to an external entity there's no constraint that the external entity be in the same character set as the referring entity, which is why external entities can have their own XML declaration (the spec calls it a text declaration). xmlproc currently does not handle text declarations correctly, but it will. --Lars M. 
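[Editorial sketch, not part of the original thread.] The first-bytes check that Lars points to in Appendix F of the XML 1.0 specification can be illustrated roughly as follows. This is a minimal sketch in present-day Python, not code from xmlproc or any other parser discussed here; the function name guess_encoding_family and the returned labels are hypothetical, and only the common byte patterns from the 1998 appendix are covered (the unusual UCS-4 byte orders are left out).

```python
# Sketch only: byte patterns follow Appendix F of the XML 1.0 spec (1998).
# The function name and the labels it returns are made up for illustration.

# (prefix, label) pairs, tried in order. The four-byte patterns come first
# so that the longer, more specific patterns are checked before the
# two-byte byte order marks.
_SIGNATURES = [
    (b"\x00\x00\x00\x3c", "UCS-4-BE"),                  # 32-bit, big-endian
    (b"\x3c\x00\x00\x00", "UCS-4-LE"),                  # 32-bit, little-endian
    (b"\x00\x3c\x00\x3f", "UTF-16-BE"),                 # 16-bit, no byte order mark
    (b"\x3c\x00\x3f\x00", "UTF-16-LE"),                 # 16-bit, no byte order mark
    (b"\x3c\x3f\x78\x6d", "8-bit, ASCII-compatible"),   # "<?xm"
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),                    # "<?xm" in EBCDIC
    (b"\xfe\xff", "UTF-16-BE (BOM)"),
    (b"\xff\xfe", "UTF-16-LE (BOM)"),
]


def guess_encoding_family(first_bytes):
    """Return a rough label for the character width/family of an entity.

    For the 8-bit and EBCDIC cases the parser still has to read the
    encoding declaration to learn the exact encoding; this only tells it
    how to read that declaration.
    """
    for prefix, label in _SIGNATURES:
        if first_bytes.startswith(prefix):
            return label
    # No recognizable signature: the spec's default is UTF-8.
    return "UTF-8 (default)"


if __name__ == "__main__":
    print(guess_encoding_family(b"<?xml version='1.0'?><doc/>"))
```

Because each external entity may carry its own text declaration, a parser would run a check like this once per entity rather than once per document, which matches the point made above about entities (not elements) being the unit at which the encoding can change.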
From larsga@step.de Fri May 8 10:14:19 1998 From: larsga@step.de (Lars Marius Garshol) Date: Fri, 08 May 1998 11:14:19 +0200 Subject: [XML-SIG] Re: saxlib 1.0beta References: <13646.5522.116243.65383@newcnri.cnri.reston.va.us> <354EBD1D.8AE2CED0@step.de> <13647.9836.230760.746014@newcnri.cnri.reston.va.us> <354F2AC1.95F6DD67@step.de> <13647.12442.68887.467809@newcnri.cnri.reston.va.us> <35519A3D.320176D9@technologist.com> <13649.48773.765303.435308@newcnri.cnri.reston.va.us> <3551E8D5.8ACFAC2B@technologist.com> Message-ID: <3552CCEB.80FABB94@step.de> Paul Prescod wrote: > > No. I guess I'm suggesting we just drop xmllib. Turn it into a wrapper for > xmlproc unless it is faster or otherwise better as it is. It's not faster. A speed test on my NT box here in Germany gave these results (in seconds) when parsing quran.xml (1 MB) with saxtimer.py:

  xmlproc      118
  xmlproc_val  142
  xmllib       148
  xml-toolkit  451

I've since added attribute whitespace normalization to xmlproc, which slowed the non-validating driver down to 123 seconds. The effect on the validating one will depend on the DTD used, since it has to do extra work for all non-CDATA attributes. > The DOM may also present performance problems. Python objects are a > little heavyweight for large documents. Maybe someone will/should write a > DOM in C. I wouldn't expect it to be that large or complicated. It would > primarily shift the property lookup from bulky hash tables to svelte C > code and (relatively) bulky PyObjects to C strings and pointers. Having the choice between Python/C implementations, just like with the parsers sounds very good to me. (Something for the wishlist?) Apropos wishlist: I was bored two nights ago and started an XPointer implementation. I now have a parser and have implemented significant parts of a DOM XPointer implementation. (Basically all relative location terms, with candidate counting, element type filtering and attribute filtering.)
Next week will probably see a string of new releases from me: xmlproc 0.31, the XPointer thingy and quite possibly saxlib as well. (No, the official SAX version is not set in stone yet and I'll wait for that before I release saxlib.) This means that the XLL-implementation can be taken off the wishlist and put on the deliverables list. --Lars M. From larsga@step.de Fri May 8 10:20:04 1998 From: larsga@step.de (Lars Marius Garshol) Date: Fri, 08 May 1998 11:20:04 +0200 Subject: [XML-SIG] Re: saxlib 1.0beta Message-ID: <3552CE44.4BBA85DC@step.de> Fredrik Lundh wrote: > > fwiw, we're in the process of releasing our sgmllib/xmllib accelerator. > the sgml part is complete; there's still some work to do on the xml stuff. > GvR has expressed some interest in shipping this with 1.5.x or > whatever. What's this? More details, please! :) > (gotta look at LarsM's SAX interface again, but if it's a direct translation > of the Java API, xmllib surely has a more pythonesque feel, IMHO). The only real change is that AttributeList supports [] and len() access. > [DOM in C] > > we've recently done that, and as you say, it was pretty easy. haven't > decided yet if/how to release it to the public. Why not? It would certainly be welcome. --Lars M. From fredrik@pythonware.com Fri May 8 11:06:53 1998 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 8 May 1998 12:06:53 +0200 Subject: [XML-SIG] Re: saxlib 1.0beta Message-ID: <01bd7a69$064b6100$f29b12c2@panik.pythonware.com> Lars wrote: >> fwiw, we're in the process of releasing our sgmllib/xmllib accelerator. >> the sgml part is complete; there's still some work to do on the xml stuff. >> GvR has expressed some interest in shipping this with 1.5.x or >> whatever. >What's this? More details, please! :) It's an incremental parser written in C, which calls various "handle" methods on a provided Python instance. Works pretty much like strop; if you have the module, sgmllib/xmllib will run much faster.
The module is small (the current Win32 DLL is 7680 bytes) and pretty fast. >> we've recently done that, and as you say, it was pretty easy. haven't >> decided yet if/how to release it to the public. > >Why not? It would certainly be welcome. Gotta speak with the boss first ;-) >[xmllib] is not faster. A speed test on my NT box here in Germany gave >these results (in seconds) when parsing quran.xml (1 MB) with saxtimer.py:
>
>xmlproc      118
>xmlproc_val  142
>xmllib       148
>xml-toolkit  451

Early tests with xmllib+sgmlop show that it's about 5 times faster than pure xmllib (the core parser itself is extremely fast; nearly 10 MB/s on a P2/333 if you don't bother to call any Python callbacks...). Should be possible to make it a bit faster without too much work. Don't have any data yet on the "DOM" stuff (and writing this makes me realize that I should probably change things so that sgmlop can talk directly to the tree builder... if I could only find the time...) Cheers /F fredrik@pythonware.com http://www.pythonware.com From fredrik@pythonware.com Fri May 8 11:18:42 1998 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 8 May 1998 12:18:42 +0200 Subject: [XML-SIG] Yet another stupid XML question Message-ID: <01bd7a6a$ad137ee0$f29b12c2@panik.pythonware.com> >> Which reminds me of one thing: when I first read the XML specification, >> I came under the impression that you can determine whether a document >> uses 8/16/32-bit characters by looking at the first bytes. > >Sort of. For entities not in UTF-8 or -16 you can do this. Distinguishing >between UTF-8 and -16 should also be simple. (Appendix F of the spec >explains this.) So to rephrase my question: based on the first few bytes, you should be able to tell if the file contains 8-bit, 16-bit or 32-bit characters? >> But I've recently seen a few references that seem to claim that you >> can also change character sets for each new element. > >That's wrong, but maybe you/they think of/mean entities? Nope.
They mentioned 'elements'. Looks like they were wrong (the amount of hype surrounding XML is starting to eclipse that of Java; I've even seen people talking about writing programs in XML ;-) >When > > &external_entity; > >refers to an external entity there's no constraint that the external >entity be in the same character set as the referring entity, which is >why external entities can have their own XML declaration (the spec >calls it a text declaration). Sounds reasonable. Thanks /F From larsga@step.de Fri May 8 11:52:42 1998 From: larsga@step.de (Lars Marius Garshol) Date: Fri, 08 May 1998 12:52:42 +0200 Subject: [XML-SIG] Re: Yet another stupid XML question References: <01bd7a6a$ad137ee0$f29b12c2@panik.pythonware.com> Message-ID: <3552E3FA.EDB11558@step.de> Fredrik Lundh wrote: > > So to rephrase my question: based on the first few bytes, you > should be able to tell if the file contains 8-bit, 16-bit or 32-bit > characters? Yes. > Nope. They mentioned 'elements'. Looks like they were wrong (the > amount of hype surrounding XML is starting to eclipse that of Java; Good. Let's just hope it's not all wrong. :) It's beginning to look as though XML is going to lead a lot of people into SGML. I see companies that should have picked up SGML years ago now beginning to understand the point of XML. Many of them will of course stay with XML, but many are also better served by SGML. > I've even seen people talking about writing programs in XML ;-) There are a couple of applications that generate stub code from XML files already. IMHO that's sensible although full programming would of course be ...uhmm... somewhat awkward. --Lars M. 
From larsga@step.de Fri May 8 11:57:42 1998 From: larsga@step.de (Lars Marius Garshol) Date: Fri, 08 May 1998 12:57:42 +0200 Subject: [XML-SIG] Re: saxlib 1.0beta References: <01bd7a69$064b6100$f29b12c2@panik.pythonware.com> Message-ID: <3552E526.FAF3DE14@step.de> Fredrik Lundh wrote: > > It's an incremental parser written in C, which calls various "handle" > methods on a provided Python instance. Works pretty much like > strop; if you have the module, sgmllib/xmllib will run much faster. > The module is small (the current Win32 DLL is 7680 bytes) and > pretty fast. Great! Just a pity we didn't hear of it before, since it seems to do pretty much the same as pyexpat. However, I have one request: can you make it so that it's easy to tell which *mllib is loaded? This is because the SAX ParserFactory will prefer xmllib if it's the C version, but xmlproc if it's the Python version. > Don't have any data yet on the "DOM" stuff (and writing this makes > me realize that I should probably change things so that sgmlop can > talk directly to the tree builder... if I could only find the > time...) Hmmm. The future of Python and XML suddenly looked a lot brighter. Let's hope you find some time soon. --Lars M. From Jack.Jansen@cwi.nl Fri May 8 12:46:09 1998 From: Jack.Jansen@cwi.nl (Jack Jansen) Date: Fri, 08 May 1998 13:46:09 +0200 Subject: [XML-SIG] Bits of XML HOWTO available In-Reply-To: Message by Andrew Kuchling , Thu, 7 May 1998 10:37:14 -0400 (EDT) , <13649.50767.270686.606613@newcnri.cnri.reston.va.us> Message-ID: > The Expat module is another thing I was wondering about; is it > essentially finished, or does it still need some work? Last night I > took a look at the module and at Expat's interface, and there didn't > seem to be anything significant missing from the extension, but ISTR > you once said that it still needed to be completed. Well, as I haven't used it heavily yet I'm not _certain_ it is finished, that's probably what I meant to say.
-- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen@cwi.nl | ++++ if you agree copy these lines to your sig ++++ http://www.cwi.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From fredrik@pythonware.com Fri May 8 12:47:45 1998 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 8 May 1998 13:47:45 +0200 Subject: [XML-SIG] Re: Yet another stupid XML question Message-ID: <01bd7a77$1df90d80$f29b12c2@panik.pythonware.com> >It's beginning to look as though XML is going to lead a lot of people >into SGML. I see companies that should have picked up SGML years ago >now beginning to understand the point of XML. Many of them will of >course stay with XML, but many are also better served by SGML. Sounds reasonable. It's the usual problem: at first, you think your problem is trivial, and that the 500-page ISO specification is far too complicated (and since it's from the eighties, it's probably obsolete anyway), so you go for an ad-hoc solution. When you've added enough requirements to your application, you end up with something that is incredibly messy, extremely complicated, and doesn't really work... I've seen it over and over again, with SGML, various 2D and 3D graphics standards, image file formats, etc. etc. At least this keeps this industry going. Things like Python, XML, Lisp, and Java make things too easy (in a recent case, we spent more time discussing the contract than on designing, writing, and testing the program...). >> I've even seen people talking about writing programs in XML ;-) > >There are a couple of applications that generate stub code from >XML files already. Sure. Opal uses XML to store project descriptions, UI designs, and structured Python code. Among other things. But the semantics are defined by editors and code generators. XML is just a syntax (1), and we could pickle everything instead. >IMHO that's sensible although full programming would of course >be ...uhmm... somewhat awkward.
Well, I still do most of my programming in ASCII ;-) Cheers /F 1) see http://www.scripting.com/frontier5/xml/Updates/chuckHasDoubts.html From Jack.Jansen@cwi.nl Fri May 8 13:04:15 1998 From: Jack.Jansen@cwi.nl (Jack Jansen) Date: Fri, 08 May 1998 14:04:15 +0200 Subject: [XML-SIG] Re: saxlib 1.0beta In-Reply-To: Message by Paul Prescod , Thu, 07 May 1998 13:01:09 -0400 , <3551E8D5.8ACFAC2B@technologist.com> Message-ID: > Andrew Kuchling wrote: > > > > Let me get your proposal clear; you're suggesting that we drop > > having a driver for using xmllib, and just use xmlproc or Expat, > > right? > > No. I guess I'm suggesting we just drop xmllib. I think this is a bad idea: xmllib is the only 100% python solution. Even if the sax/dom stuff isn't part of the standard distribution any user wanting to experiment with xml will just have to grab the sax/dom modules over the net and off they go. This is a lot simpler than if they first have to compile extension modules, etc. Of course, this all changes if any of the XML parsers gets incorporated in the Python core, but even then I think keeping xmllib as a fallback method is a good idea.
-- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen@cwi.nl | ++++ if you agree copy these lines to your sig ++++ http://www.cwi.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From akuchlin@cnri.reston.va.us Fri May 8 14:43:19 1998 From: akuchlin@cnri.reston.va.us (Andrew Kuchling) Date: Fri, 8 May 1998 09:43:19 -0400 (EDT) Subject: [XML-SIG] Re: saxlib 1.0beta In-Reply-To: <3551E8D5.8ACFAC2B@technologist.com> References: <13646.5522.116243.65383@newcnri.cnri.reston.va.us> <354EBD1D.8AE2CED0@step.de> <13647.9836.230760.746014@newcnri.cnri.reston.va.us> <354F2AC1.95F6DD67@step.de> <13647.12442.68887.467809@newcnri.cnri.reston.va.us> <35519A3D.320176D9@technologist.com> <13649.48773.765303.435308@newcnri.cnri.reston.va.us> <3551E8D5.8ACFAC2B@technologist.com> Message-ID: <13651.2459.406639.511690@newcnri.cnri.reston.va.us> Paul Prescod writes: >Yes. The DOM may also present performance problems. Python objects are a >little heavyweight for large documents. Maybe someone will/should write a >DOM in C. I wouldn't expect it to be that large or complicated. It would One complication might be annotation of the tree; you might want to operate on a tree and add your own attributes. That can be handled in C easily enough (a dictionary object that gets created only if you actually use it), but perhaps some different interface should be used. (I'd hope not, though; 'node.myAttr = 42' would be so natural.) >I was waiting for the next SAX release, but now I'm in the final stages of >a book. Keep me updated on your progress and I'll try to squeeze my stuff >in before 1.0. I could do a tutorial and an example program. 1.0 is still a way off, I think; with just SAX, I'd call it version 0.6 or .7 or thereabouts. (Or do people think it should be 1.0, and DOM/other stuff would be 1.1 or 2.0? Feel free to disagree with me...) Concentrate on your book, because I want to get a copy as soon as possible. :) >1.6). 
Maybe I'm naive about what it means and takes to release a new >version, but what's the harm in perfecting the XML and Unicode stuff and >releasing it as 1.6 in say, three months? It could even be a minor That's Guido's decision. Personally, I want to see Unicode in the next Python, and also do some major work on PCRE, which unfortunately keeps getting buried deeper on my stack--ARGH! (My project page says "Work should start in April"--yeah, right.) The Unicode work thus far has focused on a wide character string type, and making it seamlessly work with regular strings (at least at the Python level)--now I'm starting to wonder if using UTF-8 for everything would be better. That's a String-SIG problem... -- A.M. Kuchling http://starship.skyport.net/crew/amk/ The lecturer should give the audience full reason to believe that all his powers have been exerted for their pleasure and instruction. -- Michael Faraday From akuchlin@cnri.reston.va.us Fri May 8 14:59:50 1998 From: akuchlin@cnri.reston.va.us (Andrew Kuchling) Date: Fri, 8 May 1998 09:59:50 -0400 (EDT) Subject: [XML-SIG] Re: saxlib 1.0beta In-Reply-To: <3552C351.5DEEDBF3@step.de> References: <13646.5522.116243.65383@newcnri.cnri.reston.va.us> <354EBD1D.8AE2CED0@step.de> <13647.9836.230760.746014@newcnri.cnri.reston.va.us> <354F2AC1.95F6DD67@step.de> <13647.12442.68887.467809@newcnri.cnri.reston.va.us> <35519A3D.320176D9@technologist.com> <35519F13.DFF83B0B@step.de> <13649.51922.308079.993134@newcnri.cnri.reston.va.us> <3552C351.5DEEDBF3@step.de> Message-ID: <13651.3299.270729.883225@newcnri.cnri.reston.va.us> Lars Marius Garshol writes: >Well, we don't preserve entity information either, or whitespace in >tags, so writing an XML editor on top of this is probably not going to >work anyway. In fact, all XML parsers I know of today throws away Hm. IIRC, the DOM code does let you include comment objects in a tree that you build; perhaps that's enough. 
All right; we can let this slide for the moment, and worry about writing XML editors later. >> Sharing drivers would also let both SAX and DOM use an ESIS driver, or >> anything else that gets written. >That we can do already, since ESIS does not contain comments. (Although >one can pick out entity boundaries from an ESIS stream, although not >from the one generated by nsgmls, if I remember correctly.) Hm, again. My concern with layering DOM always on top of SAX is performance; will the extra layer of calls cost very much? The problem is avoiding writing O(n**2) drivers, of course, connecting n parsers to n different user-level APIs. What does everyone think about this? >If we think people will use the DOM without SAX I think we should do >that, yes. Certainly we'll provide various helper functions which perform the most common tasks of reading from a file object, and they may well use SAX internally to build a DOM tree, or whatever. The marshal module I posted a while back uses both SAX and DOM, for example. > get_parser_name() # xmllib, xmlproc, pyexpat or XML-Toolkit > is_validating() # Only drv_xmlproc_val so far > reads_dtd() # Will have to be defined carefully > is_fast() # Only pyexpat returns true here Good idea. One formatting question: should method names be in the words_separated_by_underscores style, or in the mixedCaseStudlyCaps variety? SAX seems to follow the studlycaps route. -- A.M. Kuchling http://starship.skyport.net/crew/amk/ The purpose of the present course is the deepening and development of difficulties underlying contemporary theory... -- A. A. 
Blasov From papresco@technologist.com Fri May 8 16:49:25 1998 From: papresco@technologist.com (Paul Prescod) Date: Fri, 08 May 1998 11:49:25 -0400 Subject: [XML-SIG] Re: saxlib 1.0beta References: Message-ID: <35532985.D4A6E7F2@technologist.com> Jack Jansen wrote: > > Of course, this all changes if any of the XML parsers gets incorporated in the > Python core, but even then I think keeping xmllib as a fallback method is a > good idea. As a fallback, it's fine. As the default, "standard", I don't like it. It isn't as powerful, it isn't much simpler, and it is idiosyncratic. Methods like setnomoretags, setliteral, translate_references, handle_doctype, handle_entityref, (debatably) handle_comment, handle_cdata, handle_special, unknown_starttag and unknown_endtag made sense in the context of a tool specifically for Grail, but not much in the context of generalized XML processing. XML has no equivalent to some of them, and the parser should handle others "all by itself". So I don't think that the interface is intuitive to an XML hacker. And the implementation needs a bunch of work also, because again it is a nice tweak of code that was fundamentally designed for Grail. If SAX isn't Python-esque enough, then I would prefer to touch it up and create a delegating wrapper for handling non-Python parsers. Then we will have the best of both worlds: XML optimized and Python optimized. Paul Prescod - http://itrc.uwaterloo.ca/~papresco Can we afford to feed that army, while so many children are naked and hungry? Can we afford to remain passive, while that soldier-army is growing so massive? - "Gabby" Barbadian Calpysonian in "Boots" From larsga@ifi.uio.no Thu May 14 08:44:38 1998 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 14 May 1998 09:44:38 +0200 Subject: [XML-SIG] saxlib 1.0beta released Message-ID: I've just released saxlib 1.0beta.
This version has a number of bugfixes to the previous snapshot and has a new home page at Please note that the xmlproc drivers are for version 0.40 and do _not_ work with the current version. I hope to have 0.40 out this week. The release plan is to put out a beta2 with several new drivers as well as some new extensions quite soon, and then some time after that I'll expect the final version. -- "These are, as I began, cumbersome ways / to kill a man. Simpler, direct, and much more neat / is to see that he is living somewhere in the middle / of the twentieth century, and leave him there." -- Edwin Brock http://www.stud.ifi.uio.no/~larsga/ http://birk105.studby.uio.no/ From larsga@ifi.uio.no Sun May 17 18:00:17 1998 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 17 May 1998 19:00:17 +0200 Subject: [XML-SIG] xmlproc 0.40 Message-ID: I just released xmlproc 0.40. It now implements nearly all of the XML 1.0 specification (known deviations are listed on the home page), some bugs have been removed, and DTD information is now accessible in several ways. A speedup release (0.41) with some bug fixes and improved conformance will hopefully arrive within a month. The URL is still: -- "These are, as I began, cumbersome ways / to kill a man. Simpler, direct, and much more neat / is to see that he is living somewhere in the middle / of the twentieth century, and leave him there." -- Edwin Brock http://www.stud.ifi.uio.no/~larsga/ http://birk105.studby.uio.no/ From fredrik@pythonware.com Mon May 18 13:22:13 1998 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 18 May 1998 13:22:13 +0100 Subject: [XML-SIG] sgmlop snapshot released Message-ID: <01bd8257$96a0ad00$f29b12c2@panik.pythonware.com> I've just uploaded a development snapshot of sgmlop (the sgmllib/xmllib accelerator plugin) to: http://www.pythonware.com/madscientist/ sgmlop is a fast replacement for the regular expression-based parsers used in the sgmllib/htmllib and xmllib modules. 
A single module supports both SGML and XML. sgmlop is currently about 5 times faster than the original re-based implementation provided with Python 1.5. this snapshot includes source code and a precompiled DLL for Python 1.5. building the library on Unix and other platforms should be straightforward. Enjoy /F fredrik@pythonware.com http://www.pythonware.com From dkuhlman@enterpriselink.com Mon May 18 19:43:38 1998 From: dkuhlman@enterpriselink.com (Dave Kuhlman) Date: Mon, 18 May 1998 11:43:38 -0700 Subject: [XML-SIG] Re: XML Filesystem References: Message-ID: <3560815A.766C9A3A@EnterpriseLink.com> Please let me add my support and requests along these lines -- To make XML most useful to me, it would provide the following: 1. An easy, seamless way to convert XML text files into and out of a light-weight object-oriented database. This database would have just enough OO functionality to support the XML DOM. 2. I would be able to perform this conversion (a) from the command line, (b) using a GUI utility, and (c) programmatically. Anyone could, for example, extract chunks out of the database so as to produce XML text to be fed to an XML capable Web browser (Mozilla 5.0, I'm hoping) or to an XML capable JavaBean or to ... 3. I would have an API that was exactly (not just closely) the same as the DOM API for XML. This API would enable me to write an application to access and modify XML documents/objects "in situ", i.e. "in place", within the database without reading in and writing out an entire (text) document. 4. The implementation of the light-weight database would be Open Source; I could distribute it anywhere; and I could compile it into my C/C++ application. I could include a package that supports it into my Java application. 5. My favorite scripting languages (Perl, Python, and Tcl) would all come with support for this light-weight database and the XML DOM API that enables me to use it. 
(Maybe I would have to compile this support into my language of choice as an optional module.) 6. There would also be some support for storing multiple XML documents in a database in an hierarchical directory structure within the database. I'm not sure what parts of this support, if any, needs to be at the operating system level, nor at what level in the operating system. That's partially because I'm not particularly well informed about operating system design. But, it's also because I've been following the Perl and Python XML mailing lists (you notice that I cross posted to the Python XML SIG), and I like very much the direction that Perl and Python XML support is going. Providing this support at the language level makes good sense to me, especially if it is based on code that is a bit more general than a specific language, as the use of James Clark's expat is, and maybe a light-weight OODB could be. If XML does become the "next big thing", I believe that the XML support in Perl and Python will help it get there. Thanks lots. I apologize in advance for the verbogosity. I've been tossing and turning at night over this for some time, now. Dave Matt Sergeant wrote: > > Hi, > > I have the opportunity of influencing a new operating system, and I > would like everyone on this list to give their opinion on this. My idea > is to have XML support at the system services level (one layer above the > kernel), as well as support for ordinary files. New files should be > created in XML format as preference (obviously except those that are for > a specific purpose for another platform, eg zip files). The dtd would be > able to describe to the system the operations that can be performed on > the file, and possibly even how to perform those operations. > > This system would have to be able to cope with binary files too (I > assume this is not a problem). 
> > One of the advantages of this would be that there are now two levels of > file system corruption - at the file system itself, and also when a file > is not well formed + valid. > > Please give your opinions so that I don't go advocating this elsewhere > if it's not a good idea. > > Matt. > > -- > Fastnet Software Ltd. Perl Consultant. Web Development. > See: http://www.fastnetltd.ndirect.co.uk for more details > Also Perl-Win32 Database and ASP FAQ's and modules: > http://www.geocities.com/SiliconValley/Way/6278 > > ..................................... > To leave this list, send an email message to ListManager@ActiveState.com > with the following text in the body: Unsubscribe Perl-XML > For non-automated Mailing List support, send email to ListHelp@ActiveState.com -- Dave Kuhlman EnterpriseLink Technology Corp http://www.enterpriselink.com 2542 S. Bascom Ave., Suite #203 Campbell, CA 95008 dkuhlman@EnterpriseLink.com 408-558-2011 From ken@bitsko.slc.ut.us Tue May 19 04:11:32 1998 From: ken@bitsko.slc.ut.us (Ken MacLeod) Date: 18 May 1998 22:11:32 -0500 Subject: [XML-SIG] Re: XML Filesystem In-Reply-To: dkuhlman@enterpriselink.com's message of Mon, 18 May 1998 11:43:38 -0700 References: <3560815A.766C9A3A@EnterpriseLink.com> Message-ID: dkuhlman@enterpriselink.com (Dave Kuhlman) writes: > Please let me add my support and requests along these lines -- To > make XML most useful to me, it would provide the following: [please refer to original message, a mere snip just won't do] Several of the features you mention are similar to what we're trying to do in the Casbah project . Casbah is intended to be a language agnostic scripting environment with a persistent store and APIs for communicating using CORBA or similar remote object calls. 
Casbah features a hierarchical persistent store (think nested dictionaries and lists, with the dictionaries able to take a ``class'' attribute), a language agnostic runtime(s) closely integrated to the store, a GUI front-end, and HTTP, CORBA, and CORBA-like APIs for communicating between parts of the system. The hierarchy is actually distributed, we're using a design patterned after virtual file systems -- multiple persistent stores can be ``mounted'' in your namespace, as well as web and ftp servers, virtual drivers, the local filesystem, etc. In addition to using XML extensively for interfacing, I'm also working on having XML object classes similar to what XML::Grove has. Basically you could have a directory in the hierarchy that contains XML documents or fragments underneath it just by virtue of their being XML objects. By coincidence, I'm hoping sometime in the next few weeks to get an example ``virtual driver'' that reads an XML document using XML::Grove and make it available via the Casbah API. -- Ken From Jack.Jansen@cwi.nl Tue May 19 10:09:57 1998 From: Jack.Jansen@cwi.nl (Jack Jansen) Date: Tue, 19 May 1998 11:09:57 +0200 Subject: [XML-SIG] Re: XML Filesystem In-Reply-To: Message by dkuhlman@enterpriselink.com (Dave Kuhlman) , Mon, 18 May 1998 11:43:38 -0700 , <3560815A.766C9A3A@EnterpriseLink.com> Message-ID: I'm not sure I see the advantage of having XML-support in the kernel. When I was still in OS design the thing we always did was move everything out of the kernel unless it was absolutely vital that it was in. This was mainly a question of moving things from the microkernel to user-level OS services, but the same argument holds to a lesser extent for the choice of putting functionality in OS services or application libraries. When something is in the OS it becomes more difficult to maintain, and moreover applications can't decide to use a newer version, etc. 
I'd say the only reason to put something in the kernel is if this gives you functionality not available otherwise. An example would be access control on a more fine-grained level than a file ("you're allowed to read all the

sections in this document, but we won't give you the

s") or concurrency control (allowing multiple applications to modify different parts of the same document at the same time). I'm not sure, however, that any of these would be applicable to the area of XML... -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen@cwi.nl | ++++ if you agree copy these lines to your sig ++++ http://www.cwi.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From Jack.Jansen@cwi.nl Tue May 19 12:39:10 1998 From: Jack.Jansen@cwi.nl (Jack Jansen) Date: Tue, 19 May 1998 13:39:10 +0200 Subject: [XML-SIG] Pyexpat (formerly xmltok) released Message-ID: I've built a release of pyexpat, the new name for the wrapper module around James Clarks expat module (formerly known as xmltok). Aside from the name change there are only very minor changes. The distribution is available in two flavors: ftp://ftp.cwi.nl/pub/jack/python/pyexpat.tgz - Gzipped tarfile with source to the module and expat. ftp://ftp.cwi.nl/pub/jack/python/pyexpat.hqx - BinHexed Stuffit archive with compiled plugin modules for the Mac, Python 1.5.1. -- Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++ Jack.Jansen@cwi.nl | ++++ if you agree copy these lines to your sig ++++ http://www.cwi.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm From fredrik@pythonware.com Wed May 20 10:41:18 1998 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 20 May 1998 10:41:18 +0100 Subject: [XML-SIG] New sgmlop snapshot released Message-ID: <01bd83d3$7056b270$f29b12c2@panik.pythonware.com> I've just uploaded an updated sgmlop development snapshot to: http://www.pythonware.com/madscientist/ news: XML support through xmllib is now up and running. modified versions of both sgmllib and xmllib are included in this release. sgmlop is currently about 5 times faster than the original versions (both SGML and XML). 
judging from tests with the raw sgmlop parser, it should be possible to speed things up another 3-5 times with the current API, and even more when sgmlop is combined with our (yet to be released) tree builder. Enjoy /F fredrik@pythonware.com http://www.pythonware.com From akuchlin@cnri.reston.va.us Wed May 20 20:39:39 1998 From: akuchlin@cnri.reston.va.us (Andrew Kuchling) Date: Wed, 20 May 1998 15:39:39 -0400 (EDT) Subject: [XML-SIG] First cut at SAX tutorial Message-ID: <13667.11456.431157.909770@newcnri.cnri.reston.va.us> I've added some hastily-written tutorial material about SAX to the current draft of the HOWTO; see http://www.python.org/doc/howto/xml/ . No reference material yet, but I'll get around to it in the next day or two, hopefully, and then revise the SAX tutorial; I want SAX documented reasonably well before the first test release. Since many users will pattern their code on what they see in the tutorial, it's important that it not give them any bad habits, so please be merciless in reporting things like suboptimal implementation, or errors in terminology. The organization isn't very optimal, but I'm not sure what the best organization is. It begins with startElement(), digresses to talk about error handling, and then covers characters() and stopElement(), which will probably be enough for many people. Probably there should be another subsection on other methods in the DocumentHandler interface: endDocument() is the most important, since ignorableWhitespace() and processingInstruction() seem most useful to advanced users, who probably don't need a tutorial, and setDocumentLocator() is for parser writers. I'm not sure if there should be a tutorial section on the DTDHandler interface; again, that seems fairly advanced. Something I'm not sure of: are there any cases where the user has to perform entity substitution themselves, such as turning é into the right character, or would any such XML parser be considered broken? 
(For example, what if it's not a standalone document, and the parser doesn't read the DTD. Wondering if I need to document how to do that...) -- A.M. Kuchling http://starship.skyport.net/crew/amk/ Old friend, in my mind only do I write you this letter, but it is a splendid letter, with perfect brushwork. Old hands do not shake or cramp when the letter is written on the air. -- Master Li, in SANDMAN #74, "The Exile" From akuchlin@cnri.reston.va.us Wed May 20 21:58:38 1998 From: akuchlin@cnri.reston.va.us (Andrew Kuchling) Date: Wed, 20 May 1998 16:58:38 -0400 (EDT) Subject: [XML-SIG] What's in the package? Message-ID: <13667.17308.510402.524291@newcnri.cnri.reston.va.us> I'm going to start working on packaging the XML software into a single distribution, and am wondering about what exactly should go into it. The candidates are: 1 saxlib 2 xmlproc 3 DOM 4 pyexpat extension, plus the source code for Expat 5 sgmlop extension 6 xmllib.py modified to use sgmlop 1, definitely. 2,4, and 5, probably. #6, the sgmlop-aware xmllib.py, would be good, but I'm not sure where to install it. Overwriting the existing xmllib.py is evil; perhaps as xml.parser.xmllib? I'm not sure about DOM; it's not a standard API yet, and I don't know how closely it matches the current working draft. (Haven't had time to check.) On the other hand, I wouldn't mind including it without documentation, and with a warning about the non-existence of a standard. -- A.M. Kuchling http://starship.skyport.net/crew/amk/ What our ancestors would really be thinking, if they were alive today, is: "Why is it so dark in here?" -- Terry Pratchett, _Pyramids_ From akuchlin@cnri.reston.va.us Fri May 22 15:12:47 1998 From: akuchlin@cnri.reston.va.us (Andrew Kuchling) Date: Fri, 22 May 1998 10:12:47 -0400 (EDT) Subject: [XML-SIG] Putting the pieces together... Message-ID: <13669.33812.859425.877215@newcnri.cnri.reston.va.us> I've made a crude first cut at a distribution which packages all the XML software together. 
"make install" doesn't work yet, because it looks to be fairly complicated, but you should be able to unpack it on a Unix system, type "make" and have everything compile itself neatly. This will probably be broken on some systems; please try it out and report problems. The distribution includes pretty much everything we have: saxlib, xmlproc, sgmlop, pyexpat, and the DOM code. It's at: http://www.python.org/sigs/xml-sig/files/xml-package.tgz Things that are missing: * As mentioned, "make install" doesn't work. * The documentation isn't included yet; I want to include the HOWTO in various forms, and have to write a Makefile target to build them all. * Should xml.sax.saxutils.ErrorPrinter go to stdout and not stderr? Should Canonizer be renamed to Canonicalizer? * Where should the C extension modules for sgmlop and pyexpat be installed? Inside the xml package tree, or simply where all the other dynamically loadable modules go? * What should the various __init__.py files do? Put another way, when you do "import xml", what does dir(xml) contain? * demo/ directory needs to be documented * Test suite needs to be written. * What are the licence terms for all the components? What should the licence be for the whole distribution? (Python-style?) * The documentation still needs work. Doubtless people will report other problems and shortcomings. My hope is that we can discuss the distribution on the xml-sig and fix compiling and installation problems fairly quickly (perhaps for 2-3 weeks?). Once we're fairly satisfied with it, we can make an informal announcement, people can then start trying to do real work with it, which will doubtless turn up bugs in interfaces and parsers. Fixing those shouldn't require major reorganizations of the package structure, so the code could be considered beta status at that point. -- A.M. Kuchling http://starship.skyport.net/crew/amk/ Truth I have no trouble with, it's the facts I get all screwed up. 
-- Farley Mowat From fredrik@pythonware.com Sat May 23 14:15:01 1998 From: fredrik@pythonware.com (Fredrik Lundh) Date: Sat, 23 May 1998 14:15:01 +0100 Subject: [XML-SIG] Yet another sgmlop snapshot released Message-ID: <01bd864c$ca9d8800$f29b12c2@panik.pythonware.com> I've just uploaded yet another sgmlop development snapshot to: http://www.pythonware.com/madscientist/ -- attributes are now parsed by C code. this gives a considerable speedup for files using lots of attributes. -- added some extra support for saxlib (see saxhack.py for some sample code). Roughly, an sgmlop-based saxlib parser should be about 30 times faster than one based on the old sgmllib. If you run with empty callbacks, about 20% of the time is parsing and 80% of the time is Python method call overhead. Guess it's time to start hacking on the interpreter ;-) I'll be off-line until monday; it would be cool if someone else could turn saxhack.py into a real saxlib parser in the meantime! Enjoy /F fredrik@pythonware.com http://www.pythonware.com From larsga@ifi.uio.no Mon May 25 13:39:06 1998 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 25 May 1998 14:39:06 +0200 Subject: [XML-SIG] Putting the pieces together... In-Reply-To: <13669.33812.859425.877215@newcnri.cnri.reston.va.us> References: <13669.33812.859425.877215@newcnri.cnri.reston.va.us> Message-ID: * Andrew Kuchling | | * Should xml.sax.saxutils.ErrorPrinter go to stdout and not | stderr? No. (Will be fixed.) | Should Canonizer be renamed to Canonicalizer? Possibly. Someone who knows English a little better than me can perhaps tell us what the correct form is? Your finalization plan sounded good to me. -- "These are, as I began, cumbersome ways / to kill a man. Simpler, direct, and much more neat / is to see that he is living somewhere in the middle / of the twentieth century, and leave him there." 
-- Edwin Brock http://www.stud.ifi.uio.no/~larsga/ http://birk105.studby.uio.no/ From fredrik@pythonware.com Mon May 25 15:17:52 1998 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 25 May 1998 15:17:52 +0100 Subject: [XML-SIG] Putting the pieces together... Message-ID: <01bd87e7$e76f7b40$f29b12c2@panik.pythonware.com> >Doubtless people will report other problems and shortcomings. > >My hope is that we can discuss the distribution on the xml-sig and fix >compiling and installation problems fairly quickly (perhaps for 2-3 >weeks?). Once we're fairly satisfied with it, we can make an informal >announcement, people can then start trying to do real work with it, >which will doubtless turn up bugs in interfaces and parsers. Fixing >those shouldn't require major reorganizations of the package >structure, so the code could be considered beta status at that point. Just for the record, I've been tinkering with saxlib over the week- end, and there are a few things that I would really like to discuss before the design is frozen. I'll try to post my findings tomorrow. Cheers /F From digitome@iol.ie Mon May 25 15:39:44 1998 From: digitome@iol.ie (Sean Mc Grath) Date: Mon, 25 May 1998 15:39:44 +0100 Subject: [XML-SIG] Putting the pieces together... Message-ID: <199805251439.PAA09015@GPO.iol.ie> > >| Should Canonizer be renamed to Canonicalizer? > Yes I think so. 
After all - no one expects the Spanish Inquisition:-) Sean Mc Grath http://www.digitome.com/sean.htm County Sligo, Ireland, Tel: +353 96 47391 From papresco@technologist.com Mon May 25 17:27:11 1998 From: papresco@technologist.com (Paul Prescod) Date: Mon, 25 May 1998 12:27:11 -0400 Subject: [XML-SIG] First cut at SAX tutorial References: <13667.11456.431157.909770@newcnri.cnri.reston.va.us> Message-ID: <35699BDF.A34D1E9B@technologist.com> Andrew Kuchling wrote: > > Something I'm not sure of: are there any cases where the user has to > perform entity substitution themselves, such as turning é into > the right character, or would any such XML parser be considered > broken? (For example, what if it's not a standalone document, and the > parser doesn't read the DTD. Wondering if I need to document how to > do that...) Yes, it is possible that an XML parser could pass an entity reference instead of the contents of an entity to the application. Let me try to clarify a few things: All processors must read at least part of the DTD. But they do not have to read all of the DTD (e.g. they may skip external parts) When they do not read the full DTD, they cannot expand some external entities. Even when they do read the full DTD, they can choose not to expand some (any!) external entities, as long as the processor does not claim to be a validating parser. Paul Prescod - http://itrc.uwaterloo.ca/~papresco "A writer is also a citizen, a political animal, whether he likes it or not. But I do not accept that a writer has a greater obligation to society than a musician or a mason or a teacher. Everyone has a citizen's commitment." - Wole Soyinka, Africa's first Nobel Laureate From djad022@uce.ac.uk Tue May 26 12:51:32 1998 From: djad022@uce.ac.uk (Daniel Biddle) Date: Tue, 26 May 1998 12:51:32 +0100 (BST) Subject: [XML-SIG] Putting the pieces together... 
In-Reply-To: from "Lars Marius Garshol" at May 25, 98 02:39:06 pm Message-ID: <199805261155.HAA29317@python.org> Lars Marius Garshol wrote: > > * Andrew Kuchling > | > | Should Canonizer be renamed to Canonicalizer? > > Possibly. Someone who knows English a little better than me can > perhaps tell us what the correct form is? Here's what WWWebster () has to say: | Main Entry: can7on7ize | Pronunciation: 'ka-n&-"nIz | Function: transitive verb | Inflected Form(s): can7on7ized /-"nIzd; in "Hamlet" usually k&-'nd-"nIzd/; | can7on7iz7ing | Etymology: Middle English, from Medieval Latin canonizare, from Late Latin | canon catalog of saints, from Latin, standard | Date: 14th century | 1 : to declare (a deceased person) an officially recognized saint | 2 : to make canonical | 3 : to sanction by ecclesiastical authority | 4 : to attribute authoritative sanction or approval to | 5 : to treat as illustrious, preeminent, or sacred | - can7on7i7za7tion /"ka-n&-n&-'zA-sh&n/ noun Meaning 2 is obviously the one we want. Have any of the others been implemented in software? B-) There's no entry for 'canonicalize', but I also found | Main Entry: canonical form | Function: noun | Date: 1851 | : the simplest form of something; [...] Let's keep names in their simplest forms: less typing! -- Daniel Biddle | M a i l - i n B l o c k | 2nd year BSc (Hons) | ------------------------- | Software Engineering | protecting your mail from | ... pedant and lurker | the spam of the universe! | From tjreedy@UDel.Edu Wed May 27 00:57:34 1998 From: tjreedy@UDel.Edu (Terry Reedy) Date: Tue, 26 May 98 16:57:34 PDT Subject: [XML-SIG] Putting the pieces together... In-Reply-To: <199805261155.HAA29317@python.org> References: Conversation with last message <199805261155.HAA29317@python.org> Message-ID: >Let's keep names in their simplest forms: less typing! I agree. 
The neologism 'canonicalize' is pretty ugly, Terry From larsga@ifi.uio.no Wed May 27 22:10:09 1998 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 27 May 1998 23:10:09 +0200 Subject: [XML-SIG] PyPointers released Message-ID: I've just put out an experimental release of my XPointer implementation at Like I wrote, it's experimental, but the parser should be complete and all relative location terms except 'preceding' are implemented. The DOM locator is built on PyDOM and returns DOM nodes. Feedback in any form and on any aspect of this is of course most welcome. -- "These are, as I began, cumbersome ways / to kill a man. Simpler, direct, and much more neat / is to see that he is living somewhere in the middle / of the twentieth century, and leave him there." -- Edwin Brock http://www.stud.ifi.uio.no/~larsga/ http://birk105.studby.uio.no/ From fredrik@pythonware.com Thu May 28 22:58:25 1998 From: fredrik@pythonware.com (Fredrik Lundh) Date: Thu, 28 May 1998 22:58:25 +0100 Subject: [XML-SIG] Mr. Nitpicker looks at saxlib Message-ID: <01bd8a83$bce958c0$f29b12c2@panik.pythonware.com> (sorry, but I've got distracted. Darn those paying customers ;-) Here's some comments on saxlib, based on the HOWTO document, a quick look at the sources, and some experiences from the sgmlop- based coreXML parser I've written for our RDE and MIOW projects. (I really should have looked closer at the sources, and read the SAX spec again, but will probably not get around to do that before the weekend... feel free to flame away if I've misunderstood every- thing) important issues ------------------- 1. Performance #1: Should the "characters" method really take start/length arguments? 
I suppose this is a direct mapping of the Java SAX spec, but it has
one serious drawback: the string slicing operator copies the string,
which means that you'll end up with an extra string copy when you use
fast parsers like sgmlop and pyexpat:

- parser copies data into a python string
- driver calls "characters" with string, start=0, and
  length=len(string)
- user-defined class does string[start:start+length], which copies
  the string again
(- the user class does self.data = self.data + string[...], which
  copies the string yet another time. sigh...)

I'd say we might as well get rid of those two arguments, and leave
it to the parser to slice and dice. Or if you insist, you could at
least change start/length to start/end...

2. Usability: There's no "feed" method. While it is perfectly valid
to assume threading for Java, I don't think this is a valid
requirement for Python code. Since sgmlop, xmllib, and pyexpat
all support incremental parsing (and since our stuff is
event-driven...), it would be good if saxlib exposed these methods
in some way.

somewhat important issues
--------------------------------

3. Performance: Is the AttributeList class really necessary? Wouldn't
it be enough to use a good ole dictionary?

4. Performance and usability: sgmllib and xmllib currently allow
you to implement a "static DTD" via start_xxx, end_xxx, and
do_xxx methods. While this cannot be used to handle all kinds of
DTDs, it sure makes it easier to implement simple parsers.

consider:

    def startElement(self, name, attrs):
        # If it's a comic element, save the title and issue
        if name == 'comic':
            self.this_title = attrs.get('title', "")
            self.this_number = attrs.get('number', "")
        # If it's the start of a writer element, note that fact
        elif name == 'writer':
            self.inWriterContent = 1
            self.writerName = ""

    def endElement(self, name):
        if name == 'writer':
            self.inWriterContent = 0
            if self.search_name == self.writerName:
                print 'Found:', self.this_title, self.this_number

vs.

    def start_comic(self, attrs):
        self.this_title = attrs.get("title", "")
        self.this_number = attrs.get("number", "")

    def start_writer(self, attrs):
        self.inWriterContent = 1
        self.writerName = ""

    def end_writer(self):
        self.inWriterContent = 0
        if self.search_name == self.writerName:
            print 'Found:', self.this_title, self.this_number

or even:

    def start_comic(self, number="", **attrs):
        self.this_number = number

    (etc)

This also makes it possible to speed things up (the parser can
cache the bound methods to minimize the number of lookups and
extra comparisons)

5. Usability: the coreXML parser exposed the internal tag stack used
to check that elements are properly closed. The result is that you
can write things like:

    def startElement(self, name, attrs):
        if self.tags[-2:] == ["comic", "writer"]:
            ...

which is, IMHO, pretty cool.

6. Usability: htmllib (!) provides save_bgn and save_end methods in
the baseclass which implement that self.data = self.data + ...
stuff that everyone has to implement anyway... should saxlib provide
something similar?

    def start_writer(self, attrs):
        self.save_bgn()

    def end_writer(self):
        writer = self.save_end()
        ...

7. Should the API be tweaked to adhere to the Python style guidelines?
That is, should startElement be start_element instead?

    http://www.python.org/doc/essays/styleguide.html

8. Shipping. While it's obvious that saxlib with all drivers and
utilities should be included in the big everything-in-a-single-package
XML add-on, I'm not sure everything that could fit into that package
should be distributed with the Python core (at least if Guido still
adheres to the "if I cannot hack it, I don't want it in the core"
principle). But I think saxlib+xmllib+sgmlop should be part of the
standard library in future releases. What do you think?

9. Should sgmlop perhaps be renamed to xmlop?

10. May I go home now?
Cheers /F fredrik@pythonware.com http://www.pythonware.com From larsga@ifi.uio.no Thu May 28 23:14:22 1998 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 29 May 1998 00:14:22 +0200 Subject: [XML-SIG] Re: Mr. Nitpicker looks at saxlib In-Reply-To: <01bd8a83$bce958c0$f29b12c2@panik.pythonware.com> References: <01bd8a83$bce958c0$f29b12c2@panik.pythonware.com> Message-ID: * Fredrik Lundh | | feel free to flame away if I've misunderstood everything *straps on flamethrower* | 1. Performance #1: Should the "characters" method really take start/length | arguments? | | I suppose this is a direct mapping of the Java SAX spec, It is. | but it has one serious drawback: the string slicing operator | copies the string, Just a thought: should that be changed, since strings are supposed to be mutable anyway and this is such a common operation? | which means that you'll end up with an extra string copy when you | use fast parsers like sgmlop and pyexpat However, it has the advantage that parsers like xmlproc don't have to do that extra string copy, since what xmlproc gives you is a restricted view of the internal xmlproc data buffer. I timed the parser before and after I implemented this and the speed increase was significant. Is it impossible to do something similar in sgmlop to avoid this overhead? We should also think about where speed is most critical: in slow Python parsers or fast C parsers... | Or if you insist, you could at least change start/length to | start/end... Well, that means JPython integration goes down the drain and introduces a subtle, but important difference in the API that's likely to cause major pain. I should know. In the first SAX translation I got this backwards in the xmlproc driver, but correctly in the others. Nobody ever complained, but it bit me, and it took me a while to figure out where the problem was. IMHO, having two so similar APIs when many people are likely to be using both is just asking for trouble. | 2. 
Usability: There's no "feed" method. While it is perfectly valid
| to assume threading for Java, I don't think this is a valid
| requirement for Python code. Since sgmlop, xmllib, and pyexpat
| all support incremental parsing (and since our stuff is
| event-driven...), it would be good if saxlib exposed these methods
| in some way.

Well, the only parsers that can't support this are nsgmls wrappers
(which will happen at some point) and XML-Toolkit. However, adding it
means extending SAX and knowing that some parsers will not support it.

I'm thinking of extending the Parser interface with some more methods
that are not part of SAX 1.0 anyway, so perhaps we can do this in a
more controlled fashion. My plan was to keep saxlib pure, but to add a
number of optional methods in a subclass of saxlib.Parser in saxexts
and implement these in all parser drivers.

How about this:

# --- Experimental extension to Parser interface

class ExtendedParser(saxlib.Parser):
    "Experimental unofficial SAX level 2 extended parser interface."

    def get_parser_name(self):
        "Returns a single-word parser name."
        raise saxlib.SAXException("Method not supported.",None)

    def get_parser_version(self):
        """Returns the version of the imported parser, which may not be the
        one the driver was implemented for."""
        raise saxlib.SAXException("Method not supported.",None)

    def get_driver_version(self):
        "Returns the version number of the driver."
        raise saxlib.SAXException("Method not supported.",None)

    def is_validating(self):
        "True if the parser is validating, false otherwise."
        raise saxlib.SAXException("Method not supported.",None)

    def is_dtd_reading(self):
        """True if the parser is non-validating, but conforms to the spec by
        reading the DTD."""
        raise saxlib.SAXException("Method not supported.",None)

    def reset(self):
        "Makes the parser start parsing afresh."
        raise saxlib.SAXException("Method not supported.",None)

    def feed(self,data):
        "Feeds data to the parser."
        raise saxlib.SAXException("Method not supported.",None)

    def get_stack(self):
        "Returns the current element stack."
        raise saxlib.SAXException("Method not supported.",None)

If we decide to do this I'll write a more formal specification for
this interface. I'm also thinking that ParserFactory should have four
lists of parsers: the current list, a list of HTML parsers, a list of
SGML parsers and a list of extended SAX drivers.

| 3. Performance: Is the AttributeList class really necessary? Wouldn't
| it be enough to use a good ole dictionary?

Using dictionaries would mean losing attribute type information. This
is important to be able to identify the different attribute types. DOM
will have to do this and many other kinds of systems as well. At
present, only xmlproc supports this, but by the sound of it Pyexpat
will also be validating at some point.

Also, the current Python version of AttributeList has the added
advantage that it can be used like this:

    print "<%s" % name
    for attr in attrs:
        print " %s=%s" % (attr,attrs[attr])

Then there's JPython integration and all that.

IMHO AttributeList should stay. Those who want all-out maximum
overdrive __raw__ speed above all else should not use SAX (or even
Python) anyway.

| 4. Performance and usability: sgmllib and xmllib currently allows
| you to implement a "static DTD" via start_xxx, end_xxx, and
| do_xxx methods. While this cannot be used to handle all kinds of
| DTD's, it sure makes it easier to implement simple parsers.

There is of course very good sense in this. I think the best way to go
about this is to add it as a separate layer on top of SAX and not as a
part of the SAX interface required to be implemented by parsers. That
would slow us down, and many important SAX users like DOM (and my
sax2obj) would never use it at all.

I think the best solution would be to make a DispatchDocHandler
subclass of DocumentHandler and let those who want this use it
instead.
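A minimal sketch of the DispatchDocHandler idea proposed above, written in present-day Python 3 rather than the Python 1.5 of this thread. Only the name DispatchDocHandler, the startElement/endElement callbacks and the start_xxx/end_xxx naming convention come from the messages; the class body, the method cache, the ComicHandler subclass and the use of a plain dictionary for attrs are assumptions made for illustration, and saxlib itself is not imported.

```python
class DispatchDocHandler:
    """Hypothetical layer above a SAX-style document handler.

    Routes startElement/endElement events to start_<name>/end_<name>
    methods when a subclass defines them, and ignores other elements.
    """

    def __init__(self):
        # Cache of looked-up bound methods (or None), so each element
        # name costs one getattr() per handler instead of one per event.
        self._methods = {}

    def _find(self, prefix, name):
        key = prefix + name
        if key not in self._methods:
            self._methods[key] = getattr(self, key, None)
        return self._methods[key]

    def startElement(self, name, attrs):
        method = self._find("start_", name)
        if method is not None:
            method(attrs)

    def endElement(self, name):
        method = self._find("end_", name)
        if method is not None:
            method()


class ComicHandler(DispatchDocHandler):
    """Illustrative subclass based on the comic example in the thread."""

    def __init__(self):
        DispatchDocHandler.__init__(self)
        self.found = []
        self.title = ""
        self.number = ""

    def start_comic(self, attrs):
        # attrs is assumed to behave like a dictionary here.
        self.title = attrs.get("title", "")
        self.number = attrs.get("number", "")

    def end_comic(self):
        self.found.append((self.title, self.number))


# Events a parser driver might deliver for a small document.
handler = ComicHandler()
handler.startElement("collection", {})
handler.startElement("comic", {"title": "Sandman", "number": "62"})
handler.endElement("comic")
handler.endElement("collection")
```

Because the dispatch lives in one handler class, parser drivers stay unchanged and SAX users that never want per-element methods pay nothing for it, which is the separation argued for above.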

| This also makes it possible to speed things up (the parser can
| cache the bound methods to minimize the number of lookups and
| extra comparisons)

I agree, but this shouldn't be the responsibility of parser drivers,
but rather of a single class. DispatchDocHandler can very well be
implemented in this way.

| 5. Usability: the coreXML parser exposed the internal tag stack used
| to check that elements are properly closed.

This was cool, I agree, and much more so in Python than in Java. All
parsers must keep this information anyway, and in special cases like
the nsgmls wrapper where it is not available the driver can keep track
of the stack behind the scenes.

| 6. Usability: htmllib (!) provides save_bgn and save_end methods in
| the baseclass which implements that self.data = self.data +
| ... stuff that everyone has to implement anyway... should saxlib
| provide something similar?

Well, this could be implemented in the extended drivers and might have
some advantages, but I personally don't really want this feature. Any
other opinions?

| 7. Should the API be tweaked to adhere to the Python style
| guidelines?

IMHO: no. It's too late now. I've got lots of code that uses this
style, you've got code that uses it, the DOM has it, the tutorial uses
it, Paul Prescod has used it and probably many others. saxlib has been
downloaded 85 times in the past month so there's probably quite a bit
of code built on it already. I'm sorry, but I think this particular
train has left.

| 8. Shipping. [...] I think saxlib+xmllib+sgmlop should be part of
| the standard library in future releases. What do you think?

OK for me. :-)

| 9. Should sgmlop perhaps be renamed to xmlop?

XML is SGML, so if you want to support both I think you should keep
the name. If you decide to skip SGML support altogether a name change
would be more appropriate.

| 10. May I go home now?

Certainly. *puts away flamethrower*

Thank you ever so much for the feedback.
I've wanted feedback on most of the issues you raise here, but never
got it. I'm glad to see it come now, even if it is a little late.

--
"These are, as I began, cumbersome ways / to kill a man. Simpler,
direct, and much more neat / is to see that he is living somewhere in
the middle / of the twentieth century, and leave him there."
-- Edwin Brock

http://www.stud.ifi.uio.no/~larsga/ http://birk105.studby.uio.no/

From Jack.Jansen@cwi.nl Fri May 29 09:35:34 1998
From: Jack.Jansen@cwi.nl (Jack Jansen)
Date: Fri, 29 May 1998 10:35:34 +0200
Subject: [XML-SIG] Re: Mr. Nitpicker looks at saxlib
In-Reply-To: Message by Lars Marius Garshol , 29 May 1998 00:14:22 +0200 ,
Message-ID:

> | but it has one serious drawback: the string slicing operator
> | copies the string,
>
> Just a thought: should that be changed, since strings are supposed to
> be mutable anyway and this is such a common operation?

This is difficult. I looked into this a few years ago (add a "parent" pointer
to the string object and incref the parent when you refer to parts of its
string), but it had the serious disadvantage that the parent string would
never be released as long as it had any children. This is quite common. And
then we got string interning which basically killed the whole idea.

However: isn't it possible to be intelligent about the slicing, i.e. only do
the slice if necessary?
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@cwi.nl      | ++++ if you agree copy these lines to your sig ++++
http://www.cwi.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm

From fredrik@pythonware.com Fri May 29 11:06:42 1998
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Fri, 29 May 1998 11:06:42 +0100
Subject: [XML-SIG] Re: Mr.
Nitpicker looks at saxlib
Message-ID: <01bd8ae9$7ab383c0$f29b12c2@panik.pythonware.com>

>> | but it has one serious drawback: the string slicing operator
>> | copies the string,
>>
>> Just a thought: should that be changed, since strings are supposed to
>> be mutable anyway and this is such a common operation?

immutable

>This is difficult. I looked into this a few years ago (add a "parent" pointer
>to the string object and incref the parent when you refer to parts of its
>string), but it had the serious disadvantage that the parent string would
>never be released as long as it had any children. This is quite common. And
>then we got string interning which basically killed the whole idea.
>
>However: isn't it possible to be intelligent about the slicing, i.e. only do
>the slice if necessary?

In the context of saxlib, or generally? How do you define "if
necessary"?

Cheers /F

From larsga@ifi.uio.no Fri May 29 10:50:30 1998
From: larsga@ifi.uio.no (Lars Marius Garshol)
Date: 29 May 1998 11:50:30 +0200
Subject: [XML-SIG] Re: Mr. Nitpicker looks at saxlib
In-Reply-To:
References:
Message-ID:

* Jack Jansen
|
| However: isn't it possible to be intelligent about the slicing,
| i.e. only do the slice if necessary?

This is how the characters method is handled in saxlib at the moment:

 1) Driver receives data from parser. This is passed on to the
    application in one of two ways:

    a) xmlproc: receives string and offsets, passes them on
    b) the rest: receives string, passes on 0,len(string) offsets

 2) Application receives string and offsets. To get at the string it
    must slice, unless the offsets are 0,len(string).

The only way to avoid slicing is for the application to check the
offsets it receives (which may or may not require slicing). Since this
happens on the application side the solution is a bit difficult to
generalize.
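The application-side check described above might look roughly like
this. It is only a sketch for a current Python interpreter, assuming
the handler is called as characters(ch, start, length) with start and
length describing the part of ch that holds the data. The class name
TextCollector, its on_text hook and the slices_made counter are made up
for illustration and are not part of saxlib.

```python
class TextCollector:
    """Hypothetical handler that slices only when the offsets require it."""

    def __init__(self):
        self.chunks = []
        self.slices_made = 0  # counts how often a copy was really needed

    def characters(self, ch, start, length):
        if start == 0 and length == len(ch):
            # The offsets cover the whole string: use it as it is.
            data = ch
        else:
            # Only part of the string is character data: slice it out.
            data = ch[start:start + length]
            self.slices_made += 1
        self.on_text(data)

    def on_text(self, data):
        # Application hook that always receives a ready-to-use string.
        self.chunks.append(data)


handler = TextCollector()
# A driver that passes 0,len(string) offsets, so no slice is needed:
handler.characters("hello", 0, 5)
# A driver that passes a larger buffer plus offsets into it:
handler.characters("<a>world</a>", 3, 5)
text = "".join(handler.chunks)
```

In this sketch only the second call pays for a slice; the first passes
the string through untouched.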
One way to do it might be to have a document handler base class that
implemented the characters method by passing the data on to a
simple_chars(self,data) method, performing slicing only for those
offsets for which this is necessary. In fact, this can be optimized by
checking which parser is used and choosing the right one of two
slicing and non-slicing implementations of characters.

If people want it such a document handler base class can be made part
of saxexts.

--
"These are, as I began, cumbersome ways / to kill a man. Simpler,
direct, and much more neat / is to see that he is living somewhere in
the middle / of the twentieth century, and leave him there."
-- Edwin Brock

http://www.stud.ifi.uio.no/~larsga/ http://birk105.studby.uio.no/

From Jack.Jansen@cwi.nl Fri May 29 12:57:52 1998
From: Jack.Jansen@cwi.nl (Jack Jansen)
Date: Fri, 29 May 1998 13:57:52 +0200
Subject: [XML-SIG] Re: Mr. Nitpicker looks at saxlib
In-Reply-To: Message by Lars Marius Garshol , 29 May 1998 11:50:30 +0200 ,
Message-ID:

> The only way to avoid slicing is for the application to check the
> offsets it receives (which may or may not require slicing). Since this
> happens on the application side the solution is a bit difficult to
> generalize.
>
> One way to do it might be to have a document handler base class that
> implemented the characters method by passing the data on to a
> simple_chars(self,data) method, performing slicing only for those
> offsets for which this is necessary. In fact, this can be optimized by
> checking which parser is used and choosing the right one of two
> slicing and non-slicing implementations of characters.

The first is indeed what I meant, and the second is a good
implementation of it.
I guess I could have been a bit clearer:-)
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@cwi.nl      | ++++ if you agree copy these lines to your sig ++++
http://www.cwi.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm

From fredrik@pythonware.com Fri May 29 14:29:29 1998
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Fri, 29 May 1998 14:29:29 +0100
Subject: [XML-SIG] Re: Mr. Nitpicker looks at saxlib
Message-ID: <01bd8b05$cea88590$f29b12c2@panik.pythonware.com>

(darn. sent this to the wrong mailing list. wonder if I can disable
that auto-fill feature of outlook express...)

some comments on some of LMG's comments; more will follow later...

>| 2. Usability: There's no "feed" method.

> def feed(self,data):
>     "Feeds data to the parser."
>     raise saxlib.SAXException("Method not supported.",None)

don't forget the close method; sgmllib/xmllib/sgmlop uses this to
process what's left in the input buffers. and with the accelerated
versions of xmllib/sgmllib, you'll leak memory if you don't explicitly
call the close method.

>| 3. Performance: Is the AttributeList class really necessary? Wouldn't
>| it be enough to use a good ole dictionary?

>Using dictionaries would mean losing attribute type information. This
>is important to be able to identify the different attribute types. DOM
>will have to do this and many other kinds of system as well.

what exactly can be returned by the getType method? a string
describing the type? what values can it have? how should it be used?

>Also, the current Python version of AttributeList has the added
>>advantage that it can be used like this:
>>
>>print "<%s" % name
>>for attr in attrs:
>>    print " %s=%s" % (attr,attrs[attr])

for kv in attrs.items():
    print "%s=%s" % kv

>>Then there's JPython integration and all that. IMHO AttributeList
>>should stay.

Well, I'm still sceptical...

>>Those who want all-out maximum overdrive __raw__ speed above
>>all else should not use SAX (or even Python) anyway.
Hey, everyone should use Python!

>>| 6. Usability: htmllib (!) provides save_bgn and save_end methods
>>
>>Well, this could be implemented in the extended drivers and might have
>>some advantages, but I personally don't really want this feature. Any
>>other opinions?

Nope. I won't miss them, at least.

>>| 7. Should the API be tweaked to adhere to the Python style
>>| guidelines?
>>
>>IMHO: no. It's too late now. I've got lots of code that uses this
>>style, you've got code that's in it, the DOM has it, the tutorial uses
>>it, Paul Prescod has used it and probably many others.

OK.

>>| 9. Should sgmlop perhaps be renamed to xmlop?
>>
>>XML is SGML, so if you want to support both I think you should keep
>>the name.

Well, changing the name might help us cash in on all the XML hype...
But you're right, of course; I'll keep the current name.

Now back to VML.

Cheers /F

fredrik@pythonware.com
http://www.pythonware.com

From larsga@ifi.uio.no Fri May 29 14:18:48 1998
From: larsga@ifi.uio.no (Lars Marius Garshol)
Date: 29 May 1998 15:18:48 +0200
Subject: [XML-SIG] Re: Mr. Nitpicker looks at saxlib
In-Reply-To: <01bd8b00$ad5dabe0$f29b12c2@panik.pythonware.com>
References: <01bd8b00$ad5dabe0$f29b12c2@panik.pythonware.com>
Message-ID:

* Fredrik Lundh
|
| don't forget the close method;

Ah! Thanks. It's included now.

| [AttributeList]
|
| what exactly can be returned by the getType method? a string
| describing the type?

Yes.

| what values can it have?

CDATA, ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, NMTOKENS or
NOTATION.

| how should it be used?

To interpret the attribute value. An IDREF is a reference to an ID
declared somewhere else in the document. IDREFS is a list of same.
ENTITY or ENTITIES really means that the attribute value is the name
of an unparsed external entity (or a list in the case of ENTITIES) and
that the application should receive its public and system identifiers.
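As an illustration of how an application might use the attribute types
listed above, here is a small sketch that indexes ID attributes and
checks IDREF/IDREFS values against them. It is written for a current
Python interpreter and is not saxlib code: TypedAttrs is a made-up
stand-in for AttributeList that offers only getType plus dictionary
behaviour, and index_ids and dangling_refs are hypothetical helper
names.

```python
class TypedAttrs(dict):
    """Hypothetical stand-in for AttributeList: a dict plus getType()."""

    def __init__(self, values, types=None):
        dict.__init__(self, values)
        self._types = types or {}

    def getType(self, name):
        # Attributes without a declared type are reported as CDATA.
        return self._types.get(name, "CDATA")


def index_ids(elements):
    """Map every ID attribute value to the element name carrying it."""
    ids = {}
    for name, attrs in elements:
        for attr in attrs:
            if attrs.getType(attr) == "ID":
                ids[attrs[attr]] = name
    return ids


def dangling_refs(elements, ids):
    """Return IDREF/IDREFS targets that match no known ID."""
    missing = []
    for name, attrs in elements:
        for attr in attrs:
            kind = attrs.getType(attr)
            if kind == "IDREF":
                targets = [attrs[attr]]
            elif kind == "IDREFS":
                # IDREFS is a whitespace-separated list of references.
                targets = attrs[attr].split()
            else:
                continue
            missing.extend(t for t in targets if t not in ids)
    return missing


# (element name, attributes) pairs as a handler might have collected them
elements = [
    ("section", TypedAttrs({"id": "SAX", "title": "Parsing"}, {"id": "ID"})),
    ("link", TypedAttrs({"target": "SAX"}, {"target": "IDREF"})),
    ("seealso", TypedAttrs({"targets": "SAX DOM"}, {"targets": "IDREFS"})),
]
ids = index_ids(elements)
missing = dangling_refs(elements, ids)
```

With a plain dictionary the three attributes would be indistinguishable
strings; the type is what tells the application that "SAX" names an
element in one place and points at one in another.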
The DOM level 2 will be required to keep track of all the IDs in a
document so that one can say something like

    document.getElementWithId("SAX")

This will also be necessary in order to support one of the most
important parts of XPointers: the id(...) locator term.

| for kv in attrs.items():
|     print "%s=%s" % kv

OK. :-)

Support for dictionaries in AttributeList will be extended anyway.
Guido (or was it Andrew?) has proposed doing this by making
AttributeList be a subclass of UserDict.UserDict.

I think the best way around this problem is for you to implement
AttributeList in C by subclassing standard dictionaries and then I'll
do the same in Python. (Provided it's possible in a sensible fashion,
I haven't checked this yet.) Maybe Jack can do this in Pyexpat as
well?

If you look closely at it (in my translation, I've added a few things)
it's basically a slightly modified hash table with some extra methods
that are trivial to map onto the hash methods. In your case getType
should always return "CDATA" anyway, so it should be easy to
implement.

| Hey, everyone should use Python!

Of course! Pardon my temporary lapse into heresy, please. :)

--
"These are, as I began, cumbersome ways / to kill a man. Simpler,
direct, and much more neat / is to see that he is living somewhere in
the middle / of the twentieth century, and leave him there."
-- Edwin Brock

http://www.stud.ifi.uio.no/~larsga/ http://birk105.studby.uio.no/

From larsga@ifi.uio.no Sun May 31 17:39:46 1998
From: larsga@ifi.uio.no (Lars Marius Garshol)
Date: 31 May 1998 18:39:46 +0200
Subject: [XML-SIG] saxlib finalization
Message-ID:

Here is my plan for finalizing saxlib:

 1. Achieve agreement on interfaces
 2. Put out beta with all the new drivers and extensions
 3. Receive bug reports (if too_many_errors: goto 2)
 4. Put out final version

To this end I've written a description of the core SAX interface and
put it at

If there are no protests, that will be the final core SAX 1.0
interface in Python.
As for the extensions, these are the new interfaces I think we should
have:

 - ExtendedParser, as proposed in minus get_stack and with optional
   feed methods

 - DispatchDocHandler, which should be a DocumentHandler with an
   sgmllib-like interface. get_stack and the no-slice method will all
   go in here.

A final question: should these two last extensions be part of saxlib,
or should we keep them separate?

I really really want feedback on this, and I think it's important that
we freeze this spec soon. So everyone, please take a look at this and
tell us what you think.

Responses that are no more than OK/not OK to the four different issues
can be sent to me personally, and I'll summarize to the list. More
detailed answers, questions and suggestions for more extensions can be
posted here.

I say we make the final decision on Monday 8th of June (the deadline
is so late because I'll be offline for the 4 days prior to that date),
and then I should be able to have a fully documented package out
pretty soon. Hopefully we'll start seeing interesting stuff built on
top of SAX after that.

--
"These are, as I began, cumbersome ways / to kill a man. Simpler,
direct, and much more neat / is to see that he is living somewhere in
the middle / of the twentieth century, and leave him there."
-- Edwin Brock

http://www.stud.ifi.uio.no/~larsga/ http://birk105.studby.uio.no/