From Jacco.van.Ossenbruggen@cwi.nl Tue Jun 1 15:02:36 1999 From: Jacco.van.Ossenbruggen@cwi.nl (J.R. van Ossenbruggen) Date: Tue, 01 Jun 1999 16:02:36 +0200 Subject: [XML-SIG] patch to xml/dom/esis_builder.py Message-ID: Hi all, I use the xml package in a mixed SGML/XML environment. I directly process XML but use SP to convert SGML to ESIS first. This works fine, except for two minor modifications I needed to make to esis_builder.py. The first modification prevents a crash on "#IMPLIED" attributes, in which case the current version fails to notice that the third argument (the value of the attr) is missing. The second modification provides an optional argument to EsisBuilder.__init__, which allows me to pass string.lower or string.upper functions to do the necessary case conversions (SGML names are not case sensitive and converted to uppercase by SP, XML names are not changed when converted to ESIS). I include the relevant patch below. I think the changes could be useful for more people, and as far as I know,do not break any existing code. I'd be grateful if they are included in the main distribution. Let me know what you think, Jacco --- Index: esis_builder.py =================================================================== RCS file: /projects/cvsroot/xml/dom/esis_builder.py,v retrieving revision 1.5 diff -c -r1.5 esis_builder.py *** esis_builder.py 1999/03/18 12:38:28 1.5 --- esis_builder.py 1999/06/01 12:09:09 *************** *** 27,37 **** class EsisBuilder(Builder): ! def __init__(self): Builder.__init__(self) self.attr_store = {} self.id_store = {} #self.sdata_handler = handle_sdata def feed(self, data): for line in string.split(data, '\n'): --- 27,39 ---- class EsisBuilder(Builder): ! def __init__(self, convert=lambda x:x): Builder.__init__(self) self.attr_store = {} self.id_store = {} #self.sdata_handler = handle_sdata + # convert may, for example, be used to handle case conversion + self.convert = convert def feed(self, data): for line in string.split(data, '\n'): *************** *** 41,46 **** --- 43,49 ---- text = line[1:] if event == '(': + text = self.convert(text) element = self.document.createElement(text, self.attr_store) self.attr_store = {} self.push(element) *************** *** 50,57 **** elif event == 'A': l = re.split(' ', text, 2) ! name = l[0] ! value = ESISDecode(l[2]) self.attr_store[name] = value elif event == '-': --- 53,64 ---- elif event == 'A': l = re.split(' ', text, 2) ! name = self.convert(l[0]) ! if l[1] == 'IMPLIED': ! # fix this. Needs to be undefined attr ! value = '' ! else: ! value = ESISDecode(l[2]) self.attr_store[name] = value elif event == '-': From Fred L. Drake, Jr." References: Message-ID: <14164.10354.491850.334248@weyr.cnri.reston.va.us> J.R. van Ossenbruggen writes: > I include the relevant patch below. I think the changes could be > useful for more people, and as far as I know,do not break any existing > code. I'd be grateful if they are included in the main distribution. I like this, but have two changes. First, the default convert function could be str instead of a lambda; this would be faster since str() is implemented in C (or Java in JPython). The second change concerns this part of the patch: > ! name = self.convert(l[0]) > ! if l[1] == 'IMPLIED': > ! # fix this. Needs to be undefined attr > ! value = '' > ! else: > ! 
value = ESISDecode(l[2]) > self.attr_store[name] = value This could be something like this: if l[1] != 'IMPLIED': self.attr_store[self.convert(l[0])] = ESISDecode(l[2]) This does just as much as needed, and doesn't create the bogus attribute entry in the dictionary. -Fred -- Fred L. Drake, Jr. Corporation for National Research Initiatives From Fred L. Drake, Jr." In the spirit of making changes to the DOM builder, here's something I've played with a little. When I started working on the conversion of the Python documentation to SGML/XML, I was building DOM objects that weren't legal: the LaTeX files don't map to hierarchical structures, even if you ignore the document preamble. While the documents themselves can be treated as hierachical in this specific case, that's not the case for individual files, which is the level I want to work at. When Andrew fixed the Document class to be less forgiving, I had to change the way I used it, building the more reasonable DocumentFragment objects instead. I'm driving the whole conversion across ESIS streams, so I wanted the ESIS builder to be able to build fragments instead of documents. I've been using a custom subclass that added the needed functionality for this (and some other stuff), but this would probably be very useful for others doing conversion processes. I think the appended patch would be useful for others. It adds a method to xml.dom.builder.Builder called buildFragment(); it has to be called before document construction starts and causes a fragment to be built instead. The fragment can be found as the "fragment" attribute of the builder or the return value of buildFragment(). Does this make sense as part of the base class? Or is this too special a situation? -Fred -- Fred L. Drake, Jr. Corporation for National Research Initiatives diff -c -r1.10 builder.py *** builder.py 1999/03/18 12:38:28 1.10 --- builder.py 1999/06/01 18:58:29 *************** *** 15,22 **** --- 15,31 ---- def __init__(self): self.document = createDocument() + self.fragment = None + self.target = self.document self.current_element = None + def buildFragment(self): + if self.fragment or len(self.document.childNodes): + raise RuntimeError, \ + "cannot build fragment once document has been started" + self.fragment = self.document.createDocumentFragment() + self.target = self.fragment + return self.fragment def push(self, node): "Add node to current node and move to new node." *************** *** 24,35 **** nodetype = node.get_nodeType() if self.current_element: self.current_element.insertBefore(node, None) ! elif nodetype in _LEGAL_DOCUMENT_CHILDREN: if nodetype == TEXT_NODE: if string.strip(node.get_nodeValue()) != "": ! self.document.appendChild(node) else: ! self.document.appendChild(node) if nodetype == ELEMENT_NODE: self.current_element = node --- 33,44 ---- nodetype = node.get_nodeType() if self.current_element: self.current_element.insertBefore(node, None) ! elif self.fragment or nodetype in _LEGAL_DOCUMENT_CHILDREN: if nodetype == TEXT_NODE: if string.strip(node.get_nodeValue()) != "": ! self.target.appendChild(node) else: ! self.target.appendChild(node) if nodetype == ELEMENT_NODE: self.current_element = node From jkraai@murl.com Wed Jun 2 05:14:09 1999 From: jkraai@murl.com (jkraai) Date: Wed, 02 Jun 1999 04:14:09 +0000 Subject: [XML-SIG] XML -> DTD lib? References: <14164.11654.684418.724401@weyr.cnri.reston.va.us> Message-ID: <3754AF91.38CFDE12@murl.com> Anyone have a DTD generator? 
I feel like I should not need such a thing, the DTD should have been written and it shouldn't have to be reverse-engineered. What I'd like to do is to give my users the ability to describe a record, then calculate a DTD for that record. This would be a great exercise for me to better understand XML, but if the code already exists ... Thanks for such great code everybody, --jim From danda@netscape.com Wed Jun 2 09:05:13 1999 From: danda@netscape.com (Dan Libby) Date: Wed, 02 Jun 1999 01:05:13 -0700 Subject: [XML-SIG] Re: RSS and stuff Message-ID: <3754E5B9.96A9FD54@netscape.com> Lars, glad to see that others are using the format, even if it is "too simple". ;-) I'm sure you'll be glad to hear that we are doing our validation with python and the excellent XML libraries you all have contributed to. FYI, the current validator is very specific. It understands the "0.9" format intimately at the code level. However, in my spare time I've been working on a generic validator that will read in a schema file (of my own devise, not a real XML schema) that's written in XML, and then validate a document based on that. That way, format changes should be simple to implement, at least from a validation standpoint. Hopefully I can get it installed soon, and possibly even distribute the source, such as it is. (This is my first Python + first DOM coding project). This seems like a pretty obvious thing to me, I'm surprised that XML has gotten as far as it has without real support for enforcing data types, lengths, ranges, etc. > I sat down yesterday and had a look at RSS, a format for news > headlines which is used by Slashdot, mozilla.org and Scripting News, > among others. It was very simple (a bit too simple, in fact), so I sat > down and made a simple RSS library and client in Python. This client > produces a web page when it is run. (I run it from cron.) > What would you like to see / not see in the format? It really is just supposed to be a summary. Ideally, we would like to support all of Dublin Core eventually, but the problem is that the additional data may not actually be used, and marketing folks felt it would be simpler to not confuse folks too much. > The 'specification' and lists of providers can be found at: > > (warning: the RSS guide is not very accurate technically) > What in particular did you find that was inaccurate? I agree it is not very technical, as it is aimed at a pretty general audience, however, it should be pretty accurate. This brings me to another question. Do you all believe it is the "right thing" to publish a DTD for a format, even if the DTD by itself is not sufficient to validate the document? In other words, an XML editor application referencing the DTD would allow the user to construct a document that is non-valid with regards to our rules. It seems to me that the DTD then becomes something of a distraction, because compliance with it, by itself, is not much more useful than well-formedness, from a validation point of view. -dan From larsga@ifi.uio.no Wed Jun 2 09:52:06 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 02 Jun 1999 10:52:06 +0200 Subject: [XML-SIG] Re: RSS and stuff In-Reply-To: <3754E5B9.96A9FD54@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> Message-ID: Hi Dan, * Dan Libby | | I'm sure you'll be glad to hear that we are doing our validation | with python and the excellent XML libraries you all have contributed | to. I certainly am glad to hear that! 
I'm also glad to see actual live Netscape representatives in a public forum, since I've been wanting to discuss RSS with you. And although I say a lot of negative stuff about RSS below I'd like to congratulate you on the first successful global XML web application. There are so many RSS documents on the web now, and quite a bit of software, so I don't think there's any question that this honour belongs to you. I also don't think there's any question that a major part of the reason is that RSS is so simple. I use my RSS client every day now and am very happy with it. I just wish everyone whose pages I'm interested in would provide RSS feeds, and I will probably start asking for it pretty soon. | FYI, the current validator is very specific. It understands the | "0.9" format intimately at the code level. This is definitely a good idea. Sadly, though, many of the RSS files on the net are not even well-formed. The ones for WebMonkey and python.org spring to mind. | However, in my spare time I've been working on a generic validator | that will read in a schema file (of my own devise, not a real XML | schema) that's written in XML, and then validate a document based on | that. Hmmm. Why not use a real XML schema? It should support everything I can imagine you would want anyway. Or is it too complex? | Hopefully I can get it installed soon, and possibly even distribute | the source, such as it is. (This is my first Python + first DOM | coding project). It would be great if you did. | This seems like a pretty obvious thing to me, I'm surprised that XML | has gotten as far as it has without real support for enforcing data | types, lengths, ranges, etc. I can just hear the functional programming freaks (Standard ML, Haskell and all that) say the same thing about Python. :-) Seriously, these things aren't as important as many people think. And it's also worth remembering that XML comes from a document background where such things are not all that relevant. (Imagine trying to do this for HTML. Actually enforcing correct use of DFN, H1-H6, ABBR, ACRONYM, VAR, ADDRESS and all the other elements would require a serious number of years of AI development in Prolog or Common Lisp.) | What would you like to see / not see in the format? It really is | just supposed to be a summary. The first thing I'd like to see is a date element for items. Many RSS providers currently use something like: (19990602) New foo! ... and it would be useful to formalize that as: 19990602 ... The second thing is descriptions for items. I'm thinking of providing an RSS feed for my home page, and when I do I know I will want to be able to have entries like: <item> <date>19990602</date> <title>RSS feed available! I now provide an RSS feed which lists all updates to my home page. This will hopefully make it easier for people A third thing is a place to put the email address of the maintainer so that I know where to complain when a document isn't well-formed. There's probably more as well, which I'll think of the moment I send this. If you want discussion about what RSS should and shouldn't contain I'd recommend you to try to start it here or over at xml-dev. (I know Dave Winer has a lot of ideas for it | Ideally, we would like to support all of Dublin Core eventually, but | the problem is that the additional data may not actually be used, | and marketing folks felt it would be simpler to not confuse folks | too much. 
I came to pretty much the same conclusion with XSA (see below) and then discovered that the difficult stuff was needed anyway. But I still think this is the right way to go: - make a simple version and put it out - wait for widespread acceptance and lots of implementations - then add all the difficult stuff and make it optional (In your case: why not make a CGI wizard like I did with XSA, and add a link from the RSS guide to the more fancy options?) In any case, this isn't a new idea, since this is exactly what C, Unix and C++ have done (to some extent also SAX and XML) and it seems to work better than the opposite approach, favoured by many little-known technologies (such as SGML). * Lars Marius Garshol | | (warning: the RSS guide is not very accurate technically) * Dan Libby | | What in particular did you find that was inaccurate? Here's a quick list: - The guide says: "Name your file using the .rdf suffix, unless you are generating your file dynamically using a .cgi or other program. Netscape recommends the use of the .rdf filename suffix, but does not require it." Well, on the web it's the MIME type that counts, so the guide should give the correct MIME type and then some hints on how to get it right. The suffix is just an ugly trick to get the right MIME type on correctly configured servers. - "RSS 0.9 supports the full ASCII character set, as well as all legal decimal and HTML entities. RSS 0.9 does not support other types of character data, such as UTF-8. For a list of legal HTML and decimal entities, refer to Special Symbols and Entities on DevEdge, Netscape's information resource for developers." Well, XML uses Unicode, but I suppose applications can be more restrictive. However, you cannot use HTML entities in XML without declaring them, and since there is no RSS DTD any RSS file that uses an HTML entity is not well-formed. - '' If you use US-ASCII you might as well declare that you're doing so with an encoding declaration. (Parsers may then complain if you don't conform to your own declaration.) - Also, what's the relationship with RDF? RSS uses the RDF root element, but does not conform to the RDF syntax or actually use anything meaningful from RDF. | This brings me to another question. Do you all believe it is the | "right thing" to publish a DTD for a format, even if the DTD by | itself is not sufficient to validate the document? Yes! A DTD is useful in that it allows you to do at least some validation, and it's also very useful as a statement of intent (that is, as documentation). For example, when reading the RSS guide it's impossible to tell whether one or more textinput elements are allowed and where they are allowed. The same goes for the image element. This is the RSS DTD I currently have in my CVS tree. However, I have no idea whether it's correct or not. For example, I've seen userland.com use the image element as a special kind of item, so maybe the rdf:RDF element should have (channel, (image | item)+, textinput?). | In other words, an XML editor application referencing the DTD would | allow the user to construct a document that is non-valid with | regards to our rules. It seems to me that the DTD then becomes | something of a distraction, because compliance with it, by itself, | is not much more useful than well-formedness, from a validation | point of view. It's useful in that it provides more information for content providers and software developers, and in that it's 100% unambiguous. 
It's also useful for you when doing validation with custom-written tools, since you won't have to worry about where elements occur. I've done exactly the same for XSA and have exactly the same problem as you. I provided a DTD and have special validating software that rides on top of a validator (xmlproc). If I were to do it again there's no question that I would do the same thing. So far there has been no confusion at all (although I've seen HTML users become confused by this). See for more info. --Lars M. From michel.plu@cnet.francetelecom.fr Wed Jun 2 10:06:56 1999 From: michel.plu@cnet.francetelecom.fr (PLU Michel CNET/DSM/LAN) Date: Wed, 2 Jun 1999 11:06:56 +0200 Subject: [XML-SIG] accessing to xml doc element Message-ID: Is there a python xml code for accessing element of an HTML or XML document in a way like myDocument.documentElement.body[0].center[0].table[2].tr[0].td[0].table[1].t r[0].td[1].font[0].h4[0].pcdata[0] the idea is to use the dom interface of a document and to define the __getattr__ (tag) method of the Node class in the xml.dom.core module in order to calls the method getElementByTagName(tag). Unfortunetly this method is already defined and personally modifying it is not a clean solution any idea ? Michel Ps: Please reply me directly since i did not subscribe to the mailing list From larsga@ifi.uio.no Wed Jun 2 10:57:32 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 02 Jun 1999 11:57:32 +0200 Subject: [XML-SIG] accessing to xml doc element In-Reply-To: References: Message-ID: * PLU Michel | | Is there a python xml code for accessing element of an HTML or XML document | in a way like | | myDocument.documentElement.body[0].center[0].table[2].tr[0].td[0].table[1].t | r[0].td[1].font[0].h4[0].pcdata[0] XPointers can do this. My PyPointers does this, but is currently not updated to the latest PyDOM version. However, 4DOM comes with a version that works with 4DOM. I plan to update PyPointers, but with SAX2, easySAX, JPython SAX and my thesis hanging over me there isn't that much time for it... --Lars M. From Jacco.van.Ossenbruggen@cwi.nl Wed Jun 2 15:05:02 1999 From: Jacco.van.Ossenbruggen@cwi.nl (J.R. van Ossenbruggen) Date: Wed, 02 Jun 1999 16:05:02 +0200 Subject: [XML-SIG] patch to xml/dom/esis_builder.py In-Reply-To: Your message of "Tue, 01 Jun 1999 14:37:38 MET DST." <14164.10354.491850.334248@weyr.cnri.reston.va.us> Message-ID: On Tue, Jun 1 1999 "Fred L. Drake" wrote: > I like this, but have two changes. First, the default convert > function could be str instead of a lambda; this would be faster since > str() is implemented in C (or Java in JPython). Agreed. > The second change concerns this part of the patch: > > > ! name = self.convert(l[0]) > > ! if l[1] == 'IMPLIED': > > ! # fix this. Needs to be undefined attr > > ! value = '' > > ! else: > > ! value = ESISDecode(l[2]) > > self.attr_store[name] = value > > This could be something like this: > > if l[1] != 'IMPLIED': > self.attr_store[self.convert(l[0])] = ESISDecode(l[2]) > > This does just as much as needed, and doesn't create the bogus > attribute entry in the dictionary. You're right again. I was under the impression #IMPLIED attributes should create a bogus attribute with specified=false. I just reread the spec to see that this impression was false. Thanks a lot! 
Jacco PS: a new version of the patch with Fred's changes: Index: esis_builder.py =================================================================== RCS file: /projects/cvsroot/xml/dom/esis_builder.py,v retrieving revision 1.5 diff -c -r1.5 esis_builder.py *** esis_builder.py 1999/03/18 12:38:28 1.5 --- esis_builder.py 1999/06/02 13:04:26 *************** *** 27,37 **** class EsisBuilder(Builder): ! def __init__(self): Builder.__init__(self) self.attr_store = {} self.id_store = {} #self.sdata_handler = handle_sdata def feed(self, data): for line in string.split(data, '\n'): --- 27,39 ---- class EsisBuilder(Builder): ! def __init__(self, convert=str): Builder.__init__(self) self.attr_store = {} self.id_store = {} #self.sdata_handler = handle_sdata + # convert may, for example, be used to handle case conversion + self.convert = convert def feed(self, data): for line in string.split(data, '\n'): *************** *** 41,46 **** --- 43,49 ---- text = line[1:] if event == '(': + text = self.convert(text) element = self.document.createElement(text, self.attr_store) self.attr_store = {} self.push(element) *************** *** 50,58 **** elif event == 'A': l = re.split(' ', text, 2) ! name = l[0] ! value = ESISDecode(l[2]) ! self.attr_store[name] = value elif event == '-': text = self.document.createText(ESISDecode(text)) --- 53,61 ---- elif event == 'A': l = re.split(' ', text, 2) ! name = self.convert(l[0]) ! if l[1] != 'IMPLIED': ! self.attr_store[self.convert(l[0])] = ESISDecode(l[2]) elif event == '-': text = self.document.createText(ESISDecode(text)) From danda@netscape.com Thu Jun 3 00:14:23 1999 From: danda@netscape.com (Dan Libby) Date: Wed, 02 Jun 1999 16:14:23 -0700 Subject: [XML-SIG] Re: RSS and stuff References: <3754E5B9.96A9FD54@netscape.com> Message-ID: <3755BACF.BDC54B15@netscape.com> This is a multi-part message in MIME format. --------------1E252B7D3F134ACBF841820B Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Lars, Thanks for your response. I have forwarded it to others here who are involved with RSS. Below are my responses. > This is definitely a good idea. Sadly, though, many of the RSS files > on the net are not even well-formed. The ones for WebMonkey and > python.org spring to mind. > I assume you mean they are not well-formed because they embed entities? > | However, in my spare time I've been working on a generic validator > | that will read in a schema file (of my own devise, not a real XML > | schema) that's written in XML, and then validate a document based on > | that. > > Hmmm. Why not use a real XML schema? It should support everything I > can imagine you would want anyway. Or is it too complex? 1) It's a spec. A very complex spec. I don't know of any software that implements it. I don't have time to write such software, given our development schedules which are measured in days. I just want something that is flexible enough that we can change our format without having to write a bunch of new code. When XML schemas are well supported, then we should be able to move to those quite easily, provided they have a superset of our functionality. Besides, if I tried, I would probably end up with something that is close to XML schemas, but not exact, so then we have unexpected behavior, etc. This way, it is obviously not an xml schema, just "Dan's validation rules" DTD. 2) I may have just missed it, but I didn't see any support for limiting length of strings. 3) The time support is IS0 8601 only, which is itself a very complicated subject. 
(aside: anyone know of a python module to parse dates according to 8601?). I would like to see support for unix/c style integer timestamps (seconds since 1970 UNC, as returned by time() ). We tend to use these a lot. Also for unix/c style date string as returned by `date`. eg: Sun May 30 19:24:15 PDT 1999. I already forwarded this request to the xml schema folks. > Seriously, these things aren't as important as many people think. And > it's also worth remembering that XML comes from a document background > where such things are not all that relevant. (Imagine trying to do > this for HTML. Actually enforcing correct use of DFN, H1-H6, ABBR, > ACRONYM, VAR, ADDRESS and all the other elements would require a > serious number of years of AI development in Prolog or Common Lisp.) > They are important to us. We need to store this stuff in a database. We need to make sure some joker hasn't given us a string that is 20 megabytes long, and further that we won't be putting HTML into our generated page that breaks the entire page. We also need to be able to tell end-users (webmasters) whether the data they have given us will actually be displayed correctly or not. I think that as XML becomes used for data transfer, as opposed to document transfer, people will be more and more concerned about this. E-commerce especially is going to require a very specific set of enforceable rules for validity. For some reason, people tend to become very upset when money is involved. ;-) > > | What would you like to see / not see in the format? It really is > | just supposed to be a summary. > > The first thing I'd like to see is a date element for items. Many RSS > providers currently use something like: > > > (19990602) New foo! > ... > > and it would be useful to formalize that as: > > > 19990602 > ... > Agreed. I had this in the original spec, but was removed for public release, since we were not actually going to use the value. What do you think of <timestamp> (seconds since 1970) </timestamp> instead? Again, I'm not fond of parsing IS0 8601. > The second thing is descriptions for items. I'm thinking of providing > an RSS feed for my home page, and when I do I know I will want to be > able to have entries like: > > <item> > <date>19990602</date> > <title>RSS feed available! > I now provide an RSS feed which lists all updates to > my home page. This will hopefully make it easier for people > This should be possible. Again, we didn't support stuff like this originally, because will not actually use the data in the "description" tag anywhere on My Netscape, and because our (old) validator code had to know about description rules for each location it is used. As others are now using the format, I can see where it would make sense, and it should be easy to add this as an optional element if I can convince people to use my new validation code. > A third thing is a place to put the email address of the maintainer so > that I know where to complain when a document isn't well-formed. > hmm. I assume you think this should be inside the tag? This is where would be nice... > - "RSS 0.9 supports the full ASCII character set, as well as all > legal decimal and HTML entities. RSS 0.9 does not support other > types of character data, such as UTF-8. For a list of legal HTML and > decimal entities, refer to Special Symbols and Entities on DevEdge, > Netscape's information resource for developers." > We are updating this to support UTF-8 soon, and possibly other encodings. I promise to post a DTD soon. 
;-) > - Also, what's the relationship with RDF? RSS uses the RDF root > element, but does not conform to the RDF syntax or actually use > anything meaningful from RDF. This boils down to internal politics. If you click on the "Future Directions" link in the quickstart (http://my.netscape.com/publish/help/futures.html), I have an example of the original RSS format I came up with, which does make meaningful use of RDF (channels have IDs, all nodes connect, dublin core is used, etc.) However, apparently this "overly complicated". There are other technical reasons I can't really go into. Anyway, for now, RSS is basically an XML format, and it may eventually have an RDF superset. [regarding posted RSS DTD] Thanks. I'll take a look at this, run it through a validating parser, etc. Do you mind if we post it, or a slightly modified version, as the "official" DTD? > This implies ordering, correct? ie, title, then description, then link? A problem I had with DTDs is that I couldn't figure out how to say that an element is required, and that ordering is unimportant. Therefore, if I posted this DTD now, it would mean that a whole bunch of existing channels are invalid. The other option is to use (title | description | link), but this means that they are optional, which is even less correct. > I've done exactly the same for XSA and have exactly the same problem > as you. I provided a DTD and have special validating software that > rides on top of a validator (xmlproc). If I were to do it again > there's no question that I would do the same thing. So far there has > been no confusion at all (although I've seen HTML users become > confused by this). What is this special validating software? Is it generic, or does it know specifically about your format? If generic, what do you use as input to define the validaton rules? My apologies if this is all explained in detail somwhere... ;-) -dan --------------1E252B7D3F134ACBF841820B Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------1E252B7D3F134ACBF841820B-- From wunder@infoseek.com Thu Jun 3 01:04:20 1999 From: wunder@infoseek.com (Walter Underwood) Date: Wed, 02 Jun 1999 17:04:20 -0700 Subject: [XML-SIG] Re: RSS and stuff In-Reply-To: <3755BACF.BDC54B15@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> Message-ID: <3.0.5.32.19990602170420.00a96990@corp> At 04:14 PM 6/2/99 -0700, Dan Libby wrote: > >3) The time support is IS0 8601 only, which is itself a very complicated subject. >(aside: anyone know of a python module to parse dates according to 8601?). I >would like to see support for unix/c style integer timestamps (seconds since 1970 >UNC, as returned by time() ). It's not that bad. Insist on the web profile of ISO 8601 and there are only five formats. Do an sscanf or re.match for each format, and when one converts, do time.mktime() with what you've just parsed. And let's try to avoid using seconds-since-the-epoch in external formats. We're just now doing the Y2K thing, so I don't think it is a good idea to use formats that fall apart in 2037. 
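A rough sketch of that match-each-format approach, for illustration only: the pattern list below is a simplification of the web profile (not the official list), and mktime() interprets the parsed fields as local time, so a real version would still have to handle the zone designator.

import re
import time

# Simplified approximation of the web-profile date formats.
_PATTERNS = [
    r'(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$',
    r'(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d)Z$',
    r'(\d{4})-(\d\d)-(\d\d)$',
    r'(\d{4})-(\d\d)$',
    r'(\d{4})$',
]

def parse_iso8601(s):
    "Return a Unix timestamp for an ISO 8601 date string, or None."
    for pattern in _PATTERNS:
        m = re.match(pattern, s)
        if m:
            fields = [1970, 1, 1, 0, 0, 0]
            groups = m.groups()
            for i in range(len(groups)):
                fields[i] = int(groups[i])
            # mktime() takes a 9-tuple; weekday and yearday are ignored,
            # dst of -1 means "unknown".
            return time.mktime(tuple(fields) + (0, 0, -1))
    return None
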
Here is what it takes to convert a Unix timestamp to ISO 8601: def make_date(timeint): return time.strftime('%Y-%m-%dT%H:%M:%SZ',time.gmtime(timeint)) wunder -- Walter R. Underwood wunder@infoseek.com wunder@best.com (home) http://software.infoseek.com/cce/ (my product) http://www.best.com/~wunder/ 1-408-543-6946 From danda@netscape.com Thu Jun 3 02:04:22 1999 From: danda@netscape.com (Dan Libby) Date: Wed, 02 Jun 1999 18:04:22 -0700 Subject: [XML-SIG] Re: RSS and stuff References: <3754E5B9.96A9FD54@netscape.com> <3.0.5.32.19990602170420.00a96990@corp> Message-ID: <3755D495.112CA66F@netscape.com> This is a multi-part message in MIME format. --------------630938801F57E97C1046FD98 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Walter, thanks for the code example. > And let's try to avoid using seconds-since-the-epoch in external > formats. We're just now doing the Y2K thing, so I don't think it > is a good idea to use formats that fall apart in 2037. I thought it was 2038. ;-) Seems like we should all be using long longs by then - greater than 32 bits anyway, so I'm not sure it is such a big problem. Anyway, the nice thing about the integer is that they are guaranteed accurate to the second. With ISO 8601, the receiver needs to round (nearest day, hour, minute, second). Besides, if unix breaks then, people are gonna have bigger worries than RSS displaying 1970. > Here is what it takes to convert a Unix timestamp to ISO 8601: > > def make_date(timeint): > return time.strftime('%Y-%m-%dT%H:%M:%SZ',time.gmtime(timeint)) > Right, but my thinking is that it is easier for people if we just support it natively than if they have to figure out how to do that in their sed script or whatever. --------------630938801F57E97C1046FD98 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------630938801F57E97C1046FD98-- From akuchlin@mems-exchange.org Thu Jun 3 14:14:37 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Thu, 3 Jun 1999 09:14:37 -0400 (EDT) Subject: [XML-SIG] Re: RSS and stuff In-Reply-To: <3755BACF.BDC54B15@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> Message-ID: <14166.32701.896825.371306@amarok.cnri.reston.va.us> Dan Libby writes: >(aside: anyone know of a python module to parse dates according to 8601?). The XML package contains xml.utils.iso8601.py, contributed by Fred Drake. -- A.M. Kuchling http://starship.python.net/crew/amk/ America is a country that doesn't know where it is going but is determined to set a speed record getting there. -- Laurence J. Peter From wunder@infoseek.com Fri Jun 4 16:49:44 1999 From: wunder@infoseek.com (Walter Underwood) Date: Fri, 04 Jun 1999 08:49:44 -0700 Subject: [XML-SIG] Re: RSS and stuff In-Reply-To: <3755D495.112CA66F@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <3.0.5.32.19990602170420.00a96990@corp> Message-ID: <3.0.5.32.19990604084944.00aa0340@corp> At 06:04 PM 6/2/99 -0700, Dan Libby wrote: >Walter, thanks for the code example. > >> And let's try to avoid using seconds-since-the-epoch in external >> formats. 
We're just now doing the Y2K thing, so I don't think it >> is a good idea to use formats that fall apart in 2037. > >I thought it was 2038. ;-) Seems like we should all be using >long longs by then - greater than 32 bits anyway, so I'm not sure >it is such a big problem. We've been parsing dates for date search in our engine, and the Unix timestamp has real problems. No time zone, for example. >Anyway, the nice thing about the integer is that they are guaranteed >accurate to the second. With ISO 8601, the receiver needs to round >(nearest day, hour, minute, second). With the timestamp, does the number of seconds include all the leap seconds since 1970? It should, but does it? Does Apache on Amiga do the right thing? To be pedantic, the Unix timestamp format is precise but may not be accurate. Lots of content has a meaningful precision other than one second. Press Releases are on a certain day. Books are published in a particular month. Forcing meaningless precision on those things is a mistake. Finally, the seconds thing totally falls apart if you need to express dates outside it's tiny range: photograph taken in 1893, an HP atomic clock app note written in 1964, etc. Internally, the right way to handle this is to carry a precision along with the time. DCE has some routines to do this. The DCE Time Services Spec is listed here, but it's not free: http://www.opengroup.org/public/pubs/catalog/c310.htm I'll see if I can hunt down some non-pay man pages. wunder -- Walter R. Underwood wunder@infoseek.com wunder@best.com (home) http://software.infoseek.com/cce/ (my product) http://www.best.com/~wunder/ 1-408-543-6946 From danda@netscape.com Fri Jun 4 22:32:23 1999 From: danda@netscape.com (Dan Libby) Date: Fri, 04 Jun 1999 14:32:23 -0700 Subject: [XML-SIG] Re: RSS and stuff References: <3754E5B9.96A9FD54@netscape.com> <3.0.5.32.19990602170420.00a96990@corp> <3.0.5.32.19990604084944.00aa0340@corp> Message-ID: <375845E7.3D85145@netscape.com> This is a multi-part message in MIME format. --------------EE201692D10CF2927E688DF2 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit > We've been parsing dates for date search in our engine, and the > Unix timestamp has real problems. No time zone, for example. > It is understood to be in UNC. If you need to convert, you do so with localtime() or equivalent. > Lots of content has a meaningful precision other than one second. > Press Releases are on a certain day. Books are published in a > particular month. Forcing meaningless precision on those things > is a mistake. In general, I prefer "too much" precision to too little. For example, if we need to display a timestamp for when this article was created in a consistent notation, we may include all the way down to the minute. If they have given us something like "June 1999", it places the onus on we, the receiver to round to the nearest day, hour, minute, second. The timestamp method places it on the sender, who should know more accurately. > Finally, the seconds thing totally falls apart if you need to express > dates outside it's tiny range: photograph taken in 1893, an HP atomic > clock app note written in 1964, etc. True.... but not many web pages were created before 1970, and this format is supposed to be describing web pages. (Site Summary) What do you think about this for a compromise, two different tags: ISO 6501 seconds since 1970, UNC Or alternatively: ... ... 
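A small illustration of that two-tag compromise, generated from a single time value; the element names and the helper name are only examples, not part of any agreed format:

import time

def item_dates(t):
    # ISO 8601 (UTC) plus the raw seconds-since-1970 value.
    iso = time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime(t))
    return '<date>%s</date><timestamp>%d</timestamp>' % (iso, int(t))
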
--------------EE201692D10CF2927E688DF2 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------EE201692D10CF2927E688DF2-- From larsga@ifi.uio.no Sat Jun 5 11:50:00 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 05 Jun 1999 12:50:00 +0200 Subject: [XML-SIG] Re: RSS and stuff In-Reply-To: <3755BACF.BDC54B15@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> Message-ID: * Lars Marius Garshol | | Sadly, though, many of the RSS files on the net are not even | well-formed. The ones for WebMonkey and python.org spring to mind. * Dan Libby | | I assume you mean they are not well-formed because they embed | entities? Actually, no. The WebMonkey file is not well-formed because the XML declaration does not begin the document (if they removed it all would be well; I've emailed them, but to no avail) and the python.org one is not well-formed because it has a ... pair. * Lars Marius Garshol | | Hmmm. Why not use a real XML schema? It should support everything I | can imagine you would want anyway. Or is it too complex? * Dan Libby | | 1) It's a spec. A very complex spec. I don't know of any software | that implements it. I don't have time to write such software, given | our development schedules which are measured in days. I just want | something that is flexible enough that we can change our format | without having to write a bunch of new code. When XML schemas are | well supported, then we should be able to move to those quite | easily, provided they have a superset of our functionality. | Besides, if I tried, I would probably end up with something that is | close to XML schemas, but not exact, so then we have unexpected | behavior, etc. This way, it is obviously not an xml schema, just | "Dan's validation rules" DTD. | 2) I may have just missed it, but I didn't see any support for | limiting length of strings. I don't think there is any. | 3) The time support is IS0 8601 only, which is itself a very | complicated subject. Walter Underwood and AMK have already dealt with this, so I'll just skip it here. * Lars Marius Garshol | | [on the topic of XML and data typing] | | Seriously, these things aren't as important as many people think. | And it's also worth remembering that XML comes from a document | background where such things are not all that relevant. * Dan Libby | | They are important to us. We need to store this stuff in a database. | We need to make sure some joker hasn't given us a string that is 20 | megabytes long, Sure, but in the original SGML context this wasn't a problem in the same way. | I think that as XML becomes used for data transfer, as opposed to | document transfer, people will be more and more concerned about | this. E-commerce especially is going to require a very specific set | of enforceable rules for validity. Definitely, and for this very reason I've been advocating that the W3C schema language should be extensible, so that the e-commerce and EDI communities (and other communities with special needs) can build on what's already defined. | For some reason, people tend to become very upset when money is | involved. ;-) Strange. 
Can't think why that would be. :) | [dates in RSS] | | Agreed. I had this in the original spec, but was removed for public | release, since we were not actually going to use the value. What do | you think of (seconds since 1970) instead? I don't like it. Most people will be authoring RSS by hand or generate it automatically from some hand-written source. When writing RSS by hand seconds since 1970 is out of the question and when generating it with XSL I don't think this transformation is possible. Also, seconds since 1970 is not human-readable or intuitive in any way. | Again, I'm not fond of parsing IS0 8601. A simple requirement like YYYYMMDD would be sufficient, I think. (Even not requiring anything at all should be acceptable, but in this case YYYYMMDD might be the best choice.) | [item descriptions in RSS] | | This should be possible. Again, we didn't support stuff like this | originally, because will not actually use the data in the | "description" tag anywhere on My Netscape, and because our (old) | validator code had to know about description rules for each location | it is used. As others are now using the format, I can see where it | would make sense, and it should be easy to add this as an optional | element if I can convince people to use my new validation code. Good! I'm crossing my fingers here. :) * Lars Marius Garshol | | A third thing is a place to put the email address of the maintainer so | that I know where to complain when a document isn't well-formed. * Dan Libby | | hmm. I assume you think this should be inside the tag? Yes. | This is where would be nice... Ouch, no. , perhaps. Dublin Core doesn't mandate the syntax of DC element contents, but using the email address here doesn't feel very right. Also: one thing I detest about this use of namespaces is that it gives you no choice in naming (except in the prefix, which I don't think should be abused). Something like: would be much better. * Lars Marius Garshol | | - "RSS 0.9 supports the full ASCII character set, as well as all | legal decimal and HTML entities. RSS 0.9 does not support other | types of character data, such as UTF-8. For a list of legal HTML and | decimal entities, refer to Special Symbols and Entities on DevEdge, | Netscape's information resource for developers." * Dan Libby | | We are updating this to support UTF-8 soon, and possibly other | encodings. Hmmm. Which parser(s) are you using? | I promise to post a DTD soon. ;-) Good. :) | [RSS and RDF] | | If you click on the "Future Directions" link in the quickstart | (http://my.netscape.com/publish/help/futures.html), I have an | example of the original RSS format I came up with, which does make | meaningful use of RDF (channels have IDs, all nodes connect, dublin | core is used, etc.) Hmmm. Maybe there's something about RDF I've missed, but this doesn't appear to be correct RDF either. Shouldn't the RDF document be just a sequence of RDF statements, with custom elements inside the statements? | However, apparently this "overly complicated". I think that's correct. Do you think this proposal would have caught on the way RSS 0.9 has? (Sometimes I think we should all re-read worse-is-better every morning. :) | [regarding posted RSS DTD] | | Thanks. I'll take a look at this, run it through a validating | parser, etc. Do you mind if we post it, or a slightly modified | version, as the "official" DTD? Not at all. Does this mean that I captured your view of RSS correctly? * Lars Marius Garshol | | * Dan Libby | | This implies ordering, correct? 
ie, title, then description, then | link? Yes. | A problem I had with DTDs is that I couldn't figure out how to say | that an element is required, and that ordering is unimportant. In XML there isn't any. Schemas currently allow this, as do SGML DTDs. You can do it by explicitly allowing choices between all the possible different sequences, but for n elements the number of sequences equals n factorial. | Therefore, if I posted this DTD now, it would mean that a whole | bunch of existing channels are invalid. Ouch. Not good. However, why did you allow any ordering? If the order doesn't matter it may as well be fixed, especially as this causes much less pain in specifying a DTD. I don't see the harm anywhere either. | The other option is to use (title | description | link), but this | means that they are optional, which is even less correct. I agree, this is an ugly problem, but it's mainly caused by being insufficently restrictive to begin with. | [XSA custom validator] | | What is this special validating software? Is it generic, or does it | know specifically about your format? If generic, what do you use as | input to define the validaton rules? I use a DTD as a declarative means of specifying the hard bits (allowed elements and nesting), and then Python code to deal with element content typing. (This is not generic at the moment. After reading the XML schemas draft I'm working on an implementation of the data types part which would be completely generic and not even depend on schemas.) Since the DTD handles everything except element content this works well and is really easy. Also, the DTD works well as documentation and people can also use it to guide XML-aware editors and so on. | My apologies if this is all explained in detail somwhere... ;-) It's not. :) --Lars M. From danda@netscape.com Sat Jun 5 22:48:30 1999 From: danda@netscape.com (Dan Libby) Date: Sat, 05 Jun 1999 14:48:30 -0700 Subject: [XML-SIG] Re: RSS and stuff References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> Message-ID: <37599B2E.A238ED59@netscape.com> This is a multi-part message in MIME format. --------------E13B6852273E859CE3A7C91D Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit > * Dan Libby > | > | They are important to us. We need to store this stuff in a database. > | We need to make sure some joker hasn't given us a string that is 20 > | megabytes long, > > Sure, but in the original SGML context this wasn't a problem in the > same way. > I'm not sure what you mean here. Why wasn't it a problem? Probably because people were using SGML to transfer "documents", rather than "data", and possibly because the publishers were always trusted? > Definitely, and for this very reason I've been advocating that the W3C > schema language should be extensible, so that the e-commerce and EDI > communities (and other communities with special needs) can build on > what's already defined. Yes! > | This is where would be nice... > > Ouch, no. , perhaps. Dublin Core doesn't mandate the > syntax of DC element contents, but using the email address here > doesn't feel very right. > Well, if used correctly in an RDF context, would just be an arc-label that refers to a node that represents you. That node would have other arc-labels named eg: email-address, first name, last name, country, etc. > | We are updating this to support UTF-8 soon, and possibly other > | encodings. > > Hmmm. Which parser(s) are you using? > Errg. xmlproc. (XMLValParserFactory.make_parser()). 
I've been talking to Jose, our i18n guy, and it sounds like Python is not internally UTF-8 compliant, but he isn't concerned for some reason... > | If you click on the "Future Directions" link in the quickstart > | (http://my.netscape.com/publish/help/futures.html), I have an > | example of the original RSS format I came up with, which does make > | meaningful use of RDF (channels have IDs, all nodes connect, dublin > | core is used, etc.) > > Hmmm. Maybe there's something about RDF I've missed, but this doesn't > appear to be correct RDF either. Shouldn't the RDF document be just a > sequence of RDF statements, with custom elements inside the statements? > RDF is about a directed labelled graph. As long as you comply with that data model, the actual name of the elements (vocabulary) is secondary. Check out some of the .rdf's at mozilla.org. It looks similar. Also, I validated it with SirPac (http://www.w3.org/RDF/Implementations/SiRPAC/) and with our chief rdf guru, guha (whose name appears on xml-schema docs, etc). (Note: the version of Sirpac currently installed has a bug that causes the visualized graph to be disconnected) > Not at all. Does this mean that I captured your view of RSS correctly? Well, close. I removed the ordering dependencies you had, and also added support for HTML 3.2 entities. I'll post a draft soon. -dan --------------E13B6852273E859CE3A7C91D Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------E13B6852273E859CE3A7C91D-- From danda@netscape.com Mon Jun 7 04:10:57 1999 From: danda@netscape.com (Dan Libby) Date: Sun, 06 Jun 1999 20:10:57 -0700 Subject: [XML-SIG] xmlproc, dtd's, and such References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> Message-ID: <375B3840.68F36912@netscape.com> This is a multi-part message in MIME format. --------------E992E119899C2F89F9E0AF08 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Okay, so I'm using xmlproc for some DTD based validation. However, I don't want to go off to the network every time I have to validate a new file, which means I will have to cache locally somehow. I saw an earlier thread on this topic which seemed to indicate that this should be easy, but it didn't actually elaborate. Can anyone tell me specifically what class/methods to override? One approach I might imagine is that the parser would call some sort of openDTD() function that I could override. In there, I would have a map from the public DTD url to a local file. Alternatively, there could already be some pre-built caching code. Further, ideally, I would like to do all this through the sax interface in a non parser specific manner. Is that asking too much? 
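A sketch of the kind of lookup described above, built on the SAX EntityResolver interface shown in Lars's reply below; the public identifier, local filename, and document name are made up for illustration:

from xml.sax import saxexts

# Hypothetical mapping from public identifiers to local DTD copies.
LOCAL_DTDS = {
    '-//Netscape Communications//DTD RSS 0.9//EN': 'rss-0.9.dtd',
}

class CachingResolver:
    def resolveEntity(self, publicId, systemId):
        # Serve a cached local copy when we have one; otherwise let the
        # parser fetch the original system identifier.
        return LOCAL_DTDS.get(publicId, systemId)

parser = saxexts.make_parser('xml.sax.drivers.drv_xmlproc_val')
parser.setEntityResolver(CachingResolver())
parser.parse('feed.xml')
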
-dan --------------E992E119899C2F89F9E0AF08 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------E992E119899C2F89F9E0AF08-- From larsga@ifi.uio.no Mon Jun 7 06:57:11 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 07 Jun 1999 07:57:11 +0200 Subject: [XML-SIG] xmlproc, dtd's, and such In-Reply-To: <375B3840.68F36912@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> Message-ID: * Dan Libby | | Okay, so I'm using xmlproc for some DTD based validation. However, | I don't want to go off to the network every time I have to validate | a new file, which means I will have to cache locally somehow. Basically, what determines where xmlproc will look for the DTD is the document itself and the public and system identifiers in the DOCTYPE declaration. If you set those correctly, xmlproc will look for the DTD where you want. You can also use a catalog file to control the resolution of the public identifier, but in SAX 1.0 there is no standard way to give the parser a pointer to the catalog file. If you don't trust the system and public identifiers and want to control this yourself you can use the EntityResolver interface. Here's an example: from xml.sax import saxexts class EntityResolver: def resolveEntity(self, publicId, systemId): print "PUBID: "+`publicId`+"\tSYSID: "+`systemId` return systemId parser=saxexts.make_parser("xml.sax.drivers.drv_xmlproc_val") parser.setEntityResolver(EntityResolver()) parser.parse("test.xml") The first call to resolveEntity will be for the external DTD subset and if you want to control where that is read from, just return the system identifier you want to use. (If you want to use a catalog file in a standard way at the moment, this is how. xmlproc comes with a SAX EntityResolver which reads and uses a catalog file.) I hope this helped, --Lars M. From larsga@ifi.uio.no Mon Jun 7 07:06:56 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 07 Jun 1999 08:06:56 +0200 Subject: [XML-SIG] Re: RSS and stuff In-Reply-To: <37599B2E.A238ED59@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> Message-ID: * Dan Libby | | I'm not sure what you mean here. Why wasn't it a problem? Probably | because people were using SGML to transfer "documents", rather than | "data", and possibly because the publishers were always trusted? Both reasons applied, yes. Most SGML applications were in-house, and so employees were usually trusted not to play dirty tricks, although one did want to check for mistakes. * Lars Marius Garshol | | Ouch, no. , perhaps. Dublin Core doesn't mandate the | syntax of DC element contents, but using the email address here | doesn't feel very right. * Dan Libby | | Well, if used correctly in an RDF context, would just | be an arc-label that refers to a node that represents you. That | node would have other arc-labels named eg: email-address, first | name, last name, country, etc. 
In an RDF context it would be different, but I fear you're giving up on RSS being easy to author, support and understand then. And I still don't really like that way of reusing the semantics of the Dublin Core creator element. | Errg. xmlproc. (XMLValParserFactory.make_parser()). I've been | talking to Jose, our i18n guy, and it sounds like Python is not | internally UTF-8 compliant, but he isn't concerned for some | reason... Well, xmlproc should parse UTF-8 files just fine at the moment, as long as you don't use characters above 127 in names, name tokens or character references. Your application can then view the strings it gets from xmlproc as byte arrays and simply do its own Unicode handling to whatever extent is needed. (You can even handle character references yourself by overriding a method in xmlproc and passing UTF-8-encoded characters to the application.) --Lars M. From danda@netscape.com Tue Jun 8 05:17:44 1999 From: danda@netscape.com (Dan Libby) Date: Mon, 07 Jun 1999 21:17:44 -0700 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc References: <3754E5B9.96A9FD54@netscape.com> Message-ID: <375C9968.3574D93@netscape.com> This is a multi-part message in MIME format. --------------9D9B38E890EEB29BE54CC4B6 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit 1) The version of xmlproc I have does not appear to support any encoding other than "iso-8859-1". It returns an error for any other value. Before, when we were using xmllib, it simply called handle_xml(), where we were able to look at the encoding value and make appropriate decisions at the application level. Does xmlproc have any equivalent functionality, perhaps in a more recent version? 2) Explanation: I need to preserve XML/HTML entities. For example, if the document contains % then I want to print that out exactly, not the parsed/converted value. If I don't do this, then any random person can embed html markup, etc, which could break an HTML page. This was pretty easy using xmllib - a non-validating parser, because it simply calls my handler for all entities it encounters and I can provide the mapping. However, with xmlproc, it doesn't seem to call any callback that I can find, it simply looks up the entity in its map and returns it, or else spits out an error 3021: Undeclared Entity. So my question is: Is there a suggested workaround? I suppose I could always pre-process the document before giving it to the parser, but that seems pretty messy. -dan --------------9D9B38E890EEB29BE54CC4B6 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------9D9B38E890EEB29BE54CC4B6-- From larsga@ifi.uio.no Tue Jun 8 07:15:07 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 08 Jun 1999 08:15:07 +0200 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc In-Reply-To: <375C9968.3574D93@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> Message-ID: * Dan Libby | | 1) The version of xmlproc I have does not appear to support any | encoding other than "iso-8859-1". This is correct. 
(Well, US-ASCII will also work, as will anything else that is based on US-ASCII as long as you don't try to use funny characters in names or name tokens.) | [...] Before, when we were using xmllib, it simply called | handle_xml(), where we were able to look at the encoding value and | make appropriate decisions at the application level. Does xmlproc | have any equivalent functionality, perhaps in a more recent version? The functionality is there, but not used at the moment. If you look at the charconv module you'll see that it contains conversion code for various encodings as well as registry object for converters. If you want I can easily add the hooks that would let you use this functionality. The reason I haven't done this so far is that there seemed to be no demand for this functionality. | 2) Explanation: I need to preserve XML/HTML entities. For example, | if the document contains % then I want to print that out | exactly, not the parsed/converted value. If I don't do this, then | any random person can embed html markup, etc, which could break an | HTML page. Hmmm. The cleanest solution to this (from an XML/SGML point of view) is probably to use string.replace to escape all '<'s in character data when it is passed to you from the parser. That would also let you retain parser independence and is cleaner in the sense that it becomes more obvious what you're really doing. | However, with xmlproc, it doesn't seem to call any callback that I | can find, it simply looks up the entity in its map and returns it, | or else spits out an error 3021: Undeclared Entity. So my question | is: Is there a suggested workaround? If you don't like the solution above you may want to subclass XMLProcessor in xmlproc.py and write your own versions of parse_charref and parse_ent_ref. Instead of rewriting parse_ent_ref you could also just declare the entities you need in the DTD, and break into the entity hashtable and modify the value of '<'. (I can show you how.) If you don't like any of these solutions, let me know, and we'll think of something. Also: do you need an option to disallow element and attribute declarations in the internal subset? --Lars M. From danda@netscape.com Tue Jun 8 11:22:38 1999 From: danda@netscape.com (Dan Libby) Date: Tue, 08 Jun 1999 03:22:38 -0700 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> Message-ID: <375CEEEE.5CDB4E8F@netscape.com> This is a multi-part message in MIME format. --------------131E02F85FB01E50934F9D12 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Lars Marius Garshol wrote: > * Dan Libby > | > | 1) The version of xmlproc I have does not appear to support any > | encoding other than "iso-8859-1". > > This is correct. (Well, US-ASCII will also work, as will anything else > that is based on US-ASCII as long as you don't try to use funny > characters in names or name tokens.) > > | [...] Before, when we were using xmllib, it simply called > | handle_xml(), where we were able to look at the encoding value and > | make appropriate decisions at the application level. Does xmlproc > | have any equivalent functionality, perhaps in a more recent version? > > The functionality is there, but not used at the moment. If you look at > the charconv module you'll see that it contains conversion code for > various encodings as well as registry object for converters. 
Yes, I saw that while I was grepping for something or other and figured it looked interesting, but was not sure how to plug it in. > If you want I can easily add the hooks that would let you use this > functionality. The reason I haven't done this so far is that there > seemed to be no demand for this functionality. > I would appreciate that. (Consider this 'demand') Actually, Jose is more the demand than I am. Those crazy i18n guys... ;-) If it is a simple change, perhaps you can just send us a diff or something? > | 2) Explanation: I need to preserve XML/HTML entities. For example, > | if the document contains % then I want to print that out > | exactly, not the parsed/converted value. If I don't do this, then > | any random person can embed html markup, etc, which could break an > | HTML page. > > Hmmm. The cleanest solution to this (from an XML/SGML point of view) > is probably to use string.replace to escape all '<'s in character data > when it is passed to you from the parser. That would also let you > retain parser independence and is cleaner in the sense that it becomes > more obvious what you're really doing. > Yes, that is actually the solution I came up with also. It doesn't really seem that clean to me, because if there is a character above 127 that we want to replace with an entity, it gets funny depending on which encoding is in use. Whereas in the old model, we simply had a map from eg "180" to "´" that we returned to the parser and similarly things like "quot" to "&quot;". I tried doing this with entity declarations in the DTD and xmlproc just for kicks. It would allow it for character based entity names, but didn't allow any names starting with a numeric. That means that < would still slip by, even though we could catch < eg: > If you don't like the solution above you may want to subclass > XMLProcessor in xmlproc.py and write your own versions of > parse_charref and parse_ent_ref. > yeah... icky. I like being parser independent. ;-) > Instead of rewriting parse_ent_ref you could also just declare the > entities you need in the DTD, and break into the entity hashtable and > modify the value of '<'. (I can show you how.) > I think that is what I just mentioned trying above, but maybe you mean something else? > If you don't like any of these solutions, let me know, and we'll think > of something. > Replacing afterwards seems to work ok. Really we are mostly just concerned with the "<" and ">". > Also: do you need an option to disallow element and attribute > declarations in the internal subset? Sorry, I'm not sure what this means. What is the internal subset? BTW, Lars, I saw your name in an XML book my roommate just picked up. I forget the title, but it listed xmlproc. Oh, and just now I saw my friend Jim's name on the python profiling page. Totally random! 
cheers, -dan --------------131E02F85FB01E50934F9D12 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------131E02F85FB01E50934F9D12-- From larsga@ifi.uio.no Tue Jun 8 11:40:09 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 08 Jun 1999 12:40:09 +0200 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc In-Reply-To: <375CEEEE.5CDB4E8F@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> <375CEEEE.5CDB4E8F@netscape.com> Message-ID: * Dan Libby | | [charconv.py] | | I would appreciate that. (Consider this 'demand') OK. This is a very simple change, and I've written the code before, so I should be able to do this in a couple of days (am very busy at the moment, and only write email while waiting for compiles and such). | If it is a simple change, perhaps you can just send us a diff or | something? You'll get a ZIP file with 0.61.1 in it. (Easier, I think.) * Lars Marius Garshol | | Hmmm. The cleanest solution to this (from an XML/SGML point of view) | is probably to use string.replace to escape all '<'s in character | data when it is passed to you from the parser. That would also let | you retain parser independence and is cleaner in the sense that it | becomes more obvious what you're really doing. * Dan Libby | | Yes, that is actually the solution I came up with also. It doesn't | really seem that clean to me, because if there is a character above | 127 that we want to replace with an entity, it gets funny depending | on which encoding is in use. Well, you control the encoding (after it's gone through xmlproc), so this shouldn't be a problem. | Whereas in the old model, we simply had a map from eg "180" to | "´" that we returned to the parser and similarly things like | "quot" to "&quot;". What you're doing here is letting code control the interpretation of the document, which isn't really all that clean. With and without custom code the document would be different when parsed. Simply remapping characters in the output is IMHO a lot cleaner in that the separation between code and document is clear. | I tried doing this with entity declarations in the DTD and xmlproc | just for kicks. It would allow it for character based entity names, | but didn't allow any names starting with a numeric. This is because < is not an entity reference, it's a direct reference to the Unicode character U+0074, and so it's no wonder that you're not allowed to define such an entity. | I like being parser independent. ;-) Good! It bothers me that most people seem to prefer being chained to whatever product they're using (parser, database, whatever). * Lars Marius Garshol | | Also: do you need an option to disallow element and attribute | declarations in the internal subset? * Dan Libby | | Sorry, I'm not sure what this means. What is the internal subset? Here's an example: is the internal subset --> ]> Vondt! OK! Godt! ... xmlproc and all other validating parsers would let this pass with no complaints at all. I suppose you may not want that. Oh, and BTW, can I list My Netscape on the xmlproc page as xmlproc users? (I'm sure you're desperate for the extra hits. :) --Lars M. 
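A minimal sketch of the escaping Lars suggests, in the string-module style used elsewhere in this thread (the helper name is invented; '&' is replaced first so the entity references produced by the later substitutions are not escaped twice):

import string

def escape_chardata(text):
    # Escape markup-significant characters in character data before it is
    # embedded in an HTML page.
    text = string.replace(text, "&", "&amp;")
    text = string.replace(text, "<", "&lt;")
    text = string.replace(text, ">", "&gt;")
    return text

Running every characters() callback through a helper like this keeps the check in one place and stays independent of which parser produced the data.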
From jim@digicool.com Tue Jun 8 19:28:54 1999 From: jim@digicool.com (Jim Fulton) Date: Tue, 08 Jun 1999 18:28:54 +0000 Subject: [XML-SIG] While we're on the subject of xmlproc, DTDs and validation ... References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> Message-ID: <375D60E6.3200CF2F@digicool.com> Some musings.... I'd like to have a very fast and simple parser that can do validation. I'm looking at: - Using (or stealing parts of) xmlproc to parse DTDs, - Using pyexpat, - Writing a C thing that does the validation using data structures (possibly derived from data structures) produced by xmlproc. - Writing a simple C thing that plugs into the C validator, which plugs into pyexpat and takes tables of start and end tag handlers and processes XML to produce Python objects. I've modified pyexpat so that it will spit out the DTD info. (I plan to post an updated pyexpat that implements the full C expat interface defined in the latest stable expat release, unless someone beats me to it. ;) I find that if I tell xmlproc to parse a file containing only a DTD, it will build the DTD related data structures for me, but: - I wonder if there is or should be a tool designed just to do this. Maybe there already is one that I've missed. - Can I rely on the data structures created by the current xmlproc? I'd like to have a tool for processing DTDs independent of parsing XML: - To make it possible to bolt validation onto non-validating parsers, - To separate implementation of validation from implementation of basic parsing and from application object building code. For example, I think handlers that build application objects can be alot simpler if they don't have to check validity. - Allow applications to provide DTDs for documents that don't have them (e.g. xml-rpc marchals). Thoughts? Jim -- Jim Fulton mailto:jim@digicool.com Python Powered! Technical Director (888) 344-4332 http://www.python.org Digital Creations http://www.digicool.com http://www.zope.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list with out my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From danda@netscape.com Tue Jun 8 21:55:44 1999 From: danda@netscape.com (Dan Libby) Date: Tue, 08 Jun 1999 13:55:44 -0700 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> <375CEEEE.5CDB4E8F@netscape.com> Message-ID: <375D8350.6F0F46C6@netscape.com> This is a multi-part message in MIME format. --------------581B652EDB59D0C2E09B5BB5 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit > | Sorry, I'm not sure what this means. What is the internal subset? > > Here's an example: > > > > > ]> > > > > Vondt! > OK! > Godt! > > > ... > > > > xmlproc and all other validating parsers would let this pass with no > complaints at all. I suppose you may not want that. > Oh, I see. Yeah, I wondered what would happen in that case. You're right that we wouldn't want it, however I have a secondary pseudo-schema checker that would not allow the unknown tags, so its not really a problem for us. Further, since "channel" was already defined in the external dtd, shouldn't that generate an error, or does the parser just override it with the internal subset definition? 
> Oh, and BTW, can I list My Netscape on the xmlproc page as xmlproc > users? (I'm sure you're desperate for the extra hits. :) Sure; it's not actually in production yet, but I'm certainly using it. ;-) -dan --------------581B652EDB59D0C2E09B5BB5 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------581B652EDB59D0C2E09B5BB5-- From danda@netscape.com Tue Jun 8 21:58:52 1999 From: danda@netscape.com (Dan Libby) Date: Tue, 08 Jun 1999 13:58:52 -0700 Subject: [XML-SIG] While we're on the subject of xmlproc, DTDs and validation ... References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> Message-ID: <375D840C.3AD7C01C@netscape.com> This is a multi-part message in MIME format. --------------A93EB40EB4CDB212C19ACA84 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit > > - Allow applications to provide DTDs for documents that don't > have them (e.g. xml-rpc marchals). > oh! This would be cool. For RSS 0.9, we didn't require a DTD, but now I'm validating against one. So basically I'm pre-processing the buffer and inserting a DTD. Kind of a hack, but it works. I'd prefer a general solution. -dan --------------A93EB40EB4CDB212C19ACA84 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------A93EB40EB4CDB212C19ACA84-- From larsga@ifi.uio.no Tue Jun 8 23:26:53 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 09 Jun 1999 00:26:53 +0200 Subject: [XML-SIG] Re: While we're on the subject of xmlproc, DTDs and validation ... In-Reply-To: <375D60E6.3200CF2F@digicool.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> Message-ID: * Jim Fulton | | I'd like to have a very fast and simple parser that can do | validation. Hmmm. Maybe a better option than what you've been looking at would be RXP, which is an all-C validating parser. It's a little bit slower than expat, but that should drown in the time occupied by the Python callbacks anyway. I've been thinking about writing a Python interface to RXP, but am not really into C extensions yet and haven't got the time at the moment. | - Using (or stealing parts of) xmlproc to parse DTDs, This is easily possible, and it will buy you some performance, although probably not as much as you'd wish. (Especially for large DTDs xmlproc is slow.) | (I plan to post an updated pyexpat that implements the full | C expat interface defined in the latest stable expat release, | unless someone beats me to it. ;) Great! When you do I'll update the SAX driver. 
| I find that if I tell xmlproc to parse a file containing only a DTD, | it will build the DTD related data structures for me, but: | | - I wonder if there is or should be a tool designed | just to do this. Maybe there already is one that I've | missed. xmlproc comes with a dtdparser.py module which gives you an event-based interface to DTDs. Combined with the classes in xmldtd.py this gives you the ability to parse a DTD without an associated document. Look in the demo directory for dtddoc.py, which is an example of this. | - Can I rely on the data structures created by the current | xmlproc? Sorry, I don't understand the question. What do you mean by 'rely'? | I'd like to have a tool for processing DTDs independent of | parsing XML: | | [excellent reasons snipped] Yup. These were all part of my motivation for making the DTD parsing module of xmlproc separate from the rest. --Lars M. From jim@digicool.com Wed Jun 9 02:25:45 1999 From: jim@digicool.com (Jim Fulton) Date: Wed, 09 Jun 1999 01:25:45 +0000 Subject: [XML-SIG] Re: While we're on the subject of xmlproc, DTDs and validation ... References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> Message-ID: <375DC299.42477A0A@digicool.com> Lars Marius Garshol wrote: > > * Jim Fulton > | > | I'd like to have a very fast and simple parser that can do > | validation. > > Hmmm. Maybe a better option than what you've been looking at would be > RXP, which is an all-C validating parser. > > I'll check it out. I'm a little bit worried about the license, which is GPL. Maybe I can get him to change it to LGPL. > It's a little bit slower than expat, but that should drown in the time > occupied by the Python callbacks anyway. True, although for alot of our projects, we'll probably write many (most?) of the callbacks in C. > I've been thinking about writing a Python interface to RXP, but am not > really into C extensions yet and haven't got the time at the moment. > > | - Using (or stealing parts of) xmlproc to parse DTDs, > > This is easily possible, and it will buy you some performance, > although probably not as much as you'd wish. (Especially for large > DTDs xmlproc is slow.) In Most cases, I'd expect to amortize DTD parsing over many documents, either by preprocessing standard DTDs or catching DTDs. (snip) > | I find that if I tell xmlproc to parse a file containing only a DTD, > | it will build the DTD related data structures for me, but: > | > | - I wonder if there is or should be a tool designed > | just to do this. Maybe there already is one that I've > | missed. > > xmlproc comes with a dtdparser.py module which gives you an > event-based interface to DTDs. Combined with the classes in xmldtd.py > this gives you the ability to parse a DTD without an associated > document. I suspected this, but I had trouble figuring out the interface. > Look in the demo directory for dtddoc.py, which is an > example of this. Ah, thanks. That should help alot. > | - Can I rely on the data structures created by the current > | xmlproc? > > Sorry, I don't understand the question. What do you mean by 'rely'? I'll write something that takes as input the data structures created internally whan xmlproc parses a document. If you change those data structures, my software will break. :) > | I'd like to have a tool for processing DTDs independent of > | parsing XML: > | > | [excellent reasons snipped] > > Yup. 
These were all part of my motivation for making the DTD parsing > module of xmlproc separate from the rest. Cool. Jim -- Jim Fulton mailto:jim@digicool.com Technical Director (888) 344-4332 Python Powered! Digital Creations http://www.digicool.com http://www.python.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list with out my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From tpassin@idsonline.com Wed Jun 9 04:46:15 1999 From: tpassin@idsonline.com (Thomas B. Passin) Date: Tue, 8 Jun 1999 23:46:15 -0400 Subject: [XML-SIG] While we're on the subject of xmlproc, DTDs and validation ... Message-ID: <003101beb22a$a0fa95a0$1c15b0cf@tpassinids> Jim Fulton wrote > ..Allow applications to provide DTDs for documents that don't have them (e.g. xml-rpc marchals). Yes, I think this can be very useful. But if you reverse-engineer a DTD from any existing document, there is no unique solution. The program will therefore try to guess what it should do, and the result will have to be hand-adjusted to make it more usable. I found a product (I forget right now which one) that can create a DTD from an XML example, and tried it. Interesting results, and I had to work on the DTD by hand. Tom Passin From jim@digicool.com Wed Jun 9 12:27:42 1999 From: jim@digicool.com (Jim Fulton) Date: Wed, 09 Jun 1999 11:27:42 +0000 Subject: [XML-SIG] While we're on the subject of xmlproc, DTDs and validation ... References: <003101beb22a$a0fa95a0$1c15b0cf@tpassinids> Message-ID: <375E4FAE.559FC624@digicool.com> "Thomas B. Passin" wrote: > > Jim Fulton wrote > > > ..Allow applications to provide DTDs for documents that don't > have them (e.g. xml-rpc marchals). > > Yes, I think this can be very useful. But if you reverse-engineer a DTD from any existing document, there is no unique solution. That's not what I'm thinking of. My application might require data that follows a DTD, but I might not require incoming data to include the DTD (or even a reference to it). There are XML formats (e.g. XML-RPC) around that are precisely defined, but not with a DTD. I can come up with a DTD for them and validate conforming data that, of course, doesn't include or reference a DTD. Also, by separating DTD parsing from validation, there could be a way of using schema data in other formats (e.g. RSS schema?), as long as the other formats could be parsed into the same data structures that DTD's get parsed to. Jim -- Jim Fulton mailto:jim@digicool.com Technical Director (888) 344-4332 Python Powered! Digital Creations http://www.digicool.com http://www.python.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list with out my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From Ted.Horst@wdr.com Wed Jun 9 15:27:26 1999 From: Ted.Horst@wdr.com (Ted Horst) Date: Wed, 9 Jun 99 09:27:26 -0500 Subject: [XML-SIG] Re: While we're on the subject of xmlproc, DTDs and validation ... In-Reply-To: <375DC299.42477A0A@digicool.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> <375DC299.42477A0A@digicool.com> Message-ID: <199906091427.AA19568@ch1d2833nwk> You might also check out the xml parser in ILU. 
It is an all C validationg parser as well, and the license is less restrictive. ftp://ftp.parc.xerox.com/pub/ilu/ilu.html Ted Horst On Wed, 09 Jun 1999, Jim Fulton wrote: > Lars Marius Garshol wrote: > > > > * Jim Fulton > > | > > | I'd like to have a very fast and simple parser that can do > > | validation. > > > > Hmmm. Maybe a better option than what you've been looking at would be > > RXP, which is an all-C validating parser. > > > > > > I'll check it out. I'm a little bit worried about the license, > which is GPL. Maybe I can get him to change it to LGPL. From Fred L. Drake, Jr." References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> Message-ID: <14176.12088.820355.924645@weyr.cnri.reston.va.us> Lars Marius Garshol writes: > be well; I've emailed them, but to no avail) and the python.org one is > not well-formed because it has a ... pair. I can't find this in our work area or on the server, so this problem has aged away. ;-) -Fred -- Fred L. Drake, Jr. Corporation for National Research Initiatives From larsga@ifi.uio.no Thu Jun 10 22:54:08 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 10 Jun 1999 23:54:08 +0200 Subject: [XML-SIG] Re: While we're on the subject of xmlproc, DTDs and validation ... In-Reply-To: <375DC299.42477A0A@digicool.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> <375DC299.42477A0A@digicool.com> Message-ID: * Jim Fulton | | In Most cases, I'd expect to amortize DTD parsing over many | documents, either by preprocessing standard DTDs or catching DTDs. I've looked at this and found it to not be entirely straightforward due to the problems introduced by the internal subset. Also, early tests showed that the speedup from using pickle to load DTD objects was just by a factor of 4 (if I remember correctly) over normal DTD parsing. Anyway, if you disallow the internal subset I have most of the code necessary to do this written although no integrated yet. | [using dtdparser.py] | | I suspected this, but I had trouble figuring out the interface. Feel free to ask if you find the documentation hard to understand. That will help me improve it (once I have the time to do so, at least). | [reliability of xmldtd.py structure] | | I'll write something that takes as input the data structures created | internally whan xmlproc parses a document. If you change those data | structures, my software will break. :) The intention is that all documented APIs in xmlproc should remain unchanged as far as possible (although they can be extended and also change semantics in backward-compatible ways), and although there are many things I would like to clean up I refrain from doing so. So, yes, you should be able to rely on them. --Lars M. From danda@netscape.com Thu Jun 10 23:08:41 1999 From: danda@netscape.com (Dan Libby) Date: Thu, 10 Jun 1999 15:08:41 -0700 Subject: [XML-SIG] Re: While we're on the subject of xmlproc, DTDs and validation ... References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> <375DC299.42477A0A@digicool.com> Message-ID: <37603769.D0A16E7C@netscape.com> This is a multi-part message in MIME format. 
--------------FC7C188F8BF054F1DA0AE26A Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit > I've looked at this and found it to not be entirely straightforward > due to the problems introduced by the internal subset. Also, early > tests showed that the speedup from using pickle to load DTD objects > was just by a factor of 4 (if I remember correctly) over normal DTD > parsing. > Well... excluding potentially having to grab the file off the network somewhere, which would be the slowest operation. That's why in my code I check if the external DTD is in my map, and if so, use a local copy. If the pickling sped it up by another factor of 4, that would be great. > Anyway, if you disallow the internal subset I have most of the code > necessary to do this written although no integrated yet. > I would be very interested in this. -dan --------------FC7C188F8BF054F1DA0AE26A Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------FC7C188F8BF054F1DA0AE26A-- From jim@digicool.com Fri Jun 11 13:45:19 1999 From: jim@digicool.com (Jim Fulton) Date: Fri, 11 Jun 1999 08:45:19 -0400 Subject: [XML-SIG] Re: While we're on the subject of xmlproc, DTDs and validation ... References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> <375DC299.42477A0A@digicool.com> Message-ID: <376104DF.EE7367A3@digicool.com> Lars Marius Garshol wrote: > > * Jim Fulton > > | [using dtdparser.py] > | > | I suspected this, but I had trouble figuring out the interface. > > Feel free to ask if you find the documentation hard to understand. > That will help me improve it (once I have the time to do so, at > least). Actually, I somehow failed to notice the xmlproc directory in the doc directory, so I missed the docs altogether. Jim -- Jim Fulton mailto:jim@digicool.com Python Powered! Technical Director (888) 344-4332 http://www.python.org Digital Creations http://www.digicool.com http://www.zope.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list with out my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From larsga@ifi.uio.no Fri Jun 11 14:17:48 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 11 Jun 1999 15:17:48 +0200 Subject: [XML-SIG] Re: While we're on the subject of xmlproc, DTDs and validation ... In-Reply-To: <37603769.D0A16E7C@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> <375DC299.42477A0A@digicool.com> <37603769.D0A16E7C@netscape.com> Message-ID: * Lars Marius Garshol | | Anyway, if you disallow the internal subset I have most of the code | necessary to do this written although no integrated yet. * Dan Libby | | I would be very interested in this. Then I'll try to get together an xmlproc 0.62 with this and the charconv stuff. 
If I can find a free night during the weekend I'll sit down and do all these other little things that have been piling up and put out a slew of new releases. --Lars M. From fredrik@pythonware.com Mon Jun 14 16:32:46 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 14 Jun 1999 17:32:46 +0200 Subject: [XML-SIG] Re: RSS and stuff References: <3754E5B9.96A9FD54@netscape.com> Message-ID: <00e101beb67b$27814b10$f29b12c2@pythonware.com> Dan wrote: > What would you like to see / not see in the format? It really is just > supposed to be a summary. Ideally, we would like to support all of > Dublin Core eventually, but the problem is that the additional data may > not actually be used, and marketing folks felt it would be simpler to > not confuse folks too much. just noticed that the my.userland.com folks are also discussing RDF extensions and supersets. if anyone's interested, check: http://discuss.userland.com/msgReader$7333 http://alchemy.openjava.org/ocs/ From Jeff Rush" (DOM) => Python Object Recipes Message-ID: <199906170716.3331398.6@summit-research.com> I'm just starting to get into XML and just joined this list. I'm working on a Python agent-program that visits bank web pages and fetches checkbook registers, parsing the HTML via the python-xml-0.5.1 stuff into a DOM tree. When finished, it will then spit some DTD flavor of XML into a digitally-signed/encrypted email msg. What I'm looking for is better extraction of HTML tables. Has anyone written a good class for that? I've got a crude one, but am hoping others have done extensive parsing of pages using XML and developed a toolkit. -Jeff Rush ----- cut here ----- class ExtractTable(xml.dom.walker.Walker): def __init__(self, tablenode, trim=0, headings=1, allrows=1): self.rows = [] self.row = [] self.text = "" self.nowhitespace = trim self.keepheadings = headings self.allrows = allrows self.walk(tablenode) def startElement(self, node): if node.get_nodeName() == 'TR': self.row = [] elif self.keepheadings and node.get_nodeName() == 'TH': self.text = "" self.row.append({}) elif node.get_nodeName() == 'TD': self.text = "" self.row.append({}) def endElement(self, node): if self.keepheadings and node.get_nodeName() == 'TH': self.row[-1].update( {'type': 'header', 'value': self.text} ) elif node.get_nodeName() == 'TD': self.row[-1].update( {'type': 'data', 'value': self.text} ) elif node.get_nodeName() == 'A' : self.row[-1]['link'] = node.getAttribute('HREF') elif node.get_nodeName() == 'TR': if self.allrows or len(self.row) > 0: self.rows.append(self.row) def doText(self, node): str = node.get_data() while len(str) and str[0] in ('\r', '\n'): str = str[1:] if self.nowhitespace: str = string.strip(str) self.text = self.text + str def doComment(self, node): pass def doOtherNode(self, node): str = { 'nbsp': ' ' }.get(node.get_nodeName(), None) if str is not None: self.text = self.text + str def ExtractLinks(topnode): """Scan and extract all links in given subtree of HTML page""" links = [] for node in topnode.getElementsByTagName('A'): url = node.getAttribute('HREF') if url: links.append(url) return links ----- cut here ----- From Jeff Rush" I've checked the XML-SIG mailing list archives and the latest CVS for updates to dom/transformer.py but didn't see any. Hence... Bug #1: Throughout the dom/transformer.py, reference is made to 'NodeType' but the correct name is 'nodeType'. 
Bug #2: While trying to create a subclass of Transformer, in order to strip out HTML formatting/graphics tags, I hit a problem where v0.5.1 of Transformer won't modify the DOM tree it walks. ----- old code ----- new_children = [] for child in node.getChildren(): new_children = new_children + self._transform_node(child) node._children = new_children ----- old code ----- Nodes don't have a '_children' attribute and besides, this doesn't update the node's parentdict, hence any changes are not seen by the higher DOM tree levels. ----- new code ------ new_children = [] for child in node.childNodes: new_children = new_children + self._transform_node(child) for child in node.childNodes[:] : # Remove Old Children node.removeChild(child) for child in new_children: # And Replace with (0 or more) New node.appendChild(child) ----- new code ----- Suggestion #1: Define a __call__ method in the Transformer class that calls the existing transform method, so the following works: class FormatStripper(Transformer): .... strip_formatting = FormatStripper() strip_formatting(doc) I can now write my stripping transformers as: ---------- cut here ---------- class FormatStripper(xml.dom.transformer.Transformer): def do_FONT(self, node): return node.childNodes def do_B(self, node): return node.childNodes def do_I(self, node): return node.childNodes strip_formatting = FormatStripper() class GraphicsStripper(Transformer): def do_HR(self, node): return [] # Remove Horizontal Rules def do_IMG(self, node): return [] # Remove Images def do_MAP(self, node): return [] # Remove Image Maps def do_BODY(self, node): node.removeAttribute("BACKGROUND") node.removeAttribute("BGCOLOR") return [node] strip_graphics = GraphicsStripper() .... doc = strip_formatting( strip_graphics( doc ) ) ---------- cut here ---------- If acceptable, I'd like to see some form of these added to the dom.utils module; they seem to fit in with the strip_whitespace function. -Jeff Rush From steynj@postino.up.ac.za Sat Jun 19 14:30:45 1999 From: steynj@postino.up.ac.za (Jacques Steyn) Date: Sat, 19 Jun 1999 15:30:45 +0200 Subject: [XML-SIG] Inquiry: Python Message-ID: <376B9B85.2627308B@postino.up.ac.za> How can one obtain the Python XML software? Thanks Jacques -- ______________________________________________ Jacques Steyn (PhD) Associate Professor: Multimedia Department of Information Science School for Information Technology University Pretoria Pretoria South Africa Tel +27 12 420 4258 Fax +27 12 362 5181 Email: jsteyn@up.ac.za Web: Information Science http://is.up.ac.za School for Information Technology http://sit.up.ac.za From larsga@ifi.uio.no Sat Jun 19 15:10:08 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 19 Jun 1999 16:10:08 +0200 Subject: [XML-SIG] Inquiry: Python In-Reply-To: <376B9B85.2627308B@postino.up.ac.za> References: <376B9B85.2627308B@postino.up.ac.za> Message-ID: * Jacques Steyn | | How can one obtain the Python XML software? You can find it here: --Lars M. From r.hooft@euromail.net Sun Jun 20 10:30:04 1999 From: r.hooft@euromail.net (Rob Hooft) Date: Sun, 20 Jun 1999 11:30:04 +0200 (MZT) Subject: [XML-SIG] Inquiry: Python In-Reply-To: References: <376B9B85.2627308B@postino.up.ac.za> Message-ID: <14188.46236.604650.495677@octopus.chem.uu.nl> >>>>> "LMG" == Lars Marius Garshol writes: | | How can one obtain the Python XML software? LMG> You can find it here: LMG> Please note that the text on http://www.python.org/sigs/xml-sig/status.html still points to an older version. Maybe that page should be revised. 
Regards, Rob Hooft. -- ===== R.Hooft@EuroMail.net http://www.xs4all.nl/~hooft/rob/ ===== ===== R&D, Nonius BV, Delft http://www.nonius.nl/ ===== ===== PGPid 0xFA19277D ========================== Use Linux! ========= From Jeff Rush" If by chance you are running some form of Linux that supports the RPM packaging technology, you can grab an easy-to-install XML RPM at my web page: http://starship.python.net/crew/jrush/XML/ -Jeff Rush On Sat, 19 Jun 1999 15:30:45 +0200, Jacques Steyn wrote: >How can one obtain the Python XML software? >Thanks >Jacques From fredrik@pythonware.com Sun Jun 20 16:13:52 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Sun, 20 Jun 1999 17:13:52 +0200 Subject: [XML-SIG] ann: new sgmlop snapshot References: <199906190442.4130325.6@summit-research.com> Message-ID: <001401bebb2f$820cda50$f29b12c2@pythonware.com> subject says most of it; get your copy here: http://www.pythonware.com/madscientist/ coming soon: a lightweight "dom" layer on top of the new Element datatype, unicode support, and more. From Jeff Rush" Is this the same sgmlop as in the XML-SIG CVS? Have your most recent changes gotten into the CVS or should I add your tarball to my RPM explicitly in order to stay up-to-date re sgmlop? an-xml-newbie-not-sure-how-all-the-sig-pieces-are-managed-ly y'rs - Jeff On Sun, 20 Jun 1999 17:13:52 +0200, Fredrik Lundh wrote: >subject says most of it; get your copy here: > >http://www.pythonware.com/madscientist/ > >coming soon: a lightweight "dom" layer on >top of the new Element datatype, unicode >support, and more. > > > > >_______________________________________________ >XML-SIG maillist - XML-SIG@python.org >http://www.python.org/mailman/listinfo/xml-sig > From danda@netscape.com Sun Jun 20 23:00:09 1999 From: danda@netscape.com (Dan Libby) Date: Sun, 20 Jun 1999 15:00:09 -0700 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> <375CEEEE.5CDB4E8F@netscape.com> <375D8350.6F0F46C6@netscape.com> Message-ID: <376D6469.C3A561C8@netscape.com> This is a multi-part message in MIME format. --------------06E333EA1D62942978B070C3 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Is it possible to reference more than one external DTD? If so, how? I'm hoping that it is possible to include an external DTD from within the internal subset. This would basically allow for limited inheritance. -dan Dan Libby wrote: > > Here's an example: > > > > > > > > > > > ]> > > --------------06E333EA1D62942978B070C3 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------06E333EA1D62942978B070C3-- From larsga@ifi.uio.no Sun Jun 20 23:38:59 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 21 Jun 1999 00:38:59 +0200 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc In-Reply-To: <376D6469.C3A561C8@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> <375CEEEE.5CDB4E8F@netscape.com> <375D8350.6F0F46C6@netscape.com> <376D6469.C3A561C8@netscape.com> Message-ID: * Dan Libby | | Is it possible to reference more than one external DTD? If so, how? 
| I'm hoping that it is possible to include an external DTD from | within the internal subset. It is. Or you could do it from the external subset. | This would basically allow for limited inheritance. Hmmm. Some sub-typing would be possible in this way, yes. However, if you want to do that properly you should look at architectural forms. They're much simpler than they sound, and with Geir Ove's xmlarch they're also easy to use. See for more info. (xmlarch is also in the XML-SIG package.) --Lars M. From danda@netscape.com Mon Jun 21 08:09:51 1999 From: danda@netscape.com (Dan Libby) Date: Mon, 21 Jun 1999 07:09:51 +0000 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> <375CEEEE.5CDB4E8F@netscape.com> <375D8350.6F0F46C6@netscape.com> <376D6469.C3A561C8@netscape.com> Message-ID: <376DE53F.CF441F4@netscape.com> This is a multi-part message in MIME format. --------------8CA34653FD46448DBE65918D Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Lars Marius Garshol wrote: > * Dan Libby > | > | Is it possible to reference more than one external DTD? If so, how? > | I'm hoping that it is possible to include an external DTD from > | within the internal subset. > > It is. Or you could do it from the external subset. > Yeah, but how? I tried the following with xmlproc: %otherdtd; ]> This always gives me the error: Illegal construct at 5:3 I tried other variations on the same theme of course, but with similar results. Both files are in the correct path. -dan --------------8CA34653FD46448DBE65918D Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;501 E Middlefield Rd;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com title:Coder Surfer x-mozilla-cpt:;0 fn:Dan Libby end:vcard --------------8CA34653FD46448DBE65918D-- From larsga@ifi.uio.no Mon Jun 21 09:50:59 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 21 Jun 1999 10:50:59 +0200 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc In-Reply-To: <376DE53F.CF441F4@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> <375CEEEE.5CDB4E8F@netscape.com> <375D8350.6F0F46C6@netscape.com> <376D6469.C3A561C8@netscape.com> <376DE53F.CF441F4@netscape.com> Message-ID: * Dan Libby | | Yeah, but how? I tried the following with xmlproc: | | | | %otherdtd; | ]> | | This always gives me the error: | Illegal construct at 5:3 This works perfectly for me with the following two files: %ext; ]> and in test2.dtd: This works for me with both the xmlproc in my CVS tree and the one in the XML-SIG CVS tree (which is 0.61), with both validating and non-validating parsing. Which version do you have? (Give me the CVS ID tag in dtdparser.py to be 100% sure that it's right.) --Lars M. From fredrik@pythonware.com Mon Jun 21 13:49:17 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 21 Jun 1999 14:49:17 +0200 Subject: [XML-SIG] ann: new sgmlop snapshot References: <199906201652.1434729.6@summit-research.com> Message-ID: <007901bebbe4$7a340640$f29b12c2@pythonware.com> Jeff Rush wrote: > Is this the same sgmlop as in the XML-SIG CVS? well, I haven't put it there... 
> Have your most recent changes gotten into the CVS or > should I add your tarball to my RPM explicitly in order to > stay up-to-date re sgmlop? beats me. I just write this stuff, I don't know what people do with it... > an-xml-newbie-not-sure-how-all-the-sig-pieces-are-managed-ly y'rs - Jeff no different from me, then ;-) Cheers /F From larsga@ifi.uio.no Mon Jun 21 14:08:42 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 21 Jun 1999 15:08:42 +0200 Subject: [XML-SIG] ann: new sgmlop snapshot In-Reply-To: <199906201652.1434729.6@summit-research.com> References: <199906201652.1434729.6@summit-research.com> Message-ID: * Jeff Rush | | Is this the same sgmlop as in the XML-SIG CVS? That one is dated 03.Dec.98, so I very much doubt it. AMK probably hasn't had the time to add it in yet. | Have your most recent changes gotten into the CVS or should I add | your tarball to my RPM explicitly in order to stay up-to-date re | sgmlop? I suppose that depends on whether you want your RPMs to reflect the latest XML-SIG package or the latest released software. Perhaps the best is if you get write access to the CVS and can help make things so that the RPM can actually be both at the same time. --Lars M. From akuchlin@mems-exchange.org Mon Jun 21 14:25:30 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Mon, 21 Jun 1999 09:25:30 -0400 (EDT) Subject: [XML-SIG] ann: new sgmlop snapshot In-Reply-To: <199906201652.1434729.6@summit-research.com> References: <199906201652.1434729.6@summit-research.com> Message-ID: <14190.15690.90340.112376@amarok.cnri.reston.va.us> Jeff Rush writes: >Is this the same sgmlop as in the XML-SIG CVS? Have >your most recent changes gotten into the CVS or should >I add your tarball to my RPM explicitly in order to stay >up-to-date re sgmlop? No, I haven't gotten around to updating the CVS tree. Nor have I gotten around to mailing out the passwords for write access to the CVS tree to various people; will try to do that today... -- A.M. Kuchling http://starship.python.net/crew/amk/ We are always living in the final days. What have you got? A hundred years or much, much less until the end of your world. -- From SIGNAL TO NOISE From danda@netscape.com Mon Jun 21 19:27:59 1999 From: danda@netscape.com (Dan Libby) Date: Mon, 21 Jun 1999 11:27:59 -0700 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> <375CEEEE.5CDB4E8F@netscape.com> <375D8350.6F0F46C6@netscape.com> <376D6469.C3A561C8@netscape.com> <376DE53F.CF441F4@netscape.com> Message-ID: <376E842E.411A16F8@netscape.com> This is a multi-part message in MIME format. --------------0956DD3E644C13D7D92EBF9B Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Okay, stupid mistake. I have code in EntityResolver that maps a network address to a local address. When I moved it out of doctype into an entity, I forgot about that, so it was really pointing at nothing. It works now. One thing puzzles me though: the comment in EntityResolver indicates that resolveEntity will be called to resolve all external entities. Instead, I only see it called for the !DOCTYPE tag, not entities in the internal or external subsets. Also, that error message could use some work, "file not found" is easier to understand. ;-) I don't see any CVS tag in dtdparser.py, but the one in xmlproc.py is: $Id: xmlproc.py,v 1.7 1999/02/10 01:46:03 amk Exp $ -dan Lars Marius Garshol wrote: > * Dan Libby > | > | Yeah, but how? 
I tried the following with xmlproc: > | > | > | | > | %otherdtd; > | ]> > | > | This always gives me the error: > | Illegal construct at 5:3 > > This works perfectly for me with the following two files: > > > %ext; > ]> > > > > > and in test2.dtd: > > > > This works for me with both the xmlproc in my CVS tree and the one in > the XML-SIG CVS tree (which is 0.61), with both validating and > non-validating parsing. Which version do you have? (Give me the CVS > ID tag in dtdparser.py to be 100% sure that it's right.) > > --Lars M. > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://www.python.org/mailman/listinfo/xml-sig --------------0956DD3E644C13D7D92EBF9B Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------0956DD3E644C13D7D92EBF9B-- From fredrik@pythonware.com Mon Jun 21 20:47:51 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 21 Jun 1999 21:47:51 +0200 Subject: [XML-SIG] ann: XML RPC client library for Python Message-ID: <01ab01bebc1f$b88e63a0$f29b12c2@pythonware.com> The xmlrpclib module is a client-side implementation of Userland's XML-RPC protocol (www.xmlrpc.com). This protocol allows you to transfer data between Python environments and applications written in for example Java and Perl. It it also fully supported by Userland's Frontier application, of course. Upcoming versions of Zope also speak XML RPC; see http://linux.userland.com/stories/storyReader$18 for more information. This release (0.9.8) uses the sgmlop XML parser if possible. With that parser in place, the XML-RPC packet decoder is up to 20 times faster than before. This release also includes sample XML-RPC servers based on SocketServer and Medusa. Get your copy from: http://www.pythonware.com/products/xmlrpc The most recent version of sgmlop can be downloaded from: http://www.pythonware.com/madscientist
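For anyone curious what a call looks like, a client session is only a few lines; a hypothetical example (the Server class follows the xmlrpclib documentation, and the server URL and method name are purely illustrative):

import xmlrpclib

# Connect to a (hypothetical) XML-RPC server and call a remote procedure;
# arguments and results are marshalled to and from XML automatically.
server = xmlrpclib.Server("http://betty.userland.com")
print server.examples.getStateName(41)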

xmlrpclib - XML RPC client library for Python (21-Jun-99) From akuchlin@mems-exchange.org Tue Jun 22 16:37:56 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Tue, 22 Jun 1999 11:37:56 -0400 (EDT) Subject: [XML-SIG] New CVS server, etc. Message-ID: <199906221537.LAA22882@amarok.cnri.reston.va.us> I've finally gotten around to actually informing people of the new CVS server. (This, some 2 weeks after Greg Stein actually set it up...) The new anonymous CVS server is at: :pserver:anoncvs@cvs.lyra.org:/home/cvsroot Set your CVSROOT environment variable to this, or use the -d flag to specify the server. Consult the anonymous CVS Web page at http://www.python.org/sigs/xml-sig/status.html for detailed instructions on checking out the development tree. Some of you will have received accounts and passwords for write access. You can simply check out a copy of the tree under your account, and then begin making modifications. There's a mailing list for check-in messages, xml-checkins@python.org: anyone can join it at: http://www.python.org/mailman/listinfo/xml-checkins/ You don't need to have write access to the tree to read the checkin mailing list. -- A.M. Kuchling http://starship.python.net/crew/amk/ Autumn, to me the most congenial of seasons: the University, to me the most congenial of lives. -- Robertson Davies, _The Rebel Angels_ From bottoni@cadlab.it Wed Jun 23 08:06:11 1999 From: bottoni@cadlab.it (Alessandro Bottoni) Date: Wed, 23 Jun 1999 09:06:11 +0200 Subject: [XML-SIG] (no subject) Message-ID: <004401bebd46$e036ca00$1f2b2bc1@cadlab.it> This is a multi-part message in MIME format. ------=_NextPart_000_0041_01BEBD57.A37809B0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable unsubscribe ------=_NextPart_000_0041_01BEBD57.A37809B0 Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable

------=_NextPart_000_0041_01BEBD57.A37809B0-- From danda@netscape.com Wed Jun 23 09:51:14 1999 From: danda@netscape.com (Dan Libby) Date: Wed, 23 Jun 1999 01:51:14 -0700 Subject: [XML-SIG] More entity stuff References: <004401bebd46$e036ca00$1f2b2bc1@cadlab.it> Message-ID: <3770A002.541C20F2@netscape.com> This is a multi-part message in MIME format. --------------AD18C212066C00108106D028 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Okay, so I have a DTD with a bunch of entities copied from the html 3.2 dtd. They look like this: When this is run through xmlproc (xmlval), the entities are ignored. sort of. If I change "¢" to "hello", then hello gets spit out. Further, if I change it to "&#162;" then "#162" gets spit out. This is actually okay with me... I'm just trying to preserve the entity for a browser's use anyway. However, it seems like a weird DTD. So my question: Is this a bug in the dtd parser, or is this correct behavior? If the latter, does my DTD hack seem like the right thing to do? thx. -dan Alessandro Bottoni wrote: > unsubscribe --------------AD18C212066C00108106D028 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------AD18C212066C00108106D028-- From fredrik@pythonware.com Wed Jun 23 11:01:40 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 23 Jun 1999 12:01:40 +0200 Subject: [XML-SIG] ann: newschannel maker References: Message-ID: <003801bebd5f$648345a0$f29b12c2@pythonware.com> a while ago, Lars wrote: > I sat down yesterday and had a look at RSS, a format for news > headlines which is used by Slashdot, mozilla.org and Scripting News, > among others. It was very simple (a bit too simple, in fact), so I sat > down and made a simple RSS library and client in Python. This client > produces a web page when it is run. (I run it from cron.) potential news providers might be interested in my little "newschannel" tool, available from: http://www.pythonware.com/madscientist/ this tool reads an HTML document (from file or from a site), and extracts news items marked with special tags. it then generates perfectly valid RDF and Scripting- News 2.0 (*) news channel files. see the README and the mynews.py sample script for more information. http://www.pythonware.com/news.rdf http://www.pythonware.com/people/fredrik/news.rdf (replace .rdf with .xml for scriptingnews versions) *) the "fat" format supported by my.userland.com. see: http://my.userland.com/stories/storyReader$11 From r.hooft@euromail.net Wed Jun 23 15:51:11 1999 From: r.hooft@euromail.net (Rob Hooft) Date: Wed, 23 Jun 1999 16:51:11 +0200 (MZT) Subject: [XML-SIG] Bug in exception handling? Message-ID: <14192.62559.488258.872507@octopus.chem.uu.nl> I really have no clue where to start looking for the following problem: not well-formed Traceback (innermost last): File "/usr/local/nonius/app/scripts/comparehkl.py", line 459, in ? 
    reflections.Source(file1).SendTo(ref1.Reflection)
  File "/usr/local/nonius/app/interface/evaly.py", line 79, in SendTo
    parser.parseFile(projtls.myopen(self.filename,'r'))
  File "/usr/local/nonius/lib/python1.5/site-packages/xml/sax/drivers/drv_pyexpat.py", line 73, in parseFile
    self.__report_error()
  File "/usr/local/nonius/lib/python1.5/site-packages/xml/sax/drivers/drv_pyexpat.py", line 89, in __report_error
    self.err_handler.fatalError(saxlib.SAXParseException(msg,None,self))
  File "/usr/local/nonius/app/interface/evaly.py", line 20, in fatalError
    raise exception
xml.sax.saxlib.SAXParseException
zsh: segmentation fault comparehkl final.y

The routine that is causing this is:

def fatalError(self, exception):
    print exception.msg
    raise exception

How does this crash the python interpreter?

xml.sax.saxlib.SAXParseException
Program received signal SIGSEGV, Segmentation fault.
normal_updatePosition (enc=0x4020414c, ptr=0x4020c01c
, end=0x4020c180
, pos=0x81b26d8) at xmltok/xmltok_impl.c:1618 1618 switch (BYTE_TYPE(enc, ptr)) { (gdb) where #0 normal_updatePosition (enc=0x4020414c, ptr=0x4020c01c
, end=0x4020c180
, pos=0x81b26d8) at xmltok/xmltok_impl.c:1618 #1 0x401f0ffd in XML_GetCurrentLineNumber (parser=0x81b2590) at xmlparse/xmlparse.c:642 #2 0x401f028c in xmlparse_getattr (self=0x81b2420, name=0x810cd44 "ErrorLineNumber") at ./pyexpat.c:349 #3 0x806d60b in PyObject_GetAttrString (v=0x81b2420, name=0x810cd44 "ErrorLineNumber") at object.c:381 #4 0x806d729 in PyObject_GetAttr (v=0x81b2420, name=0x810cd30) at object.c:438 #5 0x80742ce in eval_code2 (co=0x819e6c8, globals=0x810eae8, locals=0x0, args=0x81057d8, argcount=1, kws=0x81057dc, kwcount=0, defs=0x0, defcount=0, owner=0x81afc38) at ceval.c:1380 #6 0x80748bd in eval_code2 (co=0x817a548, globals=0x8180060, locals=0x0, args=0x8176e8c, argcount=1, kws=0x8176e90, kwcount=0, defs=0x0, defcount=0, owner=0x81805e8) at ceval.c:1610 #7 0x80748bd in eval_code2 (co=0x817ab18, globals=0x8180060, locals=0x0, args=0x810d784, argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x81805e8) at ceval.c:1610 #8 0x8075d60 in call_function (func=0x8180650, arg=0x810d778, kw=0x0) at ceval.c:2481 #9 0x8075942 in PyEval_CallObjectWithKeywords (func=0x816a298, arg=0x0, kw=0x0) at ceval.c:2319 #10 0x806d33e in PyObject_Str (v=0x810e668) at object.c:260 #11 0x805bab3 in PyErr_PrintEx (set_sys_last_vars=1) at pythonrun.c:816 #12 0x805b646 in PyErr_Print () at pythonrun.c:667 #13 0x805b3cc in PyRun_SimpleFile (fp=0x8098578, filename=0xbffff924 "scripts/comparehkl.py") at pythonrun.c:572 #14 0x805b061 in PyRun_AnyFile (fp=0x8098578, filename=0xbffff924 "scripts/comparehkl.py") at pythonrun.c:450 #15 0x804ef11 in Py_Main (argc=4, argv=0xbffff7fc) at main.c:286 #16 0x804e9b2 in main (argc=4, argv=0xbffff7fc) at python.c:12 The worst is: if I use only the first 4886 lines of the file, the "not well-formed" error message correctly reports the problem in line 7, column 37 of the file, but if I include 4887 or more, I get the above core dump. The 4887 line file is 131053 bytes, just under 128kB? Can I do something to fix this? Regards, Rob Hooft. -- ===== R.Hooft@EuroMail.net http://www.xs4all.nl/~hooft/rob/ ===== ===== R&D, Nonius BV, Delft http://www.nonius.nl/ ===== ===== PGPid 0xFA19277D ========================== Use Linux! ========= From jack@oratrix.nl Wed Jun 23 21:10:33 1999 From: jack@oratrix.nl (Jack Jansen) Date: Wed, 23 Jun 1999 22:10:33 +0200 Subject: [XML-SIG] Bug in exception handling? In-Reply-To: Message by r.hooft@euromail.net (Rob Hooft) , Wed, 23 Jun 1999 16:51:11 +0200 (MZT) , <14192.62559.488258.872507@octopus.chem.uu.nl> Message-ID: <19990623201038.1D9CF126BC4@oratrix.oratrix.nl> Rob, my first guess would be a mismatch in the Python build: if pyexpat is compiled as a dynamic library it may have been linked against an older version of Python, or one of the "critical" build options (refcount debugging and such) was different. This can also happen in statically built Pythons, as the dependencies aren't fully specified. As a first try I would do a "make clean" and rebuild the world. If the problem persists my next guess would be a buffer overflow. The "address" 0x4020c11f looks rather too much like ascii for my liking. 
From larsga@ifi.uio.no Wed Jun 23 23:04:04 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 24 Jun 1999 00:04:04 +0200 Subject: [XML-SIG] More entity stuff In-Reply-To: <3770A002.541C20F2@netscape.com> References: <004401bebd46$e036ca00$1f2b2bc1@cadlab.it> <3770A002.541C20F2@netscape.com> Message-ID:

* Dan Libby
|
| Okay, so I have a DTD with a bunch of entities copied from the html 3.2
| dtd. They look like this:
|
|
|
| When this is run through xmlproc (xmlval), the entities are ignored.
| sort of.

This is a bug. I've seen it before, but thought I'd fixed it. The trouble is
that the entity is only one character long (after the character reference is
resolved) and that causes xmlproc to screw up for some reason. If you insert
a space in the declaration (before or after) the character reference the
problem goes away.

This turned out to be a rather subtle problem and finding a solution that
passed the regression test in a satisfying way took a while. The patches
below seem correct, though. Thanks for reporting this!

=== xmlproc.py
***************
*** 72,78 ****
  def do_parse(self):
      "Does the actual parsing."
      try:
!         while self.pos+1
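Dan's actual declarations did not survive in the archive above, but the
failing case Lars describes (a general entity whose replacement text is a
single character once the character reference is resolved) and the whitespace
workaround can be illustrated with a hypothetical declaration; the entity
name and code point below are made up for illustration and are not Dan's
originals.

    # Hypothetical DTD fragments illustrating the case Lars describes; these
    # are not Dan's original declarations, which were lost from the archive.
    broken = '<!ENTITY nbsp "&#160;">'       # one character after resolution: trips xmlproc
    workaround = '<!ENTITY nbsp " &#160;">'  # extra space keeps the replacement text longer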
References: <14192.62559.488258.872507@octopus.chem.uu.nl> <19990623201038.1D9CF126BC4@oratrix.oratrix.nl> Message-ID: <14194.5192.835922.637094@octopus.chem.uu.nl>

>>>>> "JJ" == Jack Jansen writes:

JJ> my first guess would be a mismatch in the Python build: if pyexpat is
JJ> compiled as a dynamic library it may have been linked against an older
JJ> version of Python, or one of the "critical" build options (refcount
JJ> debugging and such) was different.

It doesn't look like that.... What I (by accident) did find is that it has
something to do with Refcounting: The current code (drv_pyexpat) looks like:

    if not self.parser.Parse(fileobj.read(),1):
        self.__report_error()

If I replace that by

    buf=fileobj.read()
    if not self.parser.Parse(buf,1):
        self.__report_error()

The exception does not dump core.

The "by accident" I'm talking about is that I tried to eliminate the "sax"
layer from the code, because in the profile listing of a test parse, the top
routines were all in drv_pyexpat:

     21989    4.600    0.000    6.930    0.000 evaly.py:87(HandleReflection)
     21989    5.070    0.000    7.950    0.000 evaly.py:102(HandleEndReflection)
    117706    7.490    0.000    7.490    0.000 saxutils.py:86(__init__)
     21989    8.760    0.000   13.080    0.001 evaly.py:95(HandleIntensity)
     22733   10.130    0.000   16.400    0.001 evaly.py:90(HandleIndex)
    134166   12.920    0.000   12.920    0.000 saxutils.py:113(__getitem__)
    154259   14.020    0.000   14.020    0.000 evaly.py:55(characters)
    117706   14.190    0.000   22.140    0.000 evaly.py:63(endElement)
    117706   16.540    0.000   38.680    0.000 drv_pyexpat.py:45(endElement)
    117706   19.330    0.000   55.740    0.000 evaly.py:50(startElement)
    154259   28.090    0.000   42.110    0.000 drv_pyexpat.py:48(characters)
    117706   41.440    0.000  104.670    0.001 drv_pyexpat.py:38(startElement)
         1   47.530   47.530  232.990  232.990 drv_pyexpat.py:58(parseFile)

I think especially that:

    def startElement(self,name,attrs):
        at = {}
        for i in range(0, len(attrs), 2):
            at[attrs[i]] = attrs[i+1]

        self.doc_handler.startElement(name,saxutils.AttributeMap(at))

is very expensive, as I'm not normally using the attributes on most of the
elements. For me, a lazy version of AttributeMap would help a bit.

Bypassing sax altogether and using pyexpat directly reduces parsing time by
40%. 45 seconds on a "moderately sized" file (some of my clients have files
that are going to be 20 times bigger still, i.e. 60MB of XML) is still
considerably long, so I'll need to speed it up a bit more to make it really
usable.

Regards,

Rob Hooft.

-- ===== R.Hooft@EuroMail.net http://www.xs4all.nl/~hooft/rob/ ===== ===== R&D, Nonius BV, Delft http://www.nonius.nl/ ===== ===== PGPid 0xFA19277D ========================== Use Linux! =========

From fredrik@pythonware.com Thu Jun 24 13:10:49 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Thu, 24 Jun 1999 14:10:49 +0200 Subject: [XML-SIG] Bug in exception handling? References: <14192.62559.488258.872507@octopus.chem.uu.nl><19990623201038.1D9CF126BC4@oratrix.oratrix.nl> <14194.5192.835922.637094@octopus.chem.uu.nl> Message-ID: <000b01bebe3a$99e38d00$f29b12c2@secret.pythonware.com>

Rob Hooft wrote:

> Bypassing sax altogether and using pyexpat directly reduces parsing
> time by 40%. 45 seconds on a "moderately sized" file (some of my
> clients have files that are going to be 20 times bigger still,
> i.e. 60MB of XML) is still considerably long, so I'll need to speed it
> up a bit more to make it really usable.

with a little luck, you might be able to use sgmlop instead (it cannot handle
all possible XML constructs yet, but it might work on your material).

here's a simple benchmark, run on an old 200 MHz pentium box, under NT:

    > dir big.xml
    99-06-24  13:47        62 078 532 big.xml

    > python benchxml.py big.xml
    sgmlop/null parser: 8.567 seconds; 7246131 bytes per second
    sgmlop/dummy parser: 51.943 seconds; 1195134 bytes per second
    ^C

(didn't have time to wait for the standard xmllib implementation to
finish...)

in this test, the null parser defines no parser callbacks at all, so it
basically measures the time it takes sgmlop to read the file from disk, and
to split it into elements. the dummy parser defines all python callbacks as
empty methods. as you see, it's quite expensive to call Python methods from
C. if you're going to DO things with the data, things get even worse... (but
a few hundred kb's per second on a similar box should be no problem).

get your copy from: http://www.pythonware.com/madscientist/
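For anyone who wants to repeat the comparison on their own data: benchxml.py
itself is not reproduced in this thread, so the sketch below is only a guess
at the same idea. The sgmlop calls used here (XMLParser, register, feed,
close) and the callback names are assumptions about its interface rather than
anything quoted above; the real package is at the madscientist URL Fredrik
gives.

    # Rough sketch in the spirit of Fredrik's "dummy parser" measurement.
    # The sgmlop interface used here (XMLParser, register, feed, close) and
    # the callback names are assumptions, not taken from the messages above.
    import time
    import sgmlop

    class DummyTarget:
        # every callback is an empty method, so the time measured is parsing
        # plus the cost of calling into Python
        def finish_starttag(self, tag, attrs): pass
        def finish_endtag(self, tag): pass
        def handle_data(self, data): pass

    def bench(filename, target):
        data = open(filename, 'rb').read()
        parser = sgmlop.XMLParser()
        parser.register(target)
        t0 = time.time()
        parser.feed(data)
        parser.close()
        return time.time() - t0

    print "dummy parser: %.3f seconds" % bench('big.xml', DummyTarget())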
| The "by accident" I'm talking about is that I tried to eliminate the | "sax" layer from the code, because in the profile listing of a test | parse, the top routines were all in drv_pyexpat: This isn't as surprising as it might be. I think the best solution would be to have the drivers for expat and sgmlop be written entirely in C. | I think especially that: | | def startElement(self,name,attrs): | at = {} | for i in range(0, len(attrs), 2): | at[attrs[i]] = attrs[i+1] | | self.doc_handler.startElement(name,saxutils.AttributeMap(at)) | | is very expensive, as I'm not normally using the attributes on most of | the elements. For me, a lazy version of AttributeMap would help a bit. I had some spare time while waiting for my advisor now, so I wrote one up for you. It's been tested a little, but not 100%. It's at: If you want an even lazier driver you can use this one: class LazyExpatDriver(SAX_expat): def __init__(self): SAX_expat.__init__(self) self.map=LazyAttributeMap([]) def startElement(self,name,attrs): self.map.list=attrs self.doc_handler.startElement(name,self.map) Feedback on speed differences between these three drivers (original, the one on the web and the one in this post) would be interesting. --Lars M. From r.hooft@euromail.net Thu Jun 24 15:16:43 1999 From: r.hooft@euromail.net (Rob Hooft) Date: Thu, 24 Jun 1999 16:16:43 +0200 (MZT) Subject: [XML-SIG] Bug in exception handling? In-Reply-To: References: <14192.62559.488258.872507@octopus.chem.uu.nl> <19990623201038.1D9CF126BC4@oratrix.oratrix.nl> <14194.5192.835922.637094@octopus.chem.uu.nl> Message-ID: <14194.15819.566865.167990@octopus.chem.uu.nl> >>>>> "LMG" == Lars Marius Garshol writes: LMG> Feedback on speed differences between these three drivers (original, LMG> the one on the web and the one in this post) would be interesting. devel[445]cubic%% ls -l final.y -rw-r--r-- 1 hooft hooft 3963562 Jun 24 13:04 final.y My sax-less version: Reading reflection file... 43.05 seconds The original: Reading reflection file... 74.37 seconds The Web version (without activating the lazy code): Reading reflection file... 348.88 seconds Oops. something else changed? I made up a lazy version myself, using the old 0.10 version of the file, and a lazy map that is a bit less lazy than the one you made up. Reading reflection file... 71.87 seconds Conclusion: this is not the real problem, making up the dictionary is not so expensive in comparison with the rest of the SAX layer. For completeness, here is my added code: class LazyExpatDriver(SAX_expat): def startElement(self,name,attrs): self.doc_handler.startElement(name,LazyAttributeMap(attrs)) # --- A lazy attribute map # This avoids the costly conversion from a list to a hash table if the attribute # list is not needed anywhere. 
class LazyAttributeMap:
    """An implementation of AttributeList that takes a flat (attr, value)
    list and uses it to implement the AttributeList interface."""

    def __init__(self, list):
        self.lst=list
        self.map=None

    def _mkmap(self):
        self.map={}
        for i in range(0,len(self.lst),2):
            self.map[self.lst[i]]=self.lst[i+1]

    def getLength(self):
        return len(self.lst)/2

    def getName(self, i):
        if self.map is None: self._mkmap()
        try:
            return self.map.keys()[i]
        except IndexError,e:
            return None

    def getType(self, i):
        return "CDATA"

    def getValue(self, i):
        if self.map is None: self._mkmap()
        try:
            if type(i)==types.IntType:
                return self.map[self.getName(i)]
            else:
                return self.map[i]
        except KeyError,e:
            return None

    def __len__(self):
        return len(self.lst)/2

    def __getitem__(self, key):
        if self.map is None: self._mkmap()
        if type(key)==types.IntType:
            return self.map.keys()[key]
        else:
            return self.map[key]

    def items(self):
        if self.map is None: self._mkmap()
        return self.map.items()

    def keys(self):
        if self.map is None: self._mkmap()
        return self.map.keys()

    def has_key(self,key):
        if self.map is None: self._mkmap()
        return self.map.has_key(key)

    def get(self, key, alternative):
        """Return the value associated with attribute name; if it is not
        available, then return the alternative."""
        if self.map is None: self._mkmap()
        return self.map.get(key, alternative)

# ---

def create_parser():
    #return SAX_expat()
    return LazyExpatDriver()

From r.hooft@euromail.net Thu Jun 24 14:09:18 1999 From: r.hooft@euromail.net (Rob Hooft) Date: Thu, 24 Jun 1999 15:09:18 +0200 (MZT) Subject: [XML-SIG] Bug in exception handling? In-Reply-To: References: <14192.62559.488258.872507@octopus.chem.uu.nl> <19990623201038.1D9CF126BC4@oratrix.oratrix.nl> <14194.5192.835922.637094@octopus.chem.uu.nl> Message-ID: <14194.11774.643335.39150@octopus.chem.uu.nl>

>>>>> "LMG" == Lars Marius Garshol writes:

LMG> * Rob Hooft
LMG> |
LMG> | What I (by accident) did find is that it has something to do with
LMG> | Refcounting: The current code (drv_pyexpat) looks like:
LMG> |
LMG> |     if not self.parser.Parse(fileobj.read(),1):
LMG> |         self.__report_error()
LMG> |
LMG> | If I replace that by
LMG> |
LMG> |     buf=fileobj.read()
LMG> |     if not self.parser.Parse(buf,1):
LMG> |         self.__report_error()
LMG> |
LMG> | The exception does not dump core.

LMG> Aha! Thanks for this observation. I've checked your patch into my
LMG> driver source now, so it will be in the next release.

Sounds like a hack to me. Shouldn't it be solved by INCREF'ing the buffer
somewhere in the C code to pyexpat? E.g. where the exception code makes
reference to the buffer? I didn't look at the code myself, so I don't know
whether it is particularly difficult to find.

It would also be nice if the pyexpat parser would report the correct line
number for a problem even if the file is parsed in pieces (you may have
noticed that I was talking about 60MB files before, it is not really nice to
suck those into a single string).

Rob Hooft.

-- ===== R.Hooft@EuroMail.net http://www.xs4all.nl/~hooft/rob/ ===== ===== R&D, Nonius BV, Delft http://www.nonius.nl/ ===== ===== PGPid 0xFA19277D ========================== Use Linux! =========
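Parsing in pieces, as Rob suggests, only needs the Parse(data, isfinal) call
the driver already uses; whether the reported error line numbers stay correct
across chunks is exactly his open question. A rough sketch follows, assuming
the module's ParserCreate() constructor; the chunk size and the error
handling are illustrative, not taken from the thread.

    # Rough sketch, not from the thread: feed pyexpat a 60MB file in chunks
    # instead of one huge string. Only Parse(data, isfinal) is used, as in
    # the driver code quoted above; ParserCreate() is assumed, and chunk
    # size and error handling are illustrative.
    import pyexpat

    def parse_in_chunks(fileobj, chunksize=64*1024):
        parser = pyexpat.ParserCreate()
        # assign the document/element handlers to the parser here
        while 1:
            buf = fileobj.read(chunksize)
            if not buf:
                break
            if not parser.Parse(buf, 0):
                return parser    # caller can inspect ErrorLineNumber etc.
        parser.Parse('', 1)
        return parser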
From r.hooft@euromail.net Thu Jun 24 13:37:23 1999 From: r.hooft@euromail.net (Rob Hooft) Date: Thu, 24 Jun 1999 14:37:23 +0200 (MZT) Subject: [XML-SIG] Bug in exception handling? In-Reply-To: <000b01bebe3a$99e38d00$f29b12c2@secret.pythonware.com> References: <14192.62559.488258.872507@octopus.chem.uu.nl> <19990623201038.1D9CF126BC4@oratrix.oratrix.nl> <14194.5192.835922.637094@octopus.chem.uu.nl> <000b01bebe3a$99e38d00$f29b12c2@secret.pythonware.com> Message-ID: <14194.9859.228589.169512@octopus.chem.uu.nl>

>>>>> "FL" == Fredrik Lundh writes:

FL> Rob Hooft wrote:
>> Bypassing sax altogether and using pyexpat directly reduces parsing
>> time by 40%. 45 seconds on a "moderately sized" file (some of my
>> clients have files that are going to be 20 times bigger still,
>> i.e. 60MB of XML) is still considerably long, so I'll need to speed it
>> up a bit more to make it really usable.

FL> with a little luck, you might be able to use sgmlop instead
FL> (it cannot handle all possible XML constructs yet, but it
FL> might work on your material).

FL> here's a simple benchmark, run on an old 200 MHz pentium
FL> box, under NT:

>> dir big.xml
FL> 99-06-24  13:47        62 078 532 big.xml

>> python benchxml.py big.xml
FL> sgmlop/null parser: 8.567 seconds; 7246131 bytes per second
FL> sgmlop/dummy parser: 51.943 seconds; 1195134 bytes per second
FL> ^C

I'm using a 200MHz pentium as well, but I think the biggest problem is the
kind of data I'm handling. It is mostly numerical. We're still working on the
DTD, but I can show you a typical fragment:

... ...

I think a large part of my time with any parser will be spent in atof() and
atoi().... I'll try sgmlop as soon as I can.

Rob

-- ===== R.Hooft@EuroMail.net http://www.xs4all.nl/~hooft/rob/ ===== ===== R&D, Nonius BV, Delft http://www.nonius.nl/ ===== ===== PGPid 0xFA19277D ========================== Use Linux! =========

From paul@prescod.net Mon Jun 28 17:00:50 1999 From: paul@prescod.net (Paul Prescod) Date: Mon, 28 Jun 1999 12:00:50 -0400 Subject: [XML-SIG] [Fwd: Re: parsers for Palm?] Message-ID: <37779C32.780A9134@prescod.net>

> Expat 1.1 added a compile-time option to allow a smaller (and slightly
> slower) parser. With this option on Win32 it compiles into a single DLL
> that compresses to 23k. Is that too large for Palm?
>
> James

Wow. I didn't notice that Expat was so small now.

I think that we should certainly move for Python 1.6 to include eXpat and
easysax. At compile time, Unix Python users could choose whether they want
small or fast. For Windows we could just make both DLLs available (though
only the small one would be built into the distribution).

23K for something as significant as massively-accelerated XML seems like a
small price. Note that this 23k includes full Unicode support and is
completely ANSI C, just like Python. Also, I understand that it now supports
internal and external, general and parameter entities. In other words, almost
everything except validation!

Opinions?

Paul Prescod
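Rob's remark a couple of messages up about atof() and atoi() refers to the
conversion of mostly-numeric character data after parsing, a cost that stays
the same whichever parser delivers the text. Below is a rough sketch of such
a handler; the SAX 1.0 style characters(ch, start, length) signature and the
saxlib.DocumentHandler base class are assumptions, and the element name
'intensity' is made up, since Rob's real DTD fragment is not in the archive.

    # Rough sketch of the kind of handler Rob describes: the document is
    # mostly numbers, so much of the work is string-to-float conversion.
    # The element name 'intensity' is hypothetical, and DocumentHandler and
    # the characters() signature are assumed from the SAX 1.0 interface.
    import string
    from xml.sax import saxlib

    class ReflectionHandler(saxlib.DocumentHandler):
        def __init__(self):
            self.chars = []
            self.values = []

        def characters(self, ch, start, length):
            self.chars.append(ch[start:start+length])

        def endElement(self, name):
            if name == 'intensity':
                # this is where the atof() time goes with any parser
                self.values.append(string.atof(string.join(self.chars, '')))
            self.chars = []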