From Jacco.van.Ossenbruggen@cwi.nl Tue Jun 1 15:02:36 1999 From: Jacco.van.Ossenbruggen@cwi.nl (J.R. van Ossenbruggen) Date: Tue, 01 Jun 1999 16:02:36 +0200 Subject: [XML-SIG] patch to xml/dom/esis_builder.py Message-ID: Hi all, I use the xml package in a mixed SGML/XML environment. I directly process XML but use SP to convert SGML to ESIS first. This works fine, except for two minor modifications I needed to make to esis_builder.py. The first modification prevents a crash on "#IMPLIED" attributes, in which case the current version fails to notice that the third argument (the value of the attr) is missing. The second modification provides an optional argument to EsisBuilder.__init__, which allows me to pass string.lower or string.upper functions to do the necessary case conversions (SGML names are not case sensitive and converted to uppercase by SP, XML names are not changed when converted to ESIS). I include the relevant patch below. I think the changes could be useful for more people, and as far as I know,do not break any existing code. I'd be grateful if they are included in the main distribution. Let me know what you think, Jacco --- Index: esis_builder.py =================================================================== RCS file: /projects/cvsroot/xml/dom/esis_builder.py,v retrieving revision 1.5 diff -c -r1.5 esis_builder.py *** esis_builder.py 1999/03/18 12:38:28 1.5 --- esis_builder.py 1999/06/01 12:09:09 *************** *** 27,37 **** class EsisBuilder(Builder): ! def __init__(self): Builder.__init__(self) self.attr_store = {} self.id_store = {} #self.sdata_handler = handle_sdata def feed(self, data): for line in string.split(data, '\n'): --- 27,39 ---- class EsisBuilder(Builder): ! def __init__(self, convert=lambda x:x): Builder.__init__(self) self.attr_store = {} self.id_store = {} #self.sdata_handler = handle_sdata + # convert may, for example, be used to handle case conversion + self.convert = convert def feed(self, data): for line in string.split(data, '\n'): *************** *** 41,46 **** --- 43,49 ---- text = line[1:] if event == '(': + text = self.convert(text) element = self.document.createElement(text, self.attr_store) self.attr_store = {} self.push(element) *************** *** 50,57 **** elif event == 'A': l = re.split(' ', text, 2) ! name = l[0] ! value = ESISDecode(l[2]) self.attr_store[name] = value elif event == '-': --- 53,64 ---- elif event == 'A': l = re.split(' ', text, 2) ! name = self.convert(l[0]) ! if l[1] == 'IMPLIED': ! # fix this. Needs to be undefined attr ! value = '' ! else: ! value = ESISDecode(l[2]) self.attr_store[name] = value elif event == '-': From Fred L. Drake, Jr." References: Message-ID: <14164.10354.491850.334248@weyr.cnri.reston.va.us> J.R. van Ossenbruggen writes: > I include the relevant patch below. I think the changes could be > useful for more people, and as far as I know,do not break any existing > code. I'd be grateful if they are included in the main distribution. I like this, but have two changes. First, the default convert function could be str instead of a lambda; this would be faster since str() is implemented in C (or Java in JPython). The second change concerns this part of the patch: > ! name = self.convert(l[0]) > ! if l[1] == 'IMPLIED': > ! # fix this. Needs to be undefined attr > ! value = '' > ! else: > ! 
value = ESISDecode(l[2]) > self.attr_store[name] = value This could be something like this: if l[1] != 'IMPLIED': self.attr_store[self.convert(l[0])] = ESISDecode(l[2]) This does just as much as needed, and doesn't create the bogus attribute entry in the dictionary. -Fred -- Fred L. Drake, Jr. Corporation for National Research Initiatives From Fred L. Drake, Jr." In the spirit of making changes to the DOM builder, here's something I've played with a little. When I started working on the conversion of the Python documentation to SGML/XML, I was building DOM objects that weren't legal: the LaTeX files don't map to hierarchical structures, even if you ignore the document preamble. While the documents themselves can be treated as hierachical in this specific case, that's not the case for individual files, which is the level I want to work at. When Andrew fixed the Document class to be less forgiving, I had to change the way I used it, building the more reasonable DocumentFragment objects instead. I'm driving the whole conversion across ESIS streams, so I wanted the ESIS builder to be able to build fragments instead of documents. I've been using a custom subclass that added the needed functionality for this (and some other stuff), but this would probably be very useful for others doing conversion processes. I think the appended patch would be useful for others. It adds a method to xml.dom.builder.Builder called buildFragment(); it has to be called before document construction starts and causes a fragment to be built instead. The fragment can be found as the "fragment" attribute of the builder or the return value of buildFragment(). Does this make sense as part of the base class? Or is this too special a situation? -Fred -- Fred L. Drake, Jr. Corporation for National Research Initiatives diff -c -r1.10 builder.py *** builder.py 1999/03/18 12:38:28 1.10 --- builder.py 1999/06/01 18:58:29 *************** *** 15,22 **** --- 15,31 ---- def __init__(self): self.document = createDocument() + self.fragment = None + self.target = self.document self.current_element = None + def buildFragment(self): + if self.fragment or len(self.document.childNodes): + raise RuntimeError, \ + "cannot build fragment once document has been started" + self.fragment = self.document.createDocumentFragment() + self.target = self.fragment + return self.fragment def push(self, node): "Add node to current node and move to new node." *************** *** 24,35 **** nodetype = node.get_nodeType() if self.current_element: self.current_element.insertBefore(node, None) ! elif nodetype in _LEGAL_DOCUMENT_CHILDREN: if nodetype == TEXT_NODE: if string.strip(node.get_nodeValue()) != "": ! self.document.appendChild(node) else: ! self.document.appendChild(node) if nodetype == ELEMENT_NODE: self.current_element = node --- 33,44 ---- nodetype = node.get_nodeType() if self.current_element: self.current_element.insertBefore(node, None) ! elif self.fragment or nodetype in _LEGAL_DOCUMENT_CHILDREN: if nodetype == TEXT_NODE: if string.strip(node.get_nodeValue()) != "": ! self.target.appendChild(node) else: ! self.target.appendChild(node) if nodetype == ELEMENT_NODE: self.current_element = node From jkraai@murl.com Wed Jun 2 05:14:09 1999 From: jkraai@murl.com (jkraai) Date: Wed, 02 Jun 1999 04:14:09 +0000 Subject: [XML-SIG] XML -> DTD lib? References: <14164.11654.684418.724401@weyr.cnri.reston.va.us> Message-ID: <3754AF91.38CFDE12@murl.com> Anyone have a DTD generator? 
I feel like I should not need such a thing, the DTD should have been written and it shouldn't have to be reverse-engineered. What I'd like to do is to give my users the ability to describe a record, then calculate a DTD for that record. This would be a great exercise for me to better understand XML, but if the code already exists ... Thanks for such great code everybody, --jim From danda@netscape.com Wed Jun 2 09:05:13 1999 From: danda@netscape.com (Dan Libby) Date: Wed, 02 Jun 1999 01:05:13 -0700 Subject: [XML-SIG] Re: RSS and stuff Message-ID: <3754E5B9.96A9FD54@netscape.com> Lars, glad to see that others are using the format, even if it is "too simple". ;-) I'm sure you'll be glad to hear that we are doing our validation with python and the excellent XML libraries you all have contributed to. FYI, the current validator is very specific. It understands the "0.9" format intimately at the code level. However, in my spare time I've been working on a generic validator that will read in a schema file (of my own devise, not a real XML schema) that's written in XML, and then validate a document based on that. That way, format changes should be simple to implement, at least from a validation standpoint. Hopefully I can get it installed soon, and possibly even distribute the source, such as it is. (This is my first Python + first DOM coding project). This seems like a pretty obvious thing to me, I'm surprised that XML has gotten as far as it has without real support for enforcing data types, lengths, ranges, etc. > I sat down yesterday and had a look at RSS, a format for news > headlines which is used by Slashdot, mozilla.org and Scripting News, > among others. It was very simple (a bit too simple, in fact), so I sat > down and made a simple RSS library and client in Python. This client > produces a web page when it is run. (I run it from cron.) > What would you like to see / not see in the format? It really is just supposed to be a summary. Ideally, we would like to support all of Dublin Core eventually, but the problem is that the additional data may not actually be used, and marketing folks felt it would be simpler to not confuse folks too much. > The 'specification' and lists of providers can be found at: > > (warning: the RSS guide is not very accurate technically) > What in particular did you find that was inaccurate? I agree it is not very technical, as it is aimed at a pretty general audience, however, it should be pretty accurate. This brings me to another question. Do you all believe it is the "right thing" to publish a DTD for a format, even if the DTD by itself is not sufficient to validate the document? In other words, an XML editor application referencing the DTD would allow the user to construct a document that is non-valid with regards to our rules. It seems to me that the DTD then becomes something of a distraction, because compliance with it, by itself, is not much more useful than well-formedness, from a validation point of view. -dan From larsga@ifi.uio.no Wed Jun 2 09:52:06 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 02 Jun 1999 10:52:06 +0200 Subject: [XML-SIG] Re: RSS and stuff In-Reply-To: <3754E5B9.96A9FD54@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> Message-ID: Hi Dan, * Dan Libby | | I'm sure you'll be glad to hear that we are doing our validation | with python and the excellent XML libraries you all have contributed | to. I certainly am glad to hear that! 
I'm also glad to see actual live Netscape representatives in a public forum, since I've been wanting to discuss RSS with you. And although I say a lot of negative stuff about RSS below I'd like to congratulate you on the first successful global XML web application. There are so many RSS documents on the web now, and quite a bit of software, so I don't think there's any question that this honour belongs to you. I also don't think there's any question that a major part of the reason is that RSS is so simple. I use my RSS client every day now and am very happy with it. I just wish everyone whose pages I'm interested in would provide RSS feeds, and I will probably start asking for it pretty soon. | FYI, the current validator is very specific. It understands the | "0.9" format intimately at the code level. This is definitely a good idea. Sadly, though, many of the RSS files on the net are not even well-formed. The ones for WebMonkey and python.org spring to mind. | However, in my spare time I've been working on a generic validator | that will read in a schema file (of my own devise, not a real XML | schema) that's written in XML, and then validate a document based on | that. Hmmm. Why not use a real XML schema? It should support everything I can imagine you would want anyway. Or is it too complex? | Hopefully I can get it installed soon, and possibly even distribute | the source, such as it is. (This is my first Python + first DOM | coding project). It would be great if you did. | This seems like a pretty obvious thing to me, I'm surprised that XML | has gotten as far as it has without real support for enforcing data | types, lengths, ranges, etc. I can just hear the functional programming freaks (Standard ML, Haskell and all that) say the same thing about Python. :-) Seriously, these things aren't as important as many people think. And it's also worth remembering that XML comes from a document background where such things are not all that relevant. (Imagine trying to do this for HTML. Actually enforcing correct use of DFN, H1-H6, ABBR, ACRONYM, VAR, ADDRESS and all the other elements would require a serious number of years of AI development in Prolog or Common Lisp.) | What would you like to see / not see in the format? It really is | just supposed to be a summary. The first thing I'd like to see is a date element for items. Many RSS providers currently use something like: (19990602) New foo! ... and it would be useful to formalize that as: 19990602 ... The second thing is descriptions for items. I'm thinking of providing an RSS feed for my home page, and when I do I know I will want to be able to have entries like: <item> <date>19990602</date> <title>RSS feed available! I now provide an RSS feed which lists all updates to my home page. This will hopefully make it easier for people A third thing is a place to put the email address of the maintainer so that I know where to complain when a document isn't well-formed. There's probably more as well, which I'll think of the moment I send this. If you want discussion about what RSS should and shouldn't contain I'd recommend you to try to start it here or over at xml-dev. (I know Dave Winer has a lot of ideas for it | Ideally, we would like to support all of Dublin Core eventually, but | the problem is that the additional data may not actually be used, | and marketing folks felt it would be simpler to not confuse folks | too much. 
I came to pretty much the same conclusion with XSA (see below) and then discovered that the difficult stuff was needed anyway. But I still think this is the right way to go: - make a simple version and put it out - wait for widespread acceptance and lots of implementations - then add all the difficult stuff and make it optional (In your case: why not make a CGI wizard like I did with XSA, and add a link from the RSS guide to the more fancy options?) In any case, this isn't a new idea, since this is exactly what C, Unix and C++ have done (to some extent also SAX and XML) and it seems to work better than the opposite approach, favoured by many little-known technologies (such as SGML). * Lars Marius Garshol | | (warning: the RSS guide is not very accurate technically) * Dan Libby | | What in particular did you find that was inaccurate? Here's a quick list: - The guide says: "Name your file using the .rdf suffix, unless you are generating your file dynamically using a .cgi or other program. Netscape recommends the use of the .rdf filename suffix, but does not require it." Well, on the web it's the MIME type that counts, so the guide should give the correct MIME type and then some hints on how to get it right. The suffix is just an ugly trick to get the right MIME type on correctly configured servers. - "RSS 0.9 supports the full ASCII character set, as well as all legal decimal and HTML entities. RSS 0.9 does not support other types of character data, such as UTF-8. For a list of legal HTML and decimal entities, refer to Special Symbols and Entities on DevEdge, Netscape's information resource for developers." Well, XML uses Unicode, but I suppose applications can be more restrictive. However, you cannot use HTML entities in XML without declaring them, and since there is no RSS DTD any RSS file that uses an HTML entity is not well-formed. - '' If you use US-ASCII you might as well declare that you're doing so with an encoding declaration. (Parsers may then complain if you don't conform to your own declaration.) - Also, what's the relationship with RDF? RSS uses the RDF root element, but does not conform to the RDF syntax or actually use anything meaningful from RDF. | This brings me to another question. Do you all believe it is the | "right thing" to publish a DTD for a format, even if the DTD by | itself is not sufficient to validate the document? Yes! A DTD is useful in that it allows you to do at least some validation, and it's also very useful as a statement of intent (that is, as documentation). For example, when reading the RSS guide it's impossible to tell whether one or more textinput elements are allowed and where they are allowed. The same goes for the image element. This is the RSS DTD I currently have in my CVS tree. However, I have no idea whether it's correct or not. For example, I've seen userland.com use the image element as a special kind of item, so maybe the rdf:RDF element should have (channel, (image | item)+, textinput?). | In other words, an XML editor application referencing the DTD would | allow the user to construct a document that is non-valid with | regards to our rules. It seems to me that the DTD then becomes | something of a distraction, because compliance with it, by itself, | is not much more useful than well-formedness, from a validation | point of view. It's useful in that it provides more information for content providers and software developers, and in that it's 100% unambiguous. 
It's also useful for you when doing validation with custom-written tools, since you won't have to worry about where elements occur. I've done exactly the same for XSA and have exactly the same problem as you. I provided a DTD and have special validating software that rides on top of a validator (xmlproc). If I were to do it again there's no question that I would do the same thing. So far there has been no confusion at all (although I've seen HTML users become confused by this). See for more info. --Lars M. From michel.plu@cnet.francetelecom.fr Wed Jun 2 10:06:56 1999 From: michel.plu@cnet.francetelecom.fr (PLU Michel CNET/DSM/LAN) Date: Wed, 2 Jun 1999 11:06:56 +0200 Subject: [XML-SIG] accessing to xml doc element Message-ID: Is there a python xml code for accessing element of an HTML or XML document in a way like myDocument.documentElement.body[0].center[0].table[2].tr[0].td[0].table[1].t r[0].td[1].font[0].h4[0].pcdata[0] the idea is to use the dom interface of a document and to define the __getattr__ (tag) method of the Node class in the xml.dom.core module in order to calls the method getElementByTagName(tag). Unfortunetly this method is already defined and personally modifying it is not a clean solution any idea ? Michel Ps: Please reply me directly since i did not subscribe to the mailing list From larsga@ifi.uio.no Wed Jun 2 10:57:32 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 02 Jun 1999 11:57:32 +0200 Subject: [XML-SIG] accessing to xml doc element In-Reply-To: References: Message-ID: * PLU Michel | | Is there a python xml code for accessing element of an HTML or XML document | in a way like | | myDocument.documentElement.body[0].center[0].table[2].tr[0].td[0].table[1].t | r[0].td[1].font[0].h4[0].pcdata[0] XPointers can do this. My PyPointers does this, but is currently not updated to the latest PyDOM version. However, 4DOM comes with a version that works with 4DOM. I plan to update PyPointers, but with SAX2, easySAX, JPython SAX and my thesis hanging over me there isn't that much time for it... --Lars M. From Jacco.van.Ossenbruggen@cwi.nl Wed Jun 2 15:05:02 1999 From: Jacco.van.Ossenbruggen@cwi.nl (J.R. van Ossenbruggen) Date: Wed, 02 Jun 1999 16:05:02 +0200 Subject: [XML-SIG] patch to xml/dom/esis_builder.py In-Reply-To: Your message of "Tue, 01 Jun 1999 14:37:38 MET DST." <14164.10354.491850.334248@weyr.cnri.reston.va.us> Message-ID: On Tue, Jun 1 1999 "Fred L. Drake" wrote: > I like this, but have two changes. First, the default convert > function could be str instead of a lambda; this would be faster since > str() is implemented in C (or Java in JPython). Agreed. > The second change concerns this part of the patch: > > > ! name = self.convert(l[0]) > > ! if l[1] == 'IMPLIED': > > ! # fix this. Needs to be undefined attr > > ! value = '' > > ! else: > > ! value = ESISDecode(l[2]) > > self.attr_store[name] = value > > This could be something like this: > > if l[1] != 'IMPLIED': > self.attr_store[self.convert(l[0])] = ESISDecode(l[2]) > > This does just as much as needed, and doesn't create the bogus > attribute entry in the dictionary. You're right again. I was under the impression #IMPLIED attributes should create a bogus attribute with specified=false. I just reread the spec to see that this impression was false. Thanks a lot! 
Jacco PS: a new version of the patch with Fred's changes: Index: esis_builder.py =================================================================== RCS file: /projects/cvsroot/xml/dom/esis_builder.py,v retrieving revision 1.5 diff -c -r1.5 esis_builder.py *** esis_builder.py 1999/03/18 12:38:28 1.5 --- esis_builder.py 1999/06/02 13:04:26 *************** *** 27,37 **** class EsisBuilder(Builder): ! def __init__(self): Builder.__init__(self) self.attr_store = {} self.id_store = {} #self.sdata_handler = handle_sdata def feed(self, data): for line in string.split(data, '\n'): --- 27,39 ---- class EsisBuilder(Builder): ! def __init__(self, convert=str): Builder.__init__(self) self.attr_store = {} self.id_store = {} #self.sdata_handler = handle_sdata + # convert may, for example, be used to handle case conversion + self.convert = convert def feed(self, data): for line in string.split(data, '\n'): *************** *** 41,46 **** --- 43,49 ---- text = line[1:] if event == '(': + text = self.convert(text) element = self.document.createElement(text, self.attr_store) self.attr_store = {} self.push(element) *************** *** 50,58 **** elif event == 'A': l = re.split(' ', text, 2) ! name = l[0] ! value = ESISDecode(l[2]) ! self.attr_store[name] = value elif event == '-': text = self.document.createText(ESISDecode(text)) --- 53,61 ---- elif event == 'A': l = re.split(' ', text, 2) ! name = self.convert(l[0]) ! if l[1] != 'IMPLIED': ! self.attr_store[self.convert(l[0])] = ESISDecode(l[2]) elif event == '-': text = self.document.createText(ESISDecode(text)) From danda@netscape.com Thu Jun 3 00:14:23 1999 From: danda@netscape.com (Dan Libby) Date: Wed, 02 Jun 1999 16:14:23 -0700 Subject: [XML-SIG] Re: RSS and stuff References: <3754E5B9.96A9FD54@netscape.com> Message-ID: <3755BACF.BDC54B15@netscape.com> This is a multi-part message in MIME format. --------------1E252B7D3F134ACBF841820B Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Lars, Thanks for your response. I have forwarded it to others here who are involved with RSS. Below are my responses. > This is definitely a good idea. Sadly, though, many of the RSS files > on the net are not even well-formed. The ones for WebMonkey and > python.org spring to mind. > I assume you mean they are not well-formed because they embed entities? > | However, in my spare time I've been working on a generic validator > | that will read in a schema file (of my own devise, not a real XML > | schema) that's written in XML, and then validate a document based on > | that. > > Hmmm. Why not use a real XML schema? It should support everything I > can imagine you would want anyway. Or is it too complex? 1) It's a spec. A very complex spec. I don't know of any software that implements it. I don't have time to write such software, given our development schedules which are measured in days. I just want something that is flexible enough that we can change our format without having to write a bunch of new code. When XML schemas are well supported, then we should be able to move to those quite easily, provided they have a superset of our functionality. Besides, if I tried, I would probably end up with something that is close to XML schemas, but not exact, so then we have unexpected behavior, etc. This way, it is obviously not an xml schema, just "Dan's validation rules" DTD. 2) I may have just missed it, but I didn't see any support for limiting length of strings. 3) The time support is IS0 8601 only, which is itself a very complicated subject. 
(aside: anyone know of a python module to parse dates according to 8601?). I would like to see support for unix/c style integer timestamps (seconds since 1970 UNC, as returned by time() ). We tend to use these a lot. Also for unix/c style date string as returned by `date`. eg: Sun May 30 19:24:15 PDT 1999. I already forwarded this request to the xml schema folks. > Seriously, these things aren't as important as many people think. And > it's also worth remembering that XML comes from a document background > where such things are not all that relevant. (Imagine trying to do > this for HTML. Actually enforcing correct use of DFN, H1-H6, ABBR, > ACRONYM, VAR, ADDRESS and all the other elements would require a > serious number of years of AI development in Prolog or Common Lisp.) > They are important to us. We need to store this stuff in a database. We need to make sure some joker hasn't given us a string that is 20 megabytes long, and further that we won't be putting HTML into our generated page that breaks the entire page. We also need to be able to tell end-users (webmasters) whether the data they have given us will actually be displayed correctly or not. I think that as XML becomes used for data transfer, as opposed to document transfer, people will be more and more concerned about this. E-commerce especially is going to require a very specific set of enforceable rules for validity. For some reason, people tend to become very upset when money is involved. ;-) > > | What would you like to see / not see in the format? It really is > | just supposed to be a summary. > > The first thing I'd like to see is a date element for items. Many RSS > providers currently use something like: > > > (19990602) New foo! > ... > > and it would be useful to formalize that as: > > > 19990602 > ... > Agreed. I had this in the original spec, but was removed for public release, since we were not actually going to use the value. What do you think of <timestamp> (seconds since 1970) </timestamp> instead? Again, I'm not fond of parsing IS0 8601. > The second thing is descriptions for items. I'm thinking of providing > an RSS feed for my home page, and when I do I know I will want to be > able to have entries like: > > <item> > <date>19990602</date> > <title>RSS feed available! > I now provide an RSS feed which lists all updates to > my home page. This will hopefully make it easier for people > This should be possible. Again, we didn't support stuff like this originally, because will not actually use the data in the "description" tag anywhere on My Netscape, and because our (old) validator code had to know about description rules for each location it is used. As others are now using the format, I can see where it would make sense, and it should be easy to add this as an optional element if I can convince people to use my new validation code. > A third thing is a place to put the email address of the maintainer so > that I know where to complain when a document isn't well-formed. > hmm. I assume you think this should be inside the tag? This is where would be nice... > - "RSS 0.9 supports the full ASCII character set, as well as all > legal decimal and HTML entities. RSS 0.9 does not support other > types of character data, such as UTF-8. For a list of legal HTML and > decimal entities, refer to Special Symbols and Entities on DevEdge, > Netscape's information resource for developers." > We are updating this to support UTF-8 soon, and possibly other encodings. I promise to post a DTD soon. 
;-) > - Also, what's the relationship with RDF? RSS uses the RDF root > element, but does not conform to the RDF syntax or actually use > anything meaningful from RDF. This boils down to internal politics. If you click on the "Future Directions" link in the quickstart (http://my.netscape.com/publish/help/futures.html), I have an example of the original RSS format I came up with, which does make meaningful use of RDF (channels have IDs, all nodes connect, dublin core is used, etc.) However, apparently this "overly complicated". There are other technical reasons I can't really go into. Anyway, for now, RSS is basically an XML format, and it may eventually have an RDF superset. [regarding posted RSS DTD] Thanks. I'll take a look at this, run it through a validating parser, etc. Do you mind if we post it, or a slightly modified version, as the "official" DTD? > This implies ordering, correct? ie, title, then description, then link? A problem I had with DTDs is that I couldn't figure out how to say that an element is required, and that ordering is unimportant. Therefore, if I posted this DTD now, it would mean that a whole bunch of existing channels are invalid. The other option is to use (title | description | link), but this means that they are optional, which is even less correct. > I've done exactly the same for XSA and have exactly the same problem > as you. I provided a DTD and have special validating software that > rides on top of a validator (xmlproc). If I were to do it again > there's no question that I would do the same thing. So far there has > been no confusion at all (although I've seen HTML users become > confused by this). What is this special validating software? Is it generic, or does it know specifically about your format? If generic, what do you use as input to define the validaton rules? My apologies if this is all explained in detail somwhere... ;-) -dan --------------1E252B7D3F134ACBF841820B Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------1E252B7D3F134ACBF841820B-- From wunder@infoseek.com Thu Jun 3 01:04:20 1999 From: wunder@infoseek.com (Walter Underwood) Date: Wed, 02 Jun 1999 17:04:20 -0700 Subject: [XML-SIG] Re: RSS and stuff In-Reply-To: <3755BACF.BDC54B15@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> Message-ID: <3.0.5.32.19990602170420.00a96990@corp> At 04:14 PM 6/2/99 -0700, Dan Libby wrote: > >3) The time support is IS0 8601 only, which is itself a very complicated subject. >(aside: anyone know of a python module to parse dates according to 8601?). I >would like to see support for unix/c style integer timestamps (seconds since 1970 >UNC, as returned by time() ). It's not that bad. Insist on the web profile of ISO 8601 and there are only five formats. Do an sscanf or re.match for each format, and when one converts, do time.mktime() with what you've just parsed. And let's try to avoid using seconds-since-the-epoch in external formats. We're just now doing the Y2K thing, so I don't think it is a good idea to use formats that fall apart in 2037. 
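A rough sketch of that match-each-format approach, for illustration only: the pattern list below is a simplification of the web profile (not the official list), and mktime() interprets the parsed fields as local time, so a real version would still have to handle the zone designator.

import re
import time

# Simplified approximation of the web-profile date formats.
_PATTERNS = [
    r'(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$',
    r'(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d)Z$',
    r'(\d{4})-(\d\d)-(\d\d)$',
    r'(\d{4})-(\d\d)$',
    r'(\d{4})$',
]

def parse_iso8601(s):
    "Return a Unix timestamp for an ISO 8601 date string, or None."
    for pattern in _PATTERNS:
        m = re.match(pattern, s)
        if m:
            fields = [1970, 1, 1, 0, 0, 0]
            groups = m.groups()
            for i in range(len(groups)):
                fields[i] = int(groups[i])
            # mktime() takes a 9-tuple; weekday and yearday are ignored,
            # dst of -1 means "unknown".
            return time.mktime(tuple(fields) + (0, 0, -1))
    return None
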
Here is what it takes to convert a Unix timestamp to ISO 8601: def make_date(timeint): return time.strftime('%Y-%m-%dT%H:%M:%SZ',time.gmtime(timeint)) wunder -- Walter R. Underwood wunder@infoseek.com wunder@best.com (home) http://software.infoseek.com/cce/ (my product) http://www.best.com/~wunder/ 1-408-543-6946 From danda@netscape.com Thu Jun 3 02:04:22 1999 From: danda@netscape.com (Dan Libby) Date: Wed, 02 Jun 1999 18:04:22 -0700 Subject: [XML-SIG] Re: RSS and stuff References: <3754E5B9.96A9FD54@netscape.com> <3.0.5.32.19990602170420.00a96990@corp> Message-ID: <3755D495.112CA66F@netscape.com> This is a multi-part message in MIME format. --------------630938801F57E97C1046FD98 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Walter, thanks for the code example. > And let's try to avoid using seconds-since-the-epoch in external > formats. We're just now doing the Y2K thing, so I don't think it > is a good idea to use formats that fall apart in 2037. I thought it was 2038. ;-) Seems like we should all be using long longs by then - greater than 32 bits anyway, so I'm not sure it is such a big problem. Anyway, the nice thing about the integer is that they are guaranteed accurate to the second. With ISO 8601, the receiver needs to round (nearest day, hour, minute, second). Besides, if unix breaks then, people are gonna have bigger worries than RSS displaying 1970. > Here is what it takes to convert a Unix timestamp to ISO 8601: > > def make_date(timeint): > return time.strftime('%Y-%m-%dT%H:%M:%SZ',time.gmtime(timeint)) > Right, but my thinking is that it is easier for people if we just support it natively than if they have to figure out how to do that in their sed script or whatever. --------------630938801F57E97C1046FD98 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------630938801F57E97C1046FD98-- From akuchlin@mems-exchange.org Thu Jun 3 14:14:37 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Thu, 3 Jun 1999 09:14:37 -0400 (EDT) Subject: [XML-SIG] Re: RSS and stuff In-Reply-To: <3755BACF.BDC54B15@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> Message-ID: <14166.32701.896825.371306@amarok.cnri.reston.va.us> Dan Libby writes: >(aside: anyone know of a python module to parse dates according to 8601?). The XML package contains xml.utils.iso8601.py, contributed by Fred Drake. -- A.M. Kuchling http://starship.python.net/crew/amk/ America is a country that doesn't know where it is going but is determined to set a speed record getting there. -- Laurence J. Peter From wunder@infoseek.com Fri Jun 4 16:49:44 1999 From: wunder@infoseek.com (Walter Underwood) Date: Fri, 04 Jun 1999 08:49:44 -0700 Subject: [XML-SIG] Re: RSS and stuff In-Reply-To: <3755D495.112CA66F@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <3.0.5.32.19990602170420.00a96990@corp> Message-ID: <3.0.5.32.19990604084944.00aa0340@corp> At 06:04 PM 6/2/99 -0700, Dan Libby wrote: >Walter, thanks for the code example. > >> And let's try to avoid using seconds-since-the-epoch in external >> formats. 
We're just now doing the Y2K thing, so I don't think it >> is a good idea to use formats that fall apart in 2037. > >I thought it was 2038. ;-) Seems like we should all be using >long longs by then - greater than 32 bits anyway, so I'm not sure >it is such a big problem. We've been parsing dates for date search in our engine, and the Unix timestamp has real problems. No time zone, for example. >Anyway, the nice thing about the integer is that they are guaranteed >accurate to the second. With ISO 8601, the receiver needs to round >(nearest day, hour, minute, second). With the timestamp, does the number of seconds include all the leap seconds since 1970? It should, but does it? Does Apache on Amiga do the right thing? To be pedantic, the Unix timestamp format is precise but may not be accurate. Lots of content has a meaningful precision other than one second. Press Releases are on a certain day. Books are published in a particular month. Forcing meaningless precision on those things is a mistake. Finally, the seconds thing totally falls apart if you need to express dates outside it's tiny range: photograph taken in 1893, an HP atomic clock app note written in 1964, etc. Internally, the right way to handle this is to carry a precision along with the time. DCE has some routines to do this. The DCE Time Services Spec is listed here, but it's not free: http://www.opengroup.org/public/pubs/catalog/c310.htm I'll see if I can hunt down some non-pay man pages. wunder -- Walter R. Underwood wunder@infoseek.com wunder@best.com (home) http://software.infoseek.com/cce/ (my product) http://www.best.com/~wunder/ 1-408-543-6946 From danda@netscape.com Fri Jun 4 22:32:23 1999 From: danda@netscape.com (Dan Libby) Date: Fri, 04 Jun 1999 14:32:23 -0700 Subject: [XML-SIG] Re: RSS and stuff References: <3754E5B9.96A9FD54@netscape.com> <3.0.5.32.19990602170420.00a96990@corp> <3.0.5.32.19990604084944.00aa0340@corp> Message-ID: <375845E7.3D85145@netscape.com> This is a multi-part message in MIME format. --------------EE201692D10CF2927E688DF2 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit > We've been parsing dates for date search in our engine, and the > Unix timestamp has real problems. No time zone, for example. > It is understood to be in UNC. If you need to convert, you do so with localtime() or equivalent. > Lots of content has a meaningful precision other than one second. > Press Releases are on a certain day. Books are published in a > particular month. Forcing meaningless precision on those things > is a mistake. In general, I prefer "too much" precision to too little. For example, if we need to display a timestamp for when this article was created in a consistent notation, we may include all the way down to the minute. If they have given us something like "June 1999", it places the onus on we, the receiver to round to the nearest day, hour, minute, second. The timestamp method places it on the sender, who should know more accurately. > Finally, the seconds thing totally falls apart if you need to express > dates outside it's tiny range: photograph taken in 1893, an HP atomic > clock app note written in 1964, etc. True.... but not many web pages were created before 1970, and this format is supposed to be describing web pages. (Site Summary) What do you think about this for a compromise, two different tags: ISO 6501 seconds since 1970, UNC Or alternatively: ... ... 
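A small illustration of that two-tag compromise, generated from a single time value; the element names and the helper name are only examples, not part of any agreed format:

import time

def item_dates(t):
    # ISO 8601 (UTC) plus the raw seconds-since-1970 value.
    iso = time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime(t))
    return '<date>%s</date><timestamp>%d</timestamp>' % (iso, int(t))
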
--------------EE201692D10CF2927E688DF2 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------EE201692D10CF2927E688DF2-- From larsga@ifi.uio.no Sat Jun 5 11:50:00 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 05 Jun 1999 12:50:00 +0200 Subject: [XML-SIG] Re: RSS and stuff In-Reply-To: <3755BACF.BDC54B15@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> Message-ID: * Lars Marius Garshol | | Sadly, though, many of the RSS files on the net are not even | well-formed. The ones for WebMonkey and python.org spring to mind. * Dan Libby | | I assume you mean they are not well-formed because they embed | entities? Actually, no. The WebMonkey file is not well-formed because the XML declaration does not begin the document (if they removed it all would be well; I've emailed them, but to no avail) and the python.org one is not well-formed because it has a ... pair. * Lars Marius Garshol | | Hmmm. Why not use a real XML schema? It should support everything I | can imagine you would want anyway. Or is it too complex? * Dan Libby | | 1) It's a spec. A very complex spec. I don't know of any software | that implements it. I don't have time to write such software, given | our development schedules which are measured in days. I just want | something that is flexible enough that we can change our format | without having to write a bunch of new code. When XML schemas are | well supported, then we should be able to move to those quite | easily, provided they have a superset of our functionality. | Besides, if I tried, I would probably end up with something that is | close to XML schemas, but not exact, so then we have unexpected | behavior, etc. This way, it is obviously not an xml schema, just | "Dan's validation rules" DTD. | 2) I may have just missed it, but I didn't see any support for | limiting length of strings. I don't think there is any. | 3) The time support is IS0 8601 only, which is itself a very | complicated subject. Walter Underwood and AMK have already dealt with this, so I'll just skip it here. * Lars Marius Garshol | | [on the topic of XML and data typing] | | Seriously, these things aren't as important as many people think. | And it's also worth remembering that XML comes from a document | background where such things are not all that relevant. * Dan Libby | | They are important to us. We need to store this stuff in a database. | We need to make sure some joker hasn't given us a string that is 20 | megabytes long, Sure, but in the original SGML context this wasn't a problem in the same way. | I think that as XML becomes used for data transfer, as opposed to | document transfer, people will be more and more concerned about | this. E-commerce especially is going to require a very specific set | of enforceable rules for validity. Definitely, and for this very reason I've been advocating that the W3C schema language should be extensible, so that the e-commerce and EDI communities (and other communities with special needs) can build on what's already defined. | For some reason, people tend to become very upset when money is | involved. ;-) Strange. 
Can't think why that would be. :) | [dates in RSS] | | Agreed. I had this in the original spec, but was removed for public | release, since we were not actually going to use the value. What do | you think of (seconds since 1970) instead? I don't like it. Most people will be authoring RSS by hand or generate it automatically from some hand-written source. When writing RSS by hand seconds since 1970 is out of the question and when generating it with XSL I don't think this transformation is possible. Also, seconds since 1970 is not human-readable or intuitive in any way. | Again, I'm not fond of parsing IS0 8601. A simple requirement like YYYYMMDD would be sufficient, I think. (Even not requiring anything at all should be acceptable, but in this case YYYYMMDD might be the best choice.) | [item descriptions in RSS] | | This should be possible. Again, we didn't support stuff like this | originally, because will not actually use the data in the | "description" tag anywhere on My Netscape, and because our (old) | validator code had to know about description rules for each location | it is used. As others are now using the format, I can see where it | would make sense, and it should be easy to add this as an optional | element if I can convince people to use my new validation code. Good! I'm crossing my fingers here. :) * Lars Marius Garshol | | A third thing is a place to put the email address of the maintainer so | that I know where to complain when a document isn't well-formed. * Dan Libby | | hmm. I assume you think this should be inside the tag? Yes. | This is where would be nice... Ouch, no. , perhaps. Dublin Core doesn't mandate the syntax of DC element contents, but using the email address here doesn't feel very right. Also: one thing I detest about this use of namespaces is that it gives you no choice in naming (except in the prefix, which I don't think should be abused). Something like: would be much better. * Lars Marius Garshol | | - "RSS 0.9 supports the full ASCII character set, as well as all | legal decimal and HTML entities. RSS 0.9 does not support other | types of character data, such as UTF-8. For a list of legal HTML and | decimal entities, refer to Special Symbols and Entities on DevEdge, | Netscape's information resource for developers." * Dan Libby | | We are updating this to support UTF-8 soon, and possibly other | encodings. Hmmm. Which parser(s) are you using? | I promise to post a DTD soon. ;-) Good. :) | [RSS and RDF] | | If you click on the "Future Directions" link in the quickstart | (http://my.netscape.com/publish/help/futures.html), I have an | example of the original RSS format I came up with, which does make | meaningful use of RDF (channels have IDs, all nodes connect, dublin | core is used, etc.) Hmmm. Maybe there's something about RDF I've missed, but this doesn't appear to be correct RDF either. Shouldn't the RDF document be just a sequence of RDF statements, with custom elements inside the statements? | However, apparently this "overly complicated". I think that's correct. Do you think this proposal would have caught on the way RSS 0.9 has? (Sometimes I think we should all re-read worse-is-better every morning. :) | [regarding posted RSS DTD] | | Thanks. I'll take a look at this, run it through a validating | parser, etc. Do you mind if we post it, or a slightly modified | version, as the "official" DTD? Not at all. Does this mean that I captured your view of RSS correctly? * Lars Marius Garshol | | * Dan Libby | | This implies ordering, correct? 
ie, title, then description, then | link? Yes. | A problem I had with DTDs is that I couldn't figure out how to say | that an element is required, and that ordering is unimportant. In XML there isn't any. Schemas currently allow this, as do SGML DTDs. You can do it by explicitly allowing choices between all the possible different sequences, but for n elements the number of sequences equals n factorial. | Therefore, if I posted this DTD now, it would mean that a whole | bunch of existing channels are invalid. Ouch. Not good. However, why did you allow any ordering? If the order doesn't matter it may as well be fixed, especially as this causes much less pain in specifying a DTD. I don't see the harm anywhere either. | The other option is to use (title | description | link), but this | means that they are optional, which is even less correct. I agree, this is an ugly problem, but it's mainly caused by being insufficently restrictive to begin with. | [XSA custom validator] | | What is this special validating software? Is it generic, or does it | know specifically about your format? If generic, what do you use as | input to define the validaton rules? I use a DTD as a declarative means of specifying the hard bits (allowed elements and nesting), and then Python code to deal with element content typing. (This is not generic at the moment. After reading the XML schemas draft I'm working on an implementation of the data types part which would be completely generic and not even depend on schemas.) Since the DTD handles everything except element content this works well and is really easy. Also, the DTD works well as documentation and people can also use it to guide XML-aware editors and so on. | My apologies if this is all explained in detail somwhere... ;-) It's not. :) --Lars M. From danda@netscape.com Sat Jun 5 22:48:30 1999 From: danda@netscape.com (Dan Libby) Date: Sat, 05 Jun 1999 14:48:30 -0700 Subject: [XML-SIG] Re: RSS and stuff References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> Message-ID: <37599B2E.A238ED59@netscape.com> This is a multi-part message in MIME format. --------------E13B6852273E859CE3A7C91D Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit > * Dan Libby > | > | They are important to us. We need to store this stuff in a database. > | We need to make sure some joker hasn't given us a string that is 20 > | megabytes long, > > Sure, but in the original SGML context this wasn't a problem in the > same way. > I'm not sure what you mean here. Why wasn't it a problem? Probably because people were using SGML to transfer "documents", rather than "data", and possibly because the publishers were always trusted? > Definitely, and for this very reason I've been advocating that the W3C > schema language should be extensible, so that the e-commerce and EDI > communities (and other communities with special needs) can build on > what's already defined. Yes! > | This is where would be nice... > > Ouch, no. , perhaps. Dublin Core doesn't mandate the > syntax of DC element contents, but using the email address here > doesn't feel very right. > Well, if used correctly in an RDF context, would just be an arc-label that refers to a node that represents you. That node would have other arc-labels named eg: email-address, first name, last name, country, etc. > | We are updating this to support UTF-8 soon, and possibly other > | encodings. > > Hmmm. Which parser(s) are you using? > Errg. xmlproc. (XMLValParserFactory.make_parser()). 
I've been talking to Jose, our i18n guy, and it sounds like Python is not internally UTF-8 compliant, but he isn't concerned for some reason... > | If you click on the "Future Directions" link in the quickstart > | (http://my.netscape.com/publish/help/futures.html), I have an > | example of the original RSS format I came up with, which does make > | meaningful use of RDF (channels have IDs, all nodes connect, dublin > | core is used, etc.) > > Hmmm. Maybe there's something about RDF I've missed, but this doesn't > appear to be correct RDF either. Shouldn't the RDF document be just a > sequence of RDF statements, with custom elements inside the statements? > RDF is about a directed labelled graph. As long as you comply with that data model, the actual name of the elements (vocabulary) is secondary. Check out some of the .rdf's at mozilla.org. It looks similar. Also, I validated it with SirPac (http://www.w3.org/RDF/Implementations/SiRPAC/) and with our chief rdf guru, guha (whose name appears on xml-schema docs, etc). (Note: the version of Sirpac currently installed has a bug that causes the visualized graph to be disconnected) > Not at all. Does this mean that I captured your view of RSS correctly? Well, close. I removed the ordering dependencies you had, and also added support for HTML 3.2 entities. I'll post a draft soon. -dan --------------E13B6852273E859CE3A7C91D Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------E13B6852273E859CE3A7C91D-- From danda@netscape.com Mon Jun 7 04:10:57 1999 From: danda@netscape.com (Dan Libby) Date: Sun, 06 Jun 1999 20:10:57 -0700 Subject: [XML-SIG] xmlproc, dtd's, and such References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> Message-ID: <375B3840.68F36912@netscape.com> This is a multi-part message in MIME format. --------------E992E119899C2F89F9E0AF08 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Okay, so I'm using xmlproc for some DTD based validation. However, I don't want to go off to the network every time I have to validate a new file, which means I will have to cache locally somehow. I saw an earlier thread on this topic which seemed to indicate that this should be easy, but it didn't actually elaborate. Can anyone tell me specifically what class/methods to override? One approach I might imagine is that the parser would call some sort of openDTD() function that I could override. In there, I would have a map from the public DTD url to a local file. Alternatively, there could already be some pre-built caching code. Further, ideally, I would like to do all this through the sax interface in a non parser specific manner. Is that asking too much? 
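A sketch of the kind of lookup described above, built on the SAX EntityResolver interface shown in Lars's reply below; the public identifier, local filename, and document name are made up for illustration:

from xml.sax import saxexts

# Hypothetical mapping from public identifiers to local DTD copies.
LOCAL_DTDS = {
    '-//Netscape Communications//DTD RSS 0.9//EN': 'rss-0.9.dtd',
}

class CachingResolver:
    def resolveEntity(self, publicId, systemId):
        # Serve a cached local copy when we have one; otherwise let the
        # parser fetch the original system identifier.
        return LOCAL_DTDS.get(publicId, systemId)

parser = saxexts.make_parser('xml.sax.drivers.drv_xmlproc_val')
parser.setEntityResolver(CachingResolver())
parser.parse('feed.xml')
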
-dan --------------E992E119899C2F89F9E0AF08 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------E992E119899C2F89F9E0AF08-- From larsga@ifi.uio.no Mon Jun 7 06:57:11 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 07 Jun 1999 07:57:11 +0200 Subject: [XML-SIG] xmlproc, dtd's, and such In-Reply-To: <375B3840.68F36912@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> Message-ID: * Dan Libby | | Okay, so I'm using xmlproc for some DTD based validation. However, | I don't want to go off to the network every time I have to validate | a new file, which means I will have to cache locally somehow. Basically, what determines where xmlproc will look for the DTD is the document itself and the public and system identifiers in the DOCTYPE declaration. If you set those correctly, xmlproc will look for the DTD where you want. You can also use a catalog file to control the resolution of the public identifier, but in SAX 1.0 there is no standard way to give the parser a pointer to the catalog file. If you don't trust the system and public identifiers and want to control this yourself you can use the EntityResolver interface. Here's an example: from xml.sax import saxexts class EntityResolver: def resolveEntity(self, publicId, systemId): print "PUBID: "+`publicId`+"\tSYSID: "+`systemId` return systemId parser=saxexts.make_parser("xml.sax.drivers.drv_xmlproc_val") parser.setEntityResolver(EntityResolver()) parser.parse("test.xml") The first call to resolveEntity will be for the external DTD subset and if you want to control where that is read from, just return the system identifier you want to use. (If you want to use a catalog file in a standard way at the moment, this is how. xmlproc comes with a SAX EntityResolver which reads and uses a catalog file.) I hope this helped, --Lars M. From larsga@ifi.uio.no Mon Jun 7 07:06:56 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 07 Jun 1999 08:06:56 +0200 Subject: [XML-SIG] Re: RSS and stuff In-Reply-To: <37599B2E.A238ED59@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> Message-ID: * Dan Libby | | I'm not sure what you mean here. Why wasn't it a problem? Probably | because people were using SGML to transfer "documents", rather than | "data", and possibly because the publishers were always trusted? Both reasons applied, yes. Most SGML applications were in-house, and so employees were usually trusted not to play dirty tricks, although one did want to check for mistakes. * Lars Marius Garshol | | Ouch, no. , perhaps. Dublin Core doesn't mandate the | syntax of DC element contents, but using the email address here | doesn't feel very right. * Dan Libby | | Well, if used correctly in an RDF context, would just | be an arc-label that refers to a node that represents you. That | node would have other arc-labels named eg: email-address, first | name, last name, country, etc. 
In an RDF context it would be different, but I fear you're giving up on RSS being easy to author, support and understand then. And I still don't really like that way of reusing the semantics of the Dublin Core creator element. | Errg. xmlproc. (XMLValParserFactory.make_parser()). I've been | talking to Jose, our i18n guy, and it sounds like Python is not | internally UTF-8 compliant, but he isn't concerned for some | reason... Well, xmlproc should parse UTF-8 files just fine at the moment, as long as you don't use characters above 127 in names, name tokens or character references. Your application can then view the strings it gets from xmlproc as byte arrays and simply do its own Unicode handling to whatever extent is needed. (You can even handle character references yourself by overriding a method in xmlproc and passing UTF-8-encoded characters to the application.) --Lars M. From danda@netscape.com Tue Jun 8 05:17:44 1999 From: danda@netscape.com (Dan Libby) Date: Mon, 07 Jun 1999 21:17:44 -0700 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc References: <3754E5B9.96A9FD54@netscape.com> Message-ID: <375C9968.3574D93@netscape.com> This is a multi-part message in MIME format. --------------9D9B38E890EEB29BE54CC4B6 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit 1) The version of xmlproc I have does not appear to support any encoding other than "iso-8859-1". It returns an error for any other value. Before, when we were using xmllib, it simply called handle_xml(), where we were able to look at the encoding value and make appropriate decisions at the application level. Does xmlproc have any equivalent functionality, perhaps in a more recent version? 2) Explanation: I need to preserve XML/HTML entities. For example, if the document contains % then I want to print that out exactly, not the parsed/converted value. If I don't do this, then any random person can embed html markup, etc, which could break an HTML page. This was pretty easy using xmllib - a non-validating parser, because it simply calls my handler for all entities it encounters and I can provide the mapping. However, with xmlproc, it doesn't seem to call any callback that I can find, it simply looks up the entity in its map and returns it, or else spits out an error 3021: Undeclared Entity. So my question is: Is there a suggested workaround? I suppose I could always pre-process the document before giving it to the parser, but that seems pretty messy. -dan --------------9D9B38E890EEB29BE54CC4B6 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------9D9B38E890EEB29BE54CC4B6-- From larsga@ifi.uio.no Tue Jun 8 07:15:07 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 08 Jun 1999 08:15:07 +0200 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc In-Reply-To: <375C9968.3574D93@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> Message-ID: * Dan Libby | | 1) The version of xmlproc I have does not appear to support any | encoding other than "iso-8859-1". This is correct. 
(Well, US-ASCII will also work, as will anything else that is based on US-ASCII as long as you don't try to use funny characters in names or name tokens.) | [...] Before, when we were using xmllib, it simply called | handle_xml(), where we were able to look at the encoding value and | make appropriate decisions at the application level. Does xmlproc | have any equivalent functionality, perhaps in a more recent version? The functionality is there, but not used at the moment. If you look at the charconv module you'll see that it contains conversion code for various encodings as well as registry object for converters. If you want I can easily add the hooks that would let you use this functionality. The reason I haven't done this so far is that there seemed to be no demand for this functionality. | 2) Explanation: I need to preserve XML/HTML entities. For example, | if the document contains % then I want to print that out | exactly, not the parsed/converted value. If I don't do this, then | any random person can embed html markup, etc, which could break an | HTML page. Hmmm. The cleanest solution to this (from an XML/SGML point of view) is probably to use string.replace to escape all '<'s in character data when it is passed to you from the parser. That would also let you retain parser independence and is cleaner in the sense that it becomes more obvious what you're really doing. | However, with xmlproc, it doesn't seem to call any callback that I | can find, it simply looks up the entity in its map and returns it, | or else spits out an error 3021: Undeclared Entity. So my question | is: Is there a suggested workaround? If you don't like the solution above you may want to subclass XMLProcessor in xmlproc.py and write your own versions of parse_charref and parse_ent_ref. Instead of rewriting parse_ent_ref you could also just declare the entities you need in the DTD, and break into the entity hashtable and modify the value of '<'. (I can show you how.) If you don't like any of these solutions, let me know, and we'll think of something. Also: do you need an option to disallow element and attribute declarations in the internal subset? --Lars M. From danda@netscape.com Tue Jun 8 11:22:38 1999 From: danda@netscape.com (Dan Libby) Date: Tue, 08 Jun 1999 03:22:38 -0700 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> Message-ID: <375CEEEE.5CDB4E8F@netscape.com> This is a multi-part message in MIME format. --------------131E02F85FB01E50934F9D12 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Lars Marius Garshol wrote: > * Dan Libby > | > | 1) The version of xmlproc I have does not appear to support any > | encoding other than "iso-8859-1". > > This is correct. (Well, US-ASCII will also work, as will anything else > that is based on US-ASCII as long as you don't try to use funny > characters in names or name tokens.) > > | [...] Before, when we were using xmllib, it simply called > | handle_xml(), where we were able to look at the encoding value and > | make appropriate decisions at the application level. Does xmlproc > | have any equivalent functionality, perhaps in a more recent version? > > The functionality is there, but not used at the moment. If you look at > the charconv module you'll see that it contains conversion code for > various encodings as well as registry object for converters. 
Yes, I saw that while I was grepping for something or other and figured it looked interesting, but was not sure how to plug it in. > If you want I can easily add the hooks that would let you use this > functionality. The reason I haven't done this so far is that there > seemed to be no demand for this functionality. > I would appreciate that. (Consider this 'demand') Actually, Jose is more the demand than I am. Those crazy i18n guys... ;-) If it is a simple change, perhaps you can just send us a diff or something? > | 2) Explanation: I need to preserve XML/HTML entities. For example, > | if the document contains % then I want to print that out > | exactly, not the parsed/converted value. If I don't do this, then > | any random person can embed html markup, etc, which could break an > | HTML page. > > Hmmm. The cleanest solution to this (from an XML/SGML point of view) > is probably to use string.replace to escape all '<'s in character data > when it is passed to you from the parser. That would also let you > retain parser independence and is cleaner in the sense that it becomes > more obvious what you're really doing. > Yes, that is actually the solution I came up with also. It doesn't really seem that clean to me, because if there is a character above 127 that we want to replace with an entity, it gets funny depending on which encoding is in use. Whereas in the old model, we simply had a map from eg "180" to "´" that we returned to the parser and similarly things like "quot" to "&quot;". I tried doing this with entity declarations in the DTD and xmlproc just for kicks. It would allow it for character based entity names, but didn't allow any names starting with a numeric. That means that < would still slip by, even though we could catch < eg: > If you don't like the solution above you may want to subclass > XMLProcessor in xmlproc.py and write your own versions of > parse_charref and parse_ent_ref. > yeah... icky. I like being parser independent. ;-) > Instead of rewriting parse_ent_ref you could also just declare the > entities you need in the DTD, and break into the entity hashtable and > modify the value of '<'. (I can show you how.) > I think that is what I just mentioned trying above, but maybe you mean something else? > If you don't like any of these solutions, let me know, and we'll think > of something. > Replacing afterwards seems to work ok. Really we are mostly just concerned with the "<" and ">". > Also: do you need an option to disallow element and attribute > declarations in the internal subset? Sorry, I'm not sure what this means. What is the internal subset? BTW, Lars, I saw your name in an XML book my roommate just picked up. I forget the title, but it listed xmlproc. Oh, and just now I saw my friend Jim's name on the python profiling page. Totally random! 
cheers, -dan --------------131E02F85FB01E50934F9D12 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------131E02F85FB01E50934F9D12-- From larsga@ifi.uio.no Tue Jun 8 11:40:09 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 08 Jun 1999 12:40:09 +0200 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc In-Reply-To: <375CEEEE.5CDB4E8F@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> <375CEEEE.5CDB4E8F@netscape.com> Message-ID: * Dan Libby | | [charconv.py] | | I would appreciate that. (Consider this 'demand') OK. This is a very simple change, and I've written the code before, so I should be able to do this in a couple of days (am very busy at the moment, and only write email while waiting for compiles and such). | If it is a simple change, perhaps you can just send us a diff or | something? You'll get a ZIP file with 0.61.1 in it. (Easier, I think.) * Lars Marius Garshol | | Hmmm. The cleanest solution to this (from an XML/SGML point of view) | is probably to use string.replace to escape all '<'s in character | data when it is passed to you from the parser. That would also let | you retain parser independence and is cleaner in the sense that it | becomes more obvious what you're really doing. * Dan Libby | | Yes, that is actually the solution I came up with also. It doesn't | really seem that clean to me, because if there is a character above | 127 that we want to replace with an entity, it gets funny depending | on which encoding is in use. Well, you control the encoding (after it's gone through xmlproc), so this shouldn't be a problem. | Whereas in the old model, we simply had a map from eg "180" to | "´" that we returned to the parser and similarly things like | "quot" to "&quot;". What you're doing here is letting code control the interpretation of the document, which isn't really all that clean. With and without custom code the document would be different when parsed. Simply remapping characters in the output is IMHO a lot cleaner in that the separation between code and document is clear. | I tried doing this with entity declarations in the DTD and xmlproc | just for kicks. It would allow it for character based entity names, | but didn't allow any names starting with a numeric. This is because < is not an entity reference, it's a direct reference to the Unicode character U+0074, and so it's no wonder that you're not allowed to define such an entity. | I like being parser independent. ;-) Good! It bothers me that most people seem to prefer being chained to whatever product they're using (parser, database, whatever). * Lars Marius Garshol | | Also: do you need an option to disallow element and attribute | declarations in the internal subset? * Dan Libby | | Sorry, I'm not sure what this means. What is the internal subset? Here's an example: is the internal subset --> ]> Vondt! OK! Godt! ... xmlproc and all other validating parsers would let this pass with no complaints at all. I suppose you may not want that. Oh, and BTW, can I list My Netscape on the xmlproc page as xmlproc users? (I'm sure you're desperate for the extra hits. :) --Lars M. 
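A minimal sketch of the escaping Lars suggests, in the string-module style used elsewhere in this thread (the helper name is invented; '&' is replaced first so the entity references produced by the later substitutions are not escaped twice):

import string

def escape_chardata(text):
    # Escape markup-significant characters in character data before it is
    # embedded in an HTML page.
    text = string.replace(text, "&", "&amp;")
    text = string.replace(text, "<", "&lt;")
    text = string.replace(text, ">", "&gt;")
    return text

Running every characters() callback through a helper like this keeps the check in one place and stays independent of which parser produced the data.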
From jim@digicool.com Tue Jun 8 19:28:54 1999 From: jim@digicool.com (Jim Fulton) Date: Tue, 08 Jun 1999 18:28:54 +0000 Subject: [XML-SIG] While we're on the subject of xmlproc, DTDs and validation ... References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> Message-ID: <375D60E6.3200CF2F@digicool.com> Some musings.... I'd like to have a very fast and simple parser that can do validation. I'm looking at: - Using (or stealing parts of) xmlproc to parse DTDs, - Using pyexpat, - Writing a C thing that does the validation using data structures (possibly derived from data structures) produced by xmlproc. - Writing a simple C thing that plugs into the C validator, which plugs into pyexpat and takes tables of start and end tag handlers and processes XML to produce Python objects. I've modified pyexpat so that it will spit out the DTD info. (I plan to post an updated pyexpat that implements the full C expat interface defined in the latest stable expat release, unless someone beats me to it. ;) I find that if I tell xmlproc to parse a file containing only a DTD, it will build the DTD related data structures for me, but: - I wonder if there is or should be a tool designed just to do this. Maybe there already is one that I've missed. - Can I rely on the data structures created by the current xmlproc? I'd like to have a tool for processing DTDs independent of parsing XML: - To make it possible to bolt validation onto non-validating parsers, - To separate implementation of validation from implementation of basic parsing and from application object building code. For example, I think handlers that build application objects can be alot simpler if they don't have to check validity. - Allow applications to provide DTDs for documents that don't have them (e.g. xml-rpc marchals). Thoughts? Jim -- Jim Fulton mailto:jim@digicool.com Python Powered! Technical Director (888) 344-4332 http://www.python.org Digital Creations http://www.digicool.com http://www.zope.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list with out my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From danda@netscape.com Tue Jun 8 21:55:44 1999 From: danda@netscape.com (Dan Libby) Date: Tue, 08 Jun 1999 13:55:44 -0700 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> <375CEEEE.5CDB4E8F@netscape.com> Message-ID: <375D8350.6F0F46C6@netscape.com> This is a multi-part message in MIME format. --------------581B652EDB59D0C2E09B5BB5 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit > | Sorry, I'm not sure what this means. What is the internal subset? > > Here's an example: > > > > > ]> > > > > Vondt! > OK! > Godt! > > > ... > > > > xmlproc and all other validating parsers would let this pass with no > complaints at all. I suppose you may not want that. > Oh, I see. Yeah, I wondered what would happen in that case. You're right that we wouldn't want it, however I have a secondary pseudo-schema checker that would not allow the unknown tags, so its not really a problem for us. Further, since "channel" was already defined in the external dtd, shouldn't that generate an error, or does the parser just override it with the internal subset definition? 
> Oh, and BTW, can I list My Netscape on the xmlproc page as xmlproc > users? (I'm sure you're desperate for the extra hits. :) Sure; it's not actually in production yet, but I'm certainly using it. ;-) -dan --------------581B652EDB59D0C2E09B5BB5 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------581B652EDB59D0C2E09B5BB5-- From danda@netscape.com Tue Jun 8 21:58:52 1999 From: danda@netscape.com (Dan Libby) Date: Tue, 08 Jun 1999 13:58:52 -0700 Subject: [XML-SIG] While we're on the subject of xmlproc, DTDs and validation ... References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> Message-ID: <375D840C.3AD7C01C@netscape.com> This is a multi-part message in MIME format. --------------A93EB40EB4CDB212C19ACA84 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit > > - Allow applications to provide DTDs for documents that don't > have them (e.g. xml-rpc marchals). > oh! This would be cool. For RSS 0.9, we didn't require a DTD, but now I'm validating against one. So basically I'm pre-processing the buffer and inserting a DTD. Kind of a hack, but it works. I'd prefer a general solution. -dan --------------A93EB40EB4CDB212C19ACA84 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------A93EB40EB4CDB212C19ACA84-- From larsga@ifi.uio.no Tue Jun 8 23:26:53 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 09 Jun 1999 00:26:53 +0200 Subject: [XML-SIG] Re: While we're on the subject of xmlproc, DTDs and validation ... In-Reply-To: <375D60E6.3200CF2F@digicool.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> Message-ID: * Jim Fulton | | I'd like to have a very fast and simple parser that can do | validation. Hmmm. Maybe a better option than what you've been looking at would be RXP, which is an all-C validating parser. It's a little bit slower than expat, but that should drown in the time occupied by the Python callbacks anyway. I've been thinking about writing a Python interface to RXP, but am not really into C extensions yet and haven't got the time at the moment. | - Using (or stealing parts of) xmlproc to parse DTDs, This is easily possible, and it will buy you some performance, although probably not as much as you'd wish. (Especially for large DTDs xmlproc is slow.) | (I plan to post an updated pyexpat that implements the full | C expat interface defined in the latest stable expat release, | unless someone beats me to it. ;) Great! When you do I'll update the SAX driver. 
| I find that if I tell xmlproc to parse a file containing only a DTD, | it will build the DTD related data structures for me, but: | | - I wonder if there is or should be a tool designed | just to do this. Maybe there already is one that I've | missed. xmlproc comes with a dtdparser.py module which gives you an event-based interface to DTDs. Combined with the classes in xmldtd.py this gives you the ability to parse a DTD without an associated document. Look in the demo directory for dtddoc.py, which is an example of this. | - Can I rely on the data structures created by the current | xmlproc? Sorry, I don't understand the question. What do you mean by 'rely'? | I'd like to have a tool for processing DTDs independent of | parsing XML: | | [excellent reasons snipped] Yup. These were all part of my motivation for making the DTD parsing module of xmlproc separate from the rest. --Lars M. From jim@digicool.com Wed Jun 9 02:25:45 1999 From: jim@digicool.com (Jim Fulton) Date: Wed, 09 Jun 1999 01:25:45 +0000 Subject: [XML-SIG] Re: While we're on the subject of xmlproc, DTDs and validation ... References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> Message-ID: <375DC299.42477A0A@digicool.com> Lars Marius Garshol wrote: > > * Jim Fulton > | > | I'd like to have a very fast and simple parser that can do > | validation. > > Hmmm. Maybe a better option than what you've been looking at would be > RXP, which is an all-C validating parser. > > I'll check it out. I'm a little bit worried about the license, which is GPL. Maybe I can get him to change it to LGPL. > It's a little bit slower than expat, but that should drown in the time > occupied by the Python callbacks anyway. True, although for alot of our projects, we'll probably write many (most?) of the callbacks in C. > I've been thinking about writing a Python interface to RXP, but am not > really into C extensions yet and haven't got the time at the moment. > > | - Using (or stealing parts of) xmlproc to parse DTDs, > > This is easily possible, and it will buy you some performance, > although probably not as much as you'd wish. (Especially for large > DTDs xmlproc is slow.) In Most cases, I'd expect to amortize DTD parsing over many documents, either by preprocessing standard DTDs or catching DTDs. (snip) > | I find that if I tell xmlproc to parse a file containing only a DTD, > | it will build the DTD related data structures for me, but: > | > | - I wonder if there is or should be a tool designed > | just to do this. Maybe there already is one that I've > | missed. > > xmlproc comes with a dtdparser.py module which gives you an > event-based interface to DTDs. Combined with the classes in xmldtd.py > this gives you the ability to parse a DTD without an associated > document. I suspected this, but I had trouble figuring out the interface. > Look in the demo directory for dtddoc.py, which is an > example of this. Ah, thanks. That should help alot. > | - Can I rely on the data structures created by the current > | xmlproc? > > Sorry, I don't understand the question. What do you mean by 'rely'? I'll write something that takes as input the data structures created internally whan xmlproc parses a document. If you change those data structures, my software will break. :) > | I'd like to have a tool for processing DTDs independent of > | parsing XML: > | > | [excellent reasons snipped] > > Yup. 
These were all part of my motivation for making the DTD parsing > module of xmlproc separate from the rest. Cool. Jim -- Jim Fulton mailto:jim@digicool.com Technical Director (888) 344-4332 Python Powered! Digital Creations http://www.digicool.com http://www.python.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list with out my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From tpassin@idsonline.com Wed Jun 9 04:46:15 1999 From: tpassin@idsonline.com (Thomas B. Passin) Date: Tue, 8 Jun 1999 23:46:15 -0400 Subject: [XML-SIG] While we're on the subject of xmlproc, DTDs and validation ... Message-ID: <003101beb22a$a0fa95a0$1c15b0cf@tpassinids> Jim Fulton wrote > ..Allow applications to provide DTDs for documents that don't have them (e.g. xml-rpc marchals). Yes, I think this can be very useful. But if you reverse-engineer a DTD from any existing document, there is no unique solution. The program will therefore try to guess what it should do, and the result will have to be hand-adjusted to make it more usable. I found a product (I forget right now which one) that can create a DTD from an XML example, and tried it. Interesting results, and I had to work on the DTD by hand. Tom Passin From jim@digicool.com Wed Jun 9 12:27:42 1999 From: jim@digicool.com (Jim Fulton) Date: Wed, 09 Jun 1999 11:27:42 +0000 Subject: [XML-SIG] While we're on the subject of xmlproc, DTDs and validation ... References: <003101beb22a$a0fa95a0$1c15b0cf@tpassinids> Message-ID: <375E4FAE.559FC624@digicool.com> "Thomas B. Passin" wrote: > > Jim Fulton wrote > > > ..Allow applications to provide DTDs for documents that don't > have them (e.g. xml-rpc marchals). > > Yes, I think this can be very useful. But if you reverse-engineer a DTD from any existing document, there is no unique solution. That's not what I'm thinking of. My application might require data that follows a DTD, but I might not require incoming data to include the DTD (or even a reference to it). There are XML formats (e.g. XML-RPC) around that are precisely defined, but not with a DTD. I can come up with a DTD for them and validate conforming data that, of course, doesn't include or reference a DTD. Also, by separating DTD parsing from validation, there could be a way of using schema data in other formats (e.g. RSS schema?), as long as the other formats could be parsed into the same data structures that DTD's get parsed to. Jim -- Jim Fulton mailto:jim@digicool.com Technical Director (888) 344-4332 Python Powered! Digital Creations http://www.digicool.com http://www.python.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list with out my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From Ted.Horst@wdr.com Wed Jun 9 15:27:26 1999 From: Ted.Horst@wdr.com (Ted Horst) Date: Wed, 9 Jun 99 09:27:26 -0500 Subject: [XML-SIG] Re: While we're on the subject of xmlproc, DTDs and validation ... In-Reply-To: <375DC299.42477A0A@digicool.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> <375DC299.42477A0A@digicool.com> Message-ID: <199906091427.AA19568@ch1d2833nwk> You might also check out the xml parser in ILU. 
It is an all C validationg parser as well, and the license is less restrictive. ftp://ftp.parc.xerox.com/pub/ilu/ilu.html Ted Horst On Wed, 09 Jun 1999, Jim Fulton wrote: > Lars Marius Garshol wrote: > > > > * Jim Fulton > > | > > | I'd like to have a very fast and simple parser that can do > > | validation. > > > > Hmmm. Maybe a better option than what you've been looking at would be > > RXP, which is an all-C validating parser. > > > > > > I'll check it out. I'm a little bit worried about the license, > which is GPL. Maybe I can get him to change it to LGPL. From Fred L. Drake, Jr." References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> Message-ID: <14176.12088.820355.924645@weyr.cnri.reston.va.us> Lars Marius Garshol writes: > be well; I've emailed them, but to no avail) and the python.org one is > not well-formed because it has a ... pair. I can't find this in our work area or on the server, so this problem has aged away. ;-) -Fred -- Fred L. Drake, Jr. Corporation for National Research Initiatives From larsga@ifi.uio.no Thu Jun 10 22:54:08 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 10 Jun 1999 23:54:08 +0200 Subject: [XML-SIG] Re: While we're on the subject of xmlproc, DTDs and validation ... In-Reply-To: <375DC299.42477A0A@digicool.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> <375DC299.42477A0A@digicool.com> Message-ID: * Jim Fulton | | In Most cases, I'd expect to amortize DTD parsing over many | documents, either by preprocessing standard DTDs or catching DTDs. I've looked at this and found it to not be entirely straightforward due to the problems introduced by the internal subset. Also, early tests showed that the speedup from using pickle to load DTD objects was just by a factor of 4 (if I remember correctly) over normal DTD parsing. Anyway, if you disallow the internal subset I have most of the code necessary to do this written although no integrated yet. | [using dtdparser.py] | | I suspected this, but I had trouble figuring out the interface. Feel free to ask if you find the documentation hard to understand. That will help me improve it (once I have the time to do so, at least). | [reliability of xmldtd.py structure] | | I'll write something that takes as input the data structures created | internally whan xmlproc parses a document. If you change those data | structures, my software will break. :) The intention is that all documented APIs in xmlproc should remain unchanged as far as possible (although they can be extended and also change semantics in backward-compatible ways), and although there are many things I would like to clean up I refrain from doing so. So, yes, you should be able to rely on them. --Lars M. From danda@netscape.com Thu Jun 10 23:08:41 1999 From: danda@netscape.com (Dan Libby) Date: Thu, 10 Jun 1999 15:08:41 -0700 Subject: [XML-SIG] Re: While we're on the subject of xmlproc, DTDs and validation ... References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> <375DC299.42477A0A@digicool.com> Message-ID: <37603769.D0A16E7C@netscape.com> This is a multi-part message in MIME format. 
--------------FC7C188F8BF054F1DA0AE26A Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit > I've looked at this and found it to not be entirely straightforward > due to the problems introduced by the internal subset. Also, early > tests showed that the speedup from using pickle to load DTD objects > was just by a factor of 4 (if I remember correctly) over normal DTD > parsing. > Well... excluding potentially having to grab the file off the network somewhere, which would be the slowest operation. That's why in my code I check if the external DTD is in my map, and if so, use a local copy. If the pickling sped it up by another factor of 4, that would be great. > Anyway, if you disallow the internal subset I have most of the code > necessary to do this written although no integrated yet. > I would be very interested in this. -dan --------------FC7C188F8BF054F1DA0AE26A Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------FC7C188F8BF054F1DA0AE26A-- From jim@digicool.com Fri Jun 11 13:45:19 1999 From: jim@digicool.com (Jim Fulton) Date: Fri, 11 Jun 1999 08:45:19 -0400 Subject: [XML-SIG] Re: While we're on the subject of xmlproc, DTDs and validation ... References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> <375DC299.42477A0A@digicool.com> Message-ID: <376104DF.EE7367A3@digicool.com> Lars Marius Garshol wrote: > > * Jim Fulton > > | [using dtdparser.py] > | > | I suspected this, but I had trouble figuring out the interface. > > Feel free to ask if you find the documentation hard to understand. > That will help me improve it (once I have the time to do so, at > least). Actually, I somehow failed to notice the xmlproc directory in the doc directory, so I missed the docs altogether. Jim -- Jim Fulton mailto:jim@digicool.com Python Powered! Technical Director (888) 344-4332 http://www.python.org Digital Creations http://www.digicool.com http://www.zope.org Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email address may not be added to any commercial mail list with out my permission. Violation of my privacy with advertising or SPAM will result in a suit for a MINIMUM of $500 damages/incident, $1500 for repeats. From larsga@ifi.uio.no Fri Jun 11 14:17:48 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 11 Jun 1999 15:17:48 +0200 Subject: [XML-SIG] Re: While we're on the subject of xmlproc, DTDs and validation ... In-Reply-To: <37603769.D0A16E7C@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <3755BACF.BDC54B15@netscape.com> <37599B2E.A238ED59@netscape.com> <375B3840.68F36912@netscape.com> <375D60E6.3200CF2F@digicool.com> <375DC299.42477A0A@digicool.com> <37603769.D0A16E7C@netscape.com> Message-ID: * Lars Marius Garshol | | Anyway, if you disallow the internal subset I have most of the code | necessary to do this written although no integrated yet. * Dan Libby | | I would be very interested in this. Then I'll try to get together an xmlproc 0.62 with this and the charconv stuff. 
If I can find a free night during the weekend I'll sit down and do all these other little things that have been piling up and put out a slew of new releases. --Lars M. From fredrik@pythonware.com Mon Jun 14 16:32:46 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 14 Jun 1999 17:32:46 +0200 Subject: [XML-SIG] Re: RSS and stuff References: <3754E5B9.96A9FD54@netscape.com> Message-ID: <00e101beb67b$27814b10$f29b12c2@pythonware.com> Dan wrote: > What would you like to see / not see in the format? It really is just > supposed to be a summary. Ideally, we would like to support all of > Dublin Core eventually, but the problem is that the additional data may > not actually be used, and marketing folks felt it would be simpler to > not confuse folks too much. just noticed that the my.userland.com folks are also discussing RDF extensions and supersets. if anyone's interested, check: http://discuss.userland.com/msgReader$7333 http://alchemy.openjava.org/ocs/ From Jeff Rush" (DOM) => Python Object Recipes Message-ID: <199906170716.3331398.6@summit-research.com> I'm just starting to get into XML and just joined this list. I'm working on a Python agent-program that visits bank web pages and fetches checkbook registers, parsing the HTML via the python-xml-0.5.1 stuff into a DOM tree. When finished, it will then spit some DTD flavor of XML into a digitally-signed/encrypted email msg. What I'm looking for is better extraction of HTML tables. Has anyone written a good class for that? I've got a crude one, but am hoping others have done extensive parsing of pages using XML and developed a toolkit. -Jeff Rush ----- cut here ----- class ExtractTable(xml.dom.walker.Walker): def __init__(self, tablenode, trim=0, headings=1, allrows=1): self.rows = [] self.row = [] self.text = "" self.nowhitespace = trim self.keepheadings = headings self.allrows = allrows self.walk(tablenode) def startElement(self, node): if node.get_nodeName() == 'TR': self.row = [] elif self.keepheadings and node.get_nodeName() == 'TH': self.text = "" self.row.append({}) elif node.get_nodeName() == 'TD': self.text = "" self.row.append({}) def endElement(self, node): if self.keepheadings and node.get_nodeName() == 'TH': self.row[-1].update( {'type': 'header', 'value': self.text} ) elif node.get_nodeName() == 'TD': self.row[-1].update( {'type': 'data', 'value': self.text} ) elif node.get_nodeName() == 'A' : self.row[-1]['link'] = node.getAttribute('HREF') elif node.get_nodeName() == 'TR': if self.allrows or len(self.row) > 0: self.rows.append(self.row) def doText(self, node): str = node.get_data() while len(str) and str[0] in ('\r', '\n'): str = str[1:] if self.nowhitespace: str = string.strip(str) self.text = self.text + str def doComment(self, node): pass def doOtherNode(self, node): str = { 'nbsp': ' ' }.get(node.get_nodeName(), None) if str is not None: self.text = self.text + str def ExtractLinks(topnode): """Scan and extract all links in given subtree of HTML page""" links = [] for node in topnode.getElementsByTagName('A'): url = node.getAttribute('HREF') if url: links.append(url) return links ----- cut here ----- From Jeff Rush" I've checked the XML-SIG mailing list archives and the latest CVS for updates to dom/transformer.py but didn't see any. Hence... Bug #1: Throughout the dom/transformer.py, reference is made to 'NodeType' but the correct name is 'nodeType'. 
Bug #2: While trying to create a subclass of Transformer, in order to strip out HTML formatting/graphics tags, I hit a problem where v0.5.1 of Transformer won't modify the DOM tree it walks. ----- old code ----- new_children = [] for child in node.getChildren(): new_children = new_children + self._transform_node(child) node._children = new_children ----- old code ----- Nodes don't have a '_children' attribute and besides, this doesn't update the node's parentdict, hence any changes are not seen by the higher DOM tree levels. ----- new code ------ new_children = [] for child in node.childNodes: new_children = new_children + self._transform_node(child) for child in node.childNodes[:] : # Remove Old Children node.removeChild(child) for child in new_children: # And Replace with (0 or more) New node.appendChild(child) ----- new code ----- Suggestion #1: Define a __call__ method in the Transformer class that calls the existing transform method, so the following works: class FormatStripper(Transformer): .... strip_formatting = FormatStripper() strip_formatting(doc) I can now write my stripping transformers as: ---------- cut here ---------- class FormatStripper(xml.dom.transformer.Transformer): def do_FONT(self, node): return node.childNodes def do_B(self, node): return node.childNodes def do_I(self, node): return node.childNodes strip_formatting = FormatStripper() class GraphicsStripper(Transformer): def do_HR(self, node): return [] # Remove Horizontal Rules def do_IMG(self, node): return [] # Remove Images def do_MAP(self, node): return [] # Remove Image Maps def do_BODY(self, node): node.removeAttribute("BACKGROUND") node.removeAttribute("BGCOLOR") return [node] strip_graphics = GraphicsStripper() .... doc = strip_formatting( strip_graphics( doc ) ) ---------- cut here ---------- If acceptable, I'd like to see some form of these added to the dom.utils module; they seem to fit in with the strip_whitespace function. -Jeff Rush From steynj@postino.up.ac.za Sat Jun 19 14:30:45 1999 From: steynj@postino.up.ac.za (Jacques Steyn) Date: Sat, 19 Jun 1999 15:30:45 +0200 Subject: [XML-SIG] Inquiry: Python Message-ID: <376B9B85.2627308B@postino.up.ac.za> How can one obtain the Python XML software? Thanks Jacques -- ______________________________________________ Jacques Steyn (PhD) Associate Professor: Multimedia Department of Information Science School for Information Technology University Pretoria Pretoria South Africa Tel +27 12 420 4258 Fax +27 12 362 5181 Email: jsteyn@up.ac.za Web: Information Science http://is.up.ac.za School for Information Technology http://sit.up.ac.za From larsga@ifi.uio.no Sat Jun 19 15:10:08 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 19 Jun 1999 16:10:08 +0200 Subject: [XML-SIG] Inquiry: Python In-Reply-To: <376B9B85.2627308B@postino.up.ac.za> References: <376B9B85.2627308B@postino.up.ac.za> Message-ID: * Jacques Steyn | | How can one obtain the Python XML software? You can find it here: --Lars M. From r.hooft@euromail.net Sun Jun 20 10:30:04 1999 From: r.hooft@euromail.net (Rob Hooft) Date: Sun, 20 Jun 1999 11:30:04 +0200 (MZT) Subject: [XML-SIG] Inquiry: Python In-Reply-To: References: <376B9B85.2627308B@postino.up.ac.za> Message-ID: <14188.46236.604650.495677@octopus.chem.uu.nl> >>>>> "LMG" == Lars Marius Garshol writes: | | How can one obtain the Python XML software? LMG> You can find it here: LMG> Please note that the text on http://www.python.org/sigs/xml-sig/status.html still points to an older version. Maybe that page should be revised. 
Regards, Rob Hooft. -- ===== R.Hooft@EuroMail.net http://www.xs4all.nl/~hooft/rob/ ===== ===== R&D, Nonius BV, Delft http://www.nonius.nl/ ===== ===== PGPid 0xFA19277D ========================== Use Linux! ========= From Jeff Rush" If by chance you are running some form of Linux that supports the RPM packaging technology, you can grab an easy-to-install XML RPM at my web page: http://starship.python.net/crew/jrush/XML/ -Jeff Rush On Sat, 19 Jun 1999 15:30:45 +0200, Jacques Steyn wrote: >How can one obtain the Python XML software? >Thanks >Jacques From fredrik@pythonware.com Sun Jun 20 16:13:52 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Sun, 20 Jun 1999 17:13:52 +0200 Subject: [XML-SIG] ann: new sgmlop snapshot References: <199906190442.4130325.6@summit-research.com> Message-ID: <001401bebb2f$820cda50$f29b12c2@pythonware.com> subject says most of it; get your copy here: http://www.pythonware.com/madscientist/ coming soon: a lightweight "dom" layer on top of the new Element datatype, unicode support, and more. From Jeff Rush" Is this the same sgmlop as in the XML-SIG CVS? Have your most recent changes gotten into the CVS or should I add your tarball to my RPM explicitly in order to stay up-to-date re sgmlop? an-xml-newbie-not-sure-how-all-the-sig-pieces-are-managed-ly y'rs - Jeff On Sun, 20 Jun 1999 17:13:52 +0200, Fredrik Lundh wrote: >subject says most of it; get your copy here: > >http://www.pythonware.com/madscientist/ > >coming soon: a lightweight "dom" layer on >top of the new Element datatype, unicode >support, and more. > > > > >_______________________________________________ >XML-SIG maillist - XML-SIG@python.org >http://www.python.org/mailman/listinfo/xml-sig > From danda@netscape.com Sun Jun 20 23:00:09 1999 From: danda@netscape.com (Dan Libby) Date: Sun, 20 Jun 1999 15:00:09 -0700 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> <375CEEEE.5CDB4E8F@netscape.com> <375D8350.6F0F46C6@netscape.com> Message-ID: <376D6469.C3A561C8@netscape.com> This is a multi-part message in MIME format. --------------06E333EA1D62942978B070C3 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Is it possible to reference more than one external DTD? If so, how? I'm hoping that it is possible to include an external DTD from within the internal subset. This would basically allow for limited inheritance. -dan Dan Libby wrote: > > Here's an example: > > > > > > > > > > > ]> > > --------------06E333EA1D62942978B070C3 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------06E333EA1D62942978B070C3-- From larsga@ifi.uio.no Sun Jun 20 23:38:59 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 21 Jun 1999 00:38:59 +0200 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc In-Reply-To: <376D6469.C3A561C8@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> <375CEEEE.5CDB4E8F@netscape.com> <375D8350.6F0F46C6@netscape.com> <376D6469.C3A561C8@netscape.com> Message-ID: * Dan Libby | | Is it possible to reference more than one external DTD? If so, how? 
| I'm hoping that it is possible to include an external DTD from | within the internal subset. It is. Or you could do it from the external subset. | This would basically allow for limited inheritance. Hmmm. Some sub-typing would be possible in this way, yes. However, if you want to do that properly you should look at architectural forms. They're much simpler than they sound, and with Geir Ove's xmlarch they're also easy to use. See for more info. (xmlarch is also in the XML-SIG package.) --Lars M. From danda@netscape.com Mon Jun 21 08:09:51 1999 From: danda@netscape.com (Dan Libby) Date: Mon, 21 Jun 1999 07:09:51 +0000 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> <375CEEEE.5CDB4E8F@netscape.com> <375D8350.6F0F46C6@netscape.com> <376D6469.C3A561C8@netscape.com> Message-ID: <376DE53F.CF441F4@netscape.com> This is a multi-part message in MIME format. --------------8CA34653FD46448DBE65918D Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Lars Marius Garshol wrote: > * Dan Libby > | > | Is it possible to reference more than one external DTD? If so, how? > | I'm hoping that it is possible to include an external DTD from > | within the internal subset. > > It is. Or you could do it from the external subset. > Yeah, but how? I tried the following with xmlproc: %otherdtd; ]> This always gives me the error: Illegal construct at 5:3 I tried other variations on the same theme of course, but with similar results. Both files are in the correct path. -dan --------------8CA34653FD46448DBE65918D Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;501 E Middlefield Rd;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com title:Coder Surfer x-mozilla-cpt:;0 fn:Dan Libby end:vcard --------------8CA34653FD46448DBE65918D-- From larsga@ifi.uio.no Mon Jun 21 09:50:59 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 21 Jun 1999 10:50:59 +0200 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc In-Reply-To: <376DE53F.CF441F4@netscape.com> References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> <375CEEEE.5CDB4E8F@netscape.com> <375D8350.6F0F46C6@netscape.com> <376D6469.C3A561C8@netscape.com> <376DE53F.CF441F4@netscape.com> Message-ID: * Dan Libby | | Yeah, but how? I tried the following with xmlproc: | | | | %otherdtd; | ]> | | This always gives me the error: | Illegal construct at 5:3 This works perfectly for me with the following two files: %ext; ]> and in test2.dtd: This works for me with both the xmlproc in my CVS tree and the one in the XML-SIG CVS tree (which is 0.61), with both validating and non-validating parsing. Which version do you have? (Give me the CVS ID tag in dtdparser.py to be 100% sure that it's right.) --Lars M. From fredrik@pythonware.com Mon Jun 21 13:49:17 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 21 Jun 1999 14:49:17 +0200 Subject: [XML-SIG] ann: new sgmlop snapshot References: <199906201652.1434729.6@summit-research.com> Message-ID: <007901bebbe4$7a340640$f29b12c2@pythonware.com> Jeff Rush wrote: > Is this the same sgmlop as in the XML-SIG CVS? well, I haven't put it there... 
> Have your most recent changes gotten into the CVS or > should I add your tarball to my RPM explicitly in order to > stay up-to-date re sgmlop? beats me. I just write this stuff, I don't know what people do with it... > an-xml-newbie-not-sure-how-all-the-sig-pieces-are-managed-ly y'rs - Jeff no different from me, then ;-) Cheers /F From larsga@ifi.uio.no Mon Jun 21 14:08:42 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 21 Jun 1999 15:08:42 +0200 Subject: [XML-SIG] ann: new sgmlop snapshot In-Reply-To: <199906201652.1434729.6@summit-research.com> References: <199906201652.1434729.6@summit-research.com> Message-ID: * Jeff Rush | | Is this the same sgmlop as in the XML-SIG CVS? That one is dated 03.Dec.98, so I very much doubt it. AMK probably hasn't had the time to add it in yet. | Have your most recent changes gotten into the CVS or should I add | your tarball to my RPM explicitly in order to stay up-to-date re | sgmlop? I suppose that depends on whether you want your RPMs to reflect the latest XML-SIG package or the latest released software. Perhaps the best is if you get write access to the CVS and can help make things so that the RPM can actually be both at the same time. --Lars M. From akuchlin@mems-exchange.org Mon Jun 21 14:25:30 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Mon, 21 Jun 1999 09:25:30 -0400 (EDT) Subject: [XML-SIG] ann: new sgmlop snapshot In-Reply-To: <199906201652.1434729.6@summit-research.com> References: <199906201652.1434729.6@summit-research.com> Message-ID: <14190.15690.90340.112376@amarok.cnri.reston.va.us> Jeff Rush writes: >Is this the same sgmlop as in the XML-SIG CVS? Have >your most recent changes gotten into the CVS or should >I add your tarball to my RPM explicitly in order to stay >up-to-date re sgmlop? No, I haven't gotten around to updating the CVS tree. Nor have I gotten around to mailing out the passwords for write access to the CVS tree to various people; will try to do that today... -- A.M. Kuchling http://starship.python.net/crew/amk/ We are always living in the final days. What have you got? A hundred years or much, much less until the end of your world. -- From SIGNAL TO NOISE From danda@netscape.com Mon Jun 21 19:27:59 1999 From: danda@netscape.com (Dan Libby) Date: Mon, 21 Jun 1999 11:27:59 -0700 Subject: [XML-SIG] 2 Qs: encoding & entities with xmlproc References: <3754E5B9.96A9FD54@netscape.com> <375C9968.3574D93@netscape.com> <375CEEEE.5CDB4E8F@netscape.com> <375D8350.6F0F46C6@netscape.com> <376D6469.C3A561C8@netscape.com> <376DE53F.CF441F4@netscape.com> Message-ID: <376E842E.411A16F8@netscape.com> This is a multi-part message in MIME format. --------------0956DD3E644C13D7D92EBF9B Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Okay, stupid mistake. I have code in EntityResolver that maps a network address to a local address. When I moved it out of doctype into an entity, I forgot about that, so it was really pointing at nothing. It works now. One thing puzzles me though: the comment in EntityResolver indicates that resolveEntity will be called to resolve all external entities. Instead, I only see it called for the !DOCTYPE tag, not entities in the internal or external subsets. Also, that error message could use some work, "file not found" is easier to understand. ;-) I don't see any CVS tag in dtdparser.py, but the one in xmlproc.py is: $Id: xmlproc.py,v 1.7 1999/02/10 01:46:03 amk Exp $ -dan Lars Marius Garshol wrote: > * Dan Libby > | > | Yeah, but how? 
I tried the following with xmlproc: > | > | > | | > | %otherdtd; > | ]> > | > | This always gives me the error: > | Illegal construct at 5:3 > > This works perfectly for me with the following two files: > > > %ext; > ]> > > > > > and in test2.dtd: > > > > This works for me with both the xmlproc in my CVS tree and the one in > the XML-SIG CVS tree (which is 0.61), with both validating and > non-validating parsing. Which version do you have? (Give me the CVS > ID tag in dtdparser.py to be 100% sure that it's right.) > > --Lars M. > > _______________________________________________ > XML-SIG maillist - XML-SIG@python.org > http://www.python.org/mailman/listinfo/xml-sig --------------0956DD3E644C13D7D92EBF9B Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------0956DD3E644C13D7D92EBF9B-- From fredrik@pythonware.com Mon Jun 21 20:47:51 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 21 Jun 1999 21:47:51 +0200 Subject: [XML-SIG] ann: XML RPC client library for Python Message-ID: <01ab01bebc1f$b88e63a0$f29b12c2@pythonware.com> The xmlrpclib module is a client-side implementation of Userland's XML-RPC protocol (www.xmlrpc.com). This protocol allows you to transfer data between Python environments and applications written in for example Java and Perl. It it also fully supported by Userland's Frontier application, of course. Upcoming versions of Zope also speak XML RPC; see http://linux.userland.com/stories/storyReader$18 for more information. This release (0.9.8) uses the sgmlop XML parser if possible. With that parser in place, the XML-RPC packet decoder is up to 20 times faster than before. This release also includes sample XML-RPC servers based on SocketServer and Medusa. Get your copy from: http://www.pythonware.com/products/xmlrpc The most recent version of sgmlop can be downloaded from: http://www.pythonware.com/madscientist
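For anyone curious what a call looks like, a client session is only a few lines; a hypothetical example (the Server class follows the xmlrpclib documentation, and the server URL and method name are purely illustrative):

import xmlrpclib

# Connect to a (hypothetical) XML-RPC server and call a remote procedure;
# arguments and results are marshalled to and from XML automatically.
server = xmlrpclib.Server("http://betty.userland.com")
print server.examples.getStateName(41)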

xmlrpclib - XML RPC client library for Python (21-Jun-99) From akuchlin@mems-exchange.org Tue Jun 22 16:37:56 1999 From: akuchlin@mems-exchange.org (Andrew M. Kuchling) Date: Tue, 22 Jun 1999 11:37:56 -0400 (EDT) Subject: [XML-SIG] New CVS server, etc. Message-ID: <199906221537.LAA22882@amarok.cnri.reston.va.us> I've finally gotten around to actually informing people of the new CVS server. (This, some 2 weeks after Greg Stein actually set it up...) The new anonymous CVS server is at: :pserver:anoncvs@cvs.lyra.org:/home/cvsroot Set your CVSROOT environment variable to this, or use the -d flag to specify the server. Consult the anonymous CVS Web page at http://www.python.org/sigs/xml-sig/status.html for detailed instructions on checking out the development tree. Some of you will have received accounts and passwords for write access. You can simply check out a copy of the tree under your account, and then begin making modifications. There's a mailing list for check-in messages, xml-checkins@python.org: anyone can join it at: http://www.python.org/mailman/listinfo/xml-checkins/ You don't need to have write access to the tree to read the checkin mailing list. -- A.M. Kuchling http://starship.python.net/crew/amk/ Autumn, to me the most congenial of seasons: the University, to me the most congenial of lives. -- Robertson Davies, _The Rebel Angels_ From bottoni@cadlab.it Wed Jun 23 08:06:11 1999 From: bottoni@cadlab.it (Alessandro Bottoni) Date: Wed, 23 Jun 1999 09:06:11 +0200 Subject: [XML-SIG] (no subject) Message-ID: <004401bebd46$e036ca00$1f2b2bc1@cadlab.it> This is a multi-part message in MIME format. ------=_NextPart_000_0041_01BEBD57.A37809B0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable unsubscribe ------=_NextPart_000_0041_01BEBD57.A37809B0 Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable

------=_NextPart_000_0041_01BEBD57.A37809B0-- From danda@netscape.com Wed Jun 23 09:51:14 1999 From: danda@netscape.com (Dan Libby) Date: Wed, 23 Jun 1999 01:51:14 -0700 Subject: [XML-SIG] More entity stuff References: <004401bebd46$e036ca00$1f2b2bc1@cadlab.it> Message-ID: <3770A002.541C20F2@netscape.com> This is a multi-part message in MIME format. --------------AD18C212066C00108106D028 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Okay, so I have a DTD with a bunch of entities copied from the html 3.2 dtd. They look like this: When this is run through xmlproc (xmlval), the entities are ignored. sort of. If I change "¢" to "hello", then hello gets spit out. Further, if I change it to "&#162;" then "#162" gets spit out. This is actually okay with me... I'm just trying to preserve the entity for a browser's use anyway. However, it seems like a weird DTD. So my question: Is this a bug in the dtd parser, or is this correct behavior? If the latter, does my DTD hack seem like the right thing to do? thx. -dan Alessandro Bottoni wrote: > unsubscribe --------------AD18C212066C00108106D028 Content-Type: text/x-vcard; charset=us-ascii; name="danda.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for Dan Libby Content-Disposition: attachment; filename="danda.vcf" begin:vcard n:Libby;Dan x-mozilla-html:TRUE org:Netscape Communications adr:;;;Mountain View;CA;94043;USA version:2.1 email;internet:danda@netscape.com x-mozilla-cpt:;0 tel;home:650-964-5913 tel;work:650-937-2276 fn:Dan Libby end:vcard --------------AD18C212066C00108106D028-- From fredrik@pythonware.com Wed Jun 23 11:01:40 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 23 Jun 1999 12:01:40 +0200 Subject: [XML-SIG] ann: newschannel maker References: Message-ID: <003801bebd5f$648345a0$f29b12c2@pythonware.com> a while ago, Lars wrote: > I sat down yesterday and had a look at RSS, a format for news > headlines which is used by Slashdot, mozilla.org and Scripting News, > among others. It was very simple (a bit too simple, in fact), so I sat > down and made a simple RSS library and client in Python. This client > produces a web page when it is run. (I run it from cron.) potential news providers might be interested in my little "newschannel" tool, available from: http://www.pythonware.com/madscientist/ this tool reads an HTML document (from file or from a site), and extracts news items marked with special tags. it then generates perfectly valid RDF and Scripting- News 2.0 (*) news channel files. see the README and the mynews.py sample script for more information. http://www.pythonware.com/news.rdf http://www.pythonware.com/people/fredrik/news.rdf (replace .rdf with .xml for scriptingnews versions) *) the "fat" format supported by my.userland.com. see: http://my.userland.com/stories/storyReader$11 From r.hooft@euromail.net Wed Jun 23 15:51:11 1999 From: r.hooft@euromail.net (Rob Hooft) Date: Wed, 23 Jun 1999 16:51:11 +0200 (MZT) Subject: [XML-SIG] Bug in exception handling? Message-ID: <14192.62559.488258.872507@octopus.chem.uu.nl> I really have no clue where to start looking for the following problem: not well-formed Traceback (innermost last): File "/usr/local/nonius/app/scripts/comparehkl.py", line 459, in ? 
    reflections.Source(file1).SendTo(ref1.Reflection)
  File "/usr/local/nonius/app/interface/evaly.py", line 79, in SendTo
    parser.parseFile(projtls.myopen(self.filename,'r'))
  File "/usr/local/nonius/lib/python1.5/site-packages/xml/sax/drivers/drv_pyexpat.py", line 73, in parseFile
    self.__report_error()
  File "/usr/local/nonius/lib/python1.5/site-packages/xml/sax/drivers/drv_pyexpat.py", line 89, in __report_error
    self.err_handler.fatalError(saxlib.SAXParseException(msg,None,self))
  File "/usr/local/nonius/app/interface/evaly.py", line 20, in fatalError
    raise exception
xml.sax.saxlib.SAXParseException
zsh: segmentation fault comparehkl final.y

The routine that is causing this is:

def fatalError(self, exception):
    print exception.msg
    raise exception

How does this crash the python interpreter?

xml.sax.saxlib.SAXParseException
Program received signal SIGSEGV, Segmentation fault.
normal_updatePosition (enc=0x4020414c, ptr=0x4020c01c
, end=0x4020c180
, pos=0x81b26d8) at xmltok/xmltok_impl.c:1618 1618 switch (BYTE_TYPE(enc, ptr)) { (gdb) where #0 normal_updatePosition (enc=0x4020414c, ptr=0x4020c01c
, end=0x4020c180
, pos=0x81b26d8) at xmltok/xmltok_impl.c:1618 #1 0x401f0ffd in XML_GetCurrentLineNumber (parser=0x81b2590) at xmlparse/xmlparse.c:642 #2 0x401f028c in xmlparse_getattr (self=0x81b2420, name=0x810cd44 "ErrorLineNumber") at ./pyexpat.c:349 #3 0x806d60b in PyObject_GetAttrString (v=0x81b2420, name=0x810cd44 "ErrorLineNumber") at object.c:381 #4 0x806d729 in PyObject_GetAttr (v=0x81b2420, name=0x810cd30) at object.c:438 #5 0x80742ce in eval_code2 (co=0x819e6c8, globals=0x810eae8, locals=0x0, args=0x81057d8, argcount=1, kws=0x81057dc, kwcount=0, defs=0x0, defcount=0, owner=0x81afc38) at ceval.c:1380 #6 0x80748bd in eval_code2 (co=0x817a548, globals=0x8180060, locals=0x0, args=0x8176e8c, argcount=1, kws=0x8176e90, kwcount=0, defs=0x0, defcount=0, owner=0x81805e8) at ceval.c:1610 #7 0x80748bd in eval_code2 (co=0x817ab18, globals=0x8180060, locals=0x0, args=0x810d784, argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, owner=0x81805e8) at ceval.c:1610 #8 0x8075d60 in call_function (func=0x8180650, arg=0x810d778, kw=0x0) at ceval.c:2481 #9 0x8075942 in PyEval_CallObjectWithKeywords (func=0x816a298, arg=0x0, kw=0x0) at ceval.c:2319 #10 0x806d33e in PyObject_Str (v=0x810e668) at object.c:260 #11 0x805bab3 in PyErr_PrintEx (set_sys_last_vars=1) at pythonrun.c:816 #12 0x805b646 in PyErr_Print () at pythonrun.c:667 #13 0x805b3cc in PyRun_SimpleFile (fp=0x8098578, filename=0xbffff924 "scripts/comparehkl.py") at pythonrun.c:572 #14 0x805b061 in PyRun_AnyFile (fp=0x8098578, filename=0xbffff924 "scripts/comparehkl.py") at pythonrun.c:450 #15 0x804ef11 in Py_Main (argc=4, argv=0xbffff7fc) at main.c:286 #16 0x804e9b2 in main (argc=4, argv=0xbffff7fc) at python.c:12 The worst is: if I use only the first 4886 lines of the file, the "not well-formed" error message correctly reports the problem in line 7, column 37 of the file, but if I include 4887 or more, I get the above core dump. The 4887 line file is 131053 bytes, just under 128kB? Can I do something to fix this? Regards, Rob Hooft. -- ===== R.Hooft@EuroMail.net http://www.xs4all.nl/~hooft/rob/ ===== ===== R&D, Nonius BV, Delft http://www.nonius.nl/ ===== ===== PGPid 0xFA19277D ========================== Use Linux! ========= From jack@oratrix.nl Wed Jun 23 21:10:33 1999 From: jack@oratrix.nl (Jack Jansen) Date: Wed, 23 Jun 1999 22:10:33 +0200 Subject: [XML-SIG] Bug in exception handling? In-Reply-To: Message by r.hooft@euromail.net (Rob Hooft) , Wed, 23 Jun 1999 16:51:11 +0200 (MZT) , <14192.62559.488258.872507@octopus.chem.uu.nl> Message-ID: <19990623201038.1D9CF126BC4@oratrix.oratrix.nl> Rob, my first guess would be a mismatch in the Python build: if pyexpat is compiled as a dynamic library it may have been linked against an older version of Python, or one of the "critical" build options (refcount debugging and such) was different. This can also happen in statically built Pythons, as the dependencies aren't fully specified. As a first try I would do a "make clean" and rebuild the world. If the problem persists my next guess would be a buffer overflow. The "address" 0x4020c11f looks rather too much like ascii for my liking. 
From larsga@ifi.uio.no Wed Jun 23 23:04:04 1999 From: larsga@ifi.uio.no (Lars Marius Garshol) Date: 24 Jun 1999 00:04:04 +0200 Subject: [XML-SIG] More entity stuff In-Reply-To: <3770A002.541C20F2@netscape.com> References: <004401bebd46$e036ca00$1f2b2bc1@cadlab.it> <3770A002.541C20F2@netscape.com> Message-ID:

* Dan Libby
|
| Okay, so I have a DTD with a bunch of entities copied from the html 3.2
| dtd. They look like this:
|
|
|
| When this is run through xmlproc (xmlval), the entities are ignored.
| sort of.

This is a bug. I've seen it before, but thought I'd fixed it. The trouble is
that the entity is only one character long (after the character reference is
resolved) and that causes xmlproc to screw up for some reason. If you insert
a space in the declaration (before or after) the character reference the
problem goes away.

This turned out to be a rather subtle problem and finding a solution that
passed the regression test in a satisfying way took a while. The patches
below seem correct, though. Thanks for reporting this!

=== xmlproc.py
***************
*** 72,78 ****
  def do_parse(self):
      "Does the actual parsing."
      try:
!         while self.pos+1
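Dan's actual declarations did not survive in the archive above, but the
failing case Lars describes (a general entity whose replacement text is a
single character once the character reference is resolved) and the whitespace
workaround can be illustrated with a hypothetical declaration; the entity
name and code point below are made up for illustration and are not Dan's
originals.

    # Hypothetical DTD fragments illustrating the case Lars describes; these
    # are not Dan's original declarations, which were lost from the archive.
    broken = '<!ENTITY nbsp "&#160;">'       # one character after resolution: trips xmlproc
    workaround = '<!ENTITY nbsp " &#160;">'  # extra space keeps the replacement text longer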
References: <14192.62559.488258.872507@octopus.chem.uu.nl> <19990623201038.1D9CF126BC4@oratrix.oratrix.nl> Message-ID: <14194.5192.835922.637094@octopus.chem.uu.nl>

>>>>> "JJ" == Jack Jansen writes:

JJ> my first guess would be a mismatch in the Python build: if pyexpat is
JJ> compiled as a dynamic library it may have been linked against an older
JJ> version of Python, or one of the "critical" build options (refcount
JJ> debugging and such) was different.

It doesn't look like that.... What I (by accident) did find is that it has
something to do with Refcounting: The current code (drv_pyexpat) looks like:

    if not self.parser.Parse(fileobj.read(),1):
        self.__report_error()

If I replace that by

    buf=fileobj.read()
    if not self.parser.Parse(buf,1):
        self.__report_error()

The exception does not dump core.

The "by accident" I'm talking about is that I tried to eliminate the "sax"
layer from the code, because in the profile listing of a test parse, the top
routines were all in drv_pyexpat:

     21989    4.600    0.000    6.930    0.000 evaly.py:87(HandleReflection)
     21989    5.070    0.000    7.950    0.000 evaly.py:102(HandleEndReflection)
    117706    7.490    0.000    7.490    0.000 saxutils.py:86(__init__)
     21989    8.760    0.000   13.080    0.001 evaly.py:95(HandleIntensity)
     22733   10.130    0.000   16.400    0.001 evaly.py:90(HandleIndex)
    134166   12.920    0.000   12.920    0.000 saxutils.py:113(__getitem__)
    154259   14.020    0.000   14.020    0.000 evaly.py:55(characters)
    117706   14.190    0.000   22.140    0.000 evaly.py:63(endElement)
    117706   16.540    0.000   38.680    0.000 drv_pyexpat.py:45(endElement)
    117706   19.330    0.000   55.740    0.000 evaly.py:50(startElement)
    154259   28.090    0.000   42.110    0.000 drv_pyexpat.py:48(characters)
    117706   41.440    0.000  104.670    0.001 drv_pyexpat.py:38(startElement)
         1   47.530   47.530  232.990  232.990 drv_pyexpat.py:58(parseFile)

I think especially that:

    def startElement(self,name,attrs):
        at = {}
        for i in range(0, len(attrs), 2):
            at[attrs[i]] = attrs[i+1]

        self.doc_handler.startElement(name,saxutils.AttributeMap(at))

is very expensive, as I'm not normally using the attributes on most of the
elements. For me, a lazy version of AttributeMap would help a bit.

Bypassing sax altogether and using pyexpat directly reduces parsing time by
40%. 45 seconds on a "moderately sized" file (some of my clients have files
that are going to be 20 times bigger still, i.e. 60MB of XML) is still
considerably long, so I'll need to speed it up a bit more to make it really
usable.

Regards,

Rob Hooft.

-- ===== R.Hooft@EuroMail.net http://www.xs4all.nl/~hooft/rob/ ===== ===== R&D, Nonius BV, Delft http://www.nonius.nl/ ===== ===== PGPid 0xFA19277D ========================== Use Linux! =========

From fredrik@pythonware.com Thu Jun 24 13:10:49 1999 From: fredrik@pythonware.com (Fredrik Lundh) Date: Thu, 24 Jun 1999 14:10:49 +0200 Subject: [XML-SIG] Bug in exception handling? References: <14192.62559.488258.872507@octopus.chem.uu.nl><19990623201038.1D9CF126BC4@oratrix.oratrix.nl> <14194.5192.835922.637094@octopus.chem.uu.nl> Message-ID: <000b01bebe3a$99e38d00$f29b12c2@secret.pythonware.com>

Rob Hooft wrote:

> Bypassing sax altogether and using pyexpat directly reduces parsing
> time by 40%. 45 seconds on a "moderately sized" file (some of my
> clients have files that are going to be 20 times bigger still,
> i.e. 60MB of XML) is still considerably long, so I'll need to speed it
> up a bit more to make it really usable.

with a little luck, you might be able to use sgmlop instead (it cannot handle
all possible XML constructs yet, but it might work on your material).

here's a simple benchmark, run on an old 200 MHz pentium box, under NT:

    > dir big.xml
    99-06-24  13:47        62 078 532 big.xml

    > python benchxml.py big.xml
    sgmlop/null parser: 8.567 seconds; 7246131 bytes per second
    sgmlop/dummy parser: 51.943 seconds; 1195134 bytes per second
    ^C

(didn't have time to wait for the standard xmllib implementation to
finish...)

in this test, the null parser defines no parser callbacks at all, so it
basically measures the time it takes sgmlop to read the file from disk, and
to split it into elements. the dummy parser defines all python callbacks as
empty methods. as you see, it's quite expensive to call Python methods from
C. if you're going to DO things with the data, things get even worse... (but
a few hundred kb's per second on a similar box should be no problem).

get your copy from: http://www.pythonware.com/madscientist/
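For anyone who wants to repeat the comparison on their own data: benchxml.py
itself is not reproduced in this thread, so the sketch below is only a guess
at the same idea. The sgmlop calls used here (XMLParser, register, feed,
close) and the callback names are assumptions about its interface rather than
anything quoted above; the real package is at the madscientist URL Fredrik
gives.

    # Rough sketch in the spirit of Fredrik's "dummy parser" measurement.
    # The sgmlop interface used here (XMLParser, register, feed, close) and
    # the callback names are assumptions, not taken from the messages above.
    import time
    import sgmlop

    class DummyTarget:
        # every callback is an empty method, so the time measured is parsing
        # plus the cost of calling into Python
        def finish_starttag(self, tag, attrs): pass
        def finish_endtag(self, tag): pass
        def handle_data(self, data): pass

    def bench(filename, target):
        data = open(filename, 'rb').read()
        parser = sgmlop.XMLParser()
        parser.register(target)
        t0 = time.time()
        parser.feed(data)
        parser.close()
        return time.time() - t0

    print "dummy parser: %.3f seconds" % bench('big.xml', DummyTarget())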
| The "by accident" I'm talking about is that I tried to eliminate the | "sax" layer from the code, because in the profile listing of a test | parse, the top routines were all in drv_pyexpat: This isn't as surprising as it might be. I think the best solution would be to have the drivers for expat and sgmlop be written entirely in C. | I think especially that: | | def startElement(self,name,attrs): | at = {} | for i in range(0, len(attrs), 2): | at[attrs[i]] = attrs[i+1] | | self.doc_handler.startElement(name,saxutils.AttributeMap(at)) | | is very expensive, as I'm not normally using the attributes on most of | the elements. For me, a lazy version of AttributeMap would help a bit. I had some spare time while waiting for my advisor now, so I wrote one up for you. It's been tested a little, but not 100%. It's at: If you want an even lazier driver you can use this one: class LazyExpatDriver(SAX_expat): def __init__(self): SAX_expat.__init__(self) self.map=LazyAttributeMap([]) def startElement(self,name,attrs): self.map.list=attrs self.doc_handler.startElement(name,self.map) Feedback on speed differences between these three drivers (original, the one on the web and the one in this post) would be interesting. --Lars M. From r.hooft@euromail.net Thu Jun 24 15:16:43 1999 From: r.hooft@euromail.net (Rob Hooft) Date: Thu, 24 Jun 1999 16:16:43 +0200 (MZT) Subject: [XML-SIG] Bug in exception handling? In-Reply-To: References: <14192.62559.488258.872507@octopus.chem.uu.nl> <19990623201038.1D9CF126BC4@oratrix.oratrix.nl> <14194.5192.835922.637094@octopus.chem.uu.nl> Message-ID: <14194.15819.566865.167990@octopus.chem.uu.nl> >>>>> "LMG" == Lars Marius Garshol writes: LMG> Feedback on speed differences between these three drivers (original, LMG> the one on the web and the one in this post) would be interesting. devel[445]cubic%% ls -l final.y -rw-r--r-- 1 hooft hooft 3963562 Jun 24 13:04 final.y My sax-less version: Reading reflection file... 43.05 seconds The original: Reading reflection file... 74.37 seconds The Web version (without activating the lazy code): Reading reflection file... 348.88 seconds Oops. something else changed? I made up a lazy version myself, using the old 0.10 version of the file, and a lazy map that is a bit less lazy than the one you made up. Reading reflection file... 71.87 seconds Conclusion: this is not the real problem, making up the dictionary is not so expensive in comparison with the rest of the SAX layer. For completeness, here is my added code: class LazyExpatDriver(SAX_expat): def startElement(self,name,attrs): self.doc_handler.startElement(name,LazyAttributeMap(attrs)) # --- A lazy attribute map # This avoids the costly conversion from a list to a hash table if the attribute # list is not needed anywhere. 
class LazyAttributeMap:
    """An implementation of AttributeList that takes a flat (attr, value)
    list and uses it to implement the AttributeList interface."""

    def __init__(self, list):
        self.lst=list
        self.map=None

    def _mkmap(self):
        self.map={}
        for i in range(0,len(self.lst),2):
            self.map[self.lst[i]]=self.lst[i+1]

    def getLength(self):
        return len(self.lst)/2

    def getName(self, i):
        if self.map is None: self._mkmap()
        try:
            return self.map.keys()[i]
        except IndexError,e:
            return None

    def getType(self, i):
        return "CDATA"

    def getValue(self, i):
        if self.map is None: self._mkmap()
        try:
            if type(i)==types.IntType:
                return self.map[self.getName(i)]
            else:
                return self.map[i]
        except KeyError,e:
            return None

    def __len__(self):
        return len(self.lst)/2

    def __getitem__(self, key):
        if self.map is None: self._mkmap()
        if type(key)==types.IntType:
            return self.map.keys()[key]
        else:
            return self.map[key]

    def items(self):
        if self.map is None: self._mkmap()
        return self.map.items()

    def keys(self):
        if self.map is None: self._mkmap()
        return self.map.keys()

    def has_key(self,key):
        if self.map is None: self._mkmap()
        return self.map.has_key(key)

    def get(self, key, alternative):
        """Return the value associated with attribute name; if it is not
        available, then return the alternative."""
        if self.map is None: self._mkmap()
        return self.map.get(key, alternative)

# ---

def create_parser():
    #return SAX_expat()
    return LazyExpatDriver()

From r.hooft@euromail.net Thu Jun 24 14:09:18 1999 From: r.hooft@euromail.net (Rob Hooft) Date: Thu, 24 Jun 1999 15:09:18 +0200 (MZT) Subject: [XML-SIG] Bug in exception handling? In-Reply-To: References: <14192.62559.488258.872507@octopus.chem.uu.nl> <19990623201038.1D9CF126BC4@oratrix.oratrix.nl> <14194.5192.835922.637094@octopus.chem.uu.nl> Message-ID: <14194.11774.643335.39150@octopus.chem.uu.nl>

>>>>> "LMG" == Lars Marius Garshol writes:

LMG> * Rob Hooft
LMG> |
LMG> | What I (by accident) did find is that it has something to do with
LMG> | Refcounting: The current code (drv_pyexpat) looks like:
LMG> |
LMG> |     if not self.parser.Parse(fileobj.read(),1):
LMG> |         self.__report_error()
LMG> |
LMG> | If I replace that by
LMG> |
LMG> |     buf=fileobj.read()
LMG> |     if not self.parser.Parse(buf,1):
LMG> |         self.__report_error()
LMG> |
LMG> | The exception does not dump core.

LMG> Aha! Thanks for this observation. I've checked your patch into my
LMG> driver source now, so it will be in the next release.

Sounds like a hack to me. Shouldn't it be solved by INCREF'ing the buffer
somewhere in the C code to pyexpat? E.g. where the exception code makes
reference to the buffer? I didn't look at the code myself, so I don't know
whether it is particularly difficult to find.

It would also be nice if the pyexpat parser would report the correct line
number for a problem even if the file is parsed in pieces (you may have
noticed that I was talking about 60MB files before, it is not really nice to
suck those into a single string).

Rob Hooft.

-- ===== R.Hooft@EuroMail.net http://www.xs4all.nl/~hooft/rob/ ===== ===== R&D, Nonius BV, Delft http://www.nonius.nl/ ===== ===== PGPid 0xFA19277D ========================== Use Linux! =========
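Parsing in pieces, as Rob suggests, only needs the Parse(data, isfinal) call
the driver already uses; whether the reported error line numbers stay correct
across chunks is exactly his open question. A rough sketch follows, assuming
the module's ParserCreate() constructor; the chunk size and the error
handling are illustrative, not taken from the thread.

    # Rough sketch, not from the thread: feed pyexpat a 60MB file in chunks
    # instead of one huge string. Only Parse(data, isfinal) is used, as in
    # the driver code quoted above; ParserCreate() is assumed, and chunk
    # size and error handling are illustrative.
    import pyexpat

    def parse_in_chunks(fileobj, chunksize=64*1024):
        parser = pyexpat.ParserCreate()
        # assign the document/element handlers to the parser here
        while 1:
            buf = fileobj.read(chunksize)
            if not buf:
                break
            if not parser.Parse(buf, 0):
                return parser    # caller can inspect ErrorLineNumber etc.
        parser.Parse('', 1)
        return parser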
From r.hooft@euromail.net Thu Jun 24 13:37:23 1999 From: r.hooft@euromail.net (Rob Hooft) Date: Thu, 24 Jun 1999 14:37:23 +0200 (MZT) Subject: [XML-SIG] Bug in exception handling? In-Reply-To: <000b01bebe3a$99e38d00$f29b12c2@secret.pythonware.com> References: <14192.62559.488258.872507@octopus.chem.uu.nl> <19990623201038.1D9CF126BC4@oratrix.oratrix.nl> <14194.5192.835922.637094@octopus.chem.uu.nl> <000b01bebe3a$99e38d00$f29b12c2@secret.pythonware.com> Message-ID: <14194.9859.228589.169512@octopus.chem.uu.nl>

>>>>> "FL" == Fredrik Lundh writes:

FL> Rob Hooft wrote:
>> Bypassing sax altogether and using pyexpat directly reduces parsing
>> time by 40%. 45 seconds on a "moderately sized" file (some of my
>> clients have files that are going to be 20 times bigger still,
>> i.e. 60MB of XML) is still considerably long, so I'll need to speed it
>> up a bit more to make it really usable.

FL> with a little luck, you might be able to use sgmlop instead
FL> (it cannot handle all possible XML constructs yet, but it
FL> might work on your material).

FL> here's a simple benchmark, run on an old 200 MHz pentium
FL> box, under NT:

>> dir big.xml
FL> 99-06-24  13:47        62 078 532 big.xml

>> python benchxml.py big.xml
FL> sgmlop/null parser: 8.567 seconds; 7246131 bytes per second
FL> sgmlop/dummy parser: 51.943 seconds; 1195134 bytes per second
FL> ^C

I'm using a 200MHz pentium as well, but I think the biggest problem is the
kind of data I'm handling. It is mostly numerical. We're still working on the
DTD, but I can show you a typical fragment:

... ...

I think a large part of my time with any parser will be spent in atof() and
atoi().... I'll try sgmlop as soon as I can.

Rob

-- ===== R.Hooft@EuroMail.net http://www.xs4all.nl/~hooft/rob/ ===== ===== R&D, Nonius BV, Delft http://www.nonius.nl/ ===== ===== PGPid 0xFA19277D ========================== Use Linux! =========

From paul@prescod.net Mon Jun 28 17:00:50 1999 From: paul@prescod.net (Paul Prescod) Date: Mon, 28 Jun 1999 12:00:50 -0400 Subject: [XML-SIG] [Fwd: Re: parsers for Palm?] Message-ID: <37779C32.780A9134@prescod.net>

> Expat 1.1 added a compile-time option to allow a smaller (and slightly
> slower) parser. With this option on Win32 it compiles into a single DLL
> that compresses to 23k. Is that too large for Palm?
>
> James

Wow. I didn't notice that Expat was so small now.

I think that we should certainly move for Python 1.6 to include eXpat and
easysax. At compile time, Unix Python users could choose whether they want
small or fast. For Windows we could just make both DLLs available (though
only the small one would be built into the distribution).

23K for something as significant as massively-accelerated XML seems like a
small price. Note that this 23k includes full Unicode support and is
completely ANSI C, just like Python. Also, I understand that it now supports
internal and external, general and parameter entities. In other words, almost
everything except validation!

Opinions?

Paul Prescod
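Rob's remark a couple of messages up about atof() and atoi() refers to the
conversion of mostly-numeric character data after parsing, a cost that stays
the same whichever parser delivers the text. Below is a rough sketch of such
a handler; the SAX 1.0 style characters(ch, start, length) signature and the
saxlib.DocumentHandler base class are assumptions, and the element name
'intensity' is made up, since Rob's real DTD fragment is not in the archive.

    # Rough sketch of the kind of handler Rob describes: the document is
    # mostly numbers, so much of the work is string-to-float conversion.
    # The element name 'intensity' is hypothetical, and DocumentHandler and
    # the characters() signature are assumed from the SAX 1.0 interface.
    import string
    from xml.sax import saxlib

    class ReflectionHandler(saxlib.DocumentHandler):
        def __init__(self):
            self.chars = []
            self.values = []

        def characters(self, ch, start, length):
            self.chars.append(ch[start:start+length])

        def endElement(self, name):
            if name == 'intensity':
                # this is where the atof() time goes with any parser
                self.values.append(string.atof(string.join(self.chars, '')))
            self.chars = []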