[XML-SIG] Proposal: command-line interface to parser

Tue, 09 Jul 2002 07:54:37 +0000

>From: Uche Ogbuji <uche.ogbuji@fourthought.com>
>To: "Matt G." <matt_g_@hotmail.com>
>CC: xml-sig@python.org
>Subject: Re: [XML-SIG] Proposal: command-line interface to parser
>Date: Mon, 08 Jul 2002 22:15:18 -0600
>
> > A quick search (i.e. 'find PyXML-0.7.1 -perm +111') doesn't turn up any
> > general-purpose applications of the sort I'm looking for - sorry if it's
> > there and I missed it (but why not 'chmod +x' it?).
> >
> > Anyhow, I think it'd be immensely useful to include a command-line tool that
> > performs at least the following functions:
> >
> >   * XML validation - returns a nonzero error code and
> >     pretty/useful error message if validation fails
>
>The 4xml command in 4Suite CVS does this, except for the error code return,
>which is a good idea.  Do you have some suggestions for good error codes to
>use?

I don't care about actual values, beyond zero == success and nonzero == 
failure.  This is very important for writing scripts & makefiles.  I even 
have my prompt string configured to show me the return code of the last 
command (but then I'm the kind of nut who has his username, pwd, machine 
name, and the number of running and stopped jobs in his xterm titlebar).

Some of this should be fairly obvious, but here's my wish list, for return 
code behaviors:
* nonzero should always be returned, if the input is not well-formed
* nonzero should be returned, if validation is enabled, and the document
  fails to validate
* no output (i.e. XML written to either stdout or a file) should be produced,
  if the program executes with a nonzero error code.  If an output file is
  written, it should be deleted, before the program exits.
* a switch should exist, for treating warnings as errors.  By default,
  warnings should NOT cause the program to exit with a nonzero return code.
  If the switch to treat them as errors is provided, they would cause the
  program to (eventually) terminate, with a nonzero return code.
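
To make the wish list concrete, here is a rough sketch of the return-code 
logic in Python.  It only checks well-formedness (via the standard expat 
binding), since validation depends on whichever validating parser the tool 
uses.  The names check_well_formed and exit_code, and the idea of passing 
warnings around as a plain list, are made up for illustration and are not 
taken from 4xml or PyXML.

```python
import xml.parsers.expat

EXIT_OK = 0
EXIT_FAILURE = 1  # any nonzero value would do; only zero vs. nonzero matters


def check_well_formed(data):
    """Return None if data (bytes or str) is well-formed, else a message."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as exc:
        # str(exc) already includes the line and column of the problem.
        return str(exc)
    return None


def exit_code(error, warnings, warnings_as_errors=False):
    """Map the outcome of a run onto a process exit status.

    error    -- None, or a message for a well-formedness/validation failure
    warnings -- list of warning messages (hypothetical; depends on the parser)
    """
    if error is not None:
        return EXIT_FAILURE
    if warnings and warnings_as_errors:
        return EXIT_FAILURE
    return EXIT_OK
```

A script would print the message to stderr and finish with something like 
sys.exit(exit_code(error, warnings, warnings_as_errors)), so that 'make' and 
shell scripts can test the result.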

The point about not producing output is especially important, when used from 
a Makefile.  If this is not possible, then the exit message should probably 
even say "bad output written to stdout", so that the user knows to make sure 
that the output is cleaned up, if it's either redirected to a file or piped 
into any other commands.
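
For the file case, one way to guarantee that no bad output is left behind is 
to write to a temporary file and only move it into place on success.  This 
is just a sketch of that idea; write_output_atomically and its 'produce' 
callback are hypothetical names, and it does not help when the output goes 
to stdout.

```python
import os
import tempfile


def write_output_atomically(path, produce):
    """Call produce(fileobj); keep the result at `path` only on success."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target, so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            produce(out)
        os.replace(tmp_path, path)
    except BaseException:
        # Any failure (including Ctrl-C): remove the partial output.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If produce() raises, the target file is never created (or, if it already 
existed, is left untouched), which is the behavior a Makefile rule needs.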

BTW, I assume all your options are 'getopt'-style (i.e. multi-letter options 
begin with '--', while single letter, non-parameterized options use '-' and 
can be combined).

I have a neat python module, built on top of getopt.py, that lets you 
specify a short option, long option, and description.  It handles '--help' 
(though it gives you the opportunity to provide text to go before & after 
the options summary).  This allows you to centralize your management of 
option listing & documentation, and could even tie into an automated system 
to generate user documentation of your commandline interface.  If you're 
interested, check out sourceforge.net/projects/xml-extractor/ (you could 
either find it in lib/cmdopts.py, or just download xlf_to_wfx.tar.gz).

Furthermore, on the usability front, I believe that any output file argument 
should be supplied via a '--output' or '-o' option.  In fact, the only 
non-option file argument(s) should be input files.  Taking an output file as 
a non-option argument (as 'tar' does) is particularly pernicious, since it 
could result in a file getting clobbered, if the user isn't careful or 
knowledgeable.
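
To illustrate the option-table idea and the '-o' convention together, here 
is a small sketch built on the standard getopt module.  It is NOT the 
interface of cmdopts.py; the table layout, the function names, and the 
'validate' and 'warnings-as-errors' options are invented for the example.

```python
import getopt

# One row per option: (short, long, takes_argument, description).
OPTIONS = [
    ("h", "help", False, "show this help and exit"),
    ("v", "validate", False, "validate the document"),
    ("W", "warnings-as-errors", False, "treat warnings as errors"),
    ("o", "output", True, "write output to the named file"),
]


def getopt_specs(options):
    """Build the short and long option specs that getopt expects."""
    short = "".join(s + (":" if arg else "") for s, _, arg, _ in options)
    long_ = [l + ("=" if arg else "") for _, l, arg, _ in options]
    return short, long_


def usage(options, before="", after=""):
    """Generate the --help text from the same table."""
    lines = [before] if before else []
    for s, l, arg, desc in options:
        flag = "-%s, --%s%s" % (s, l, "=ARG" if arg else "")
        lines.append("  %-30s %s" % (flag, desc))
    if after:
        lines.append(after)
    return "\n".join(lines)


def parse(argv, options):
    """Return ({long_name: value_or_True}, [input files])."""
    short, long_ = getopt_specs(options)
    opts, args = getopt.gnu_getopt(argv, short, long_)
    by_flag = {}
    for s, l, arg, _ in options:
        by_flag["-" + s] = (l, arg)
        by_flag["--" + l] = (l, arg)
    values = {}
    for flag, value in opts:
        name, takes_arg = by_flag[flag]
        values[name] = value if takes_arg else True
    return values, args
```

With this, parse(["-vW", "-o", "out.xml", "in.xml"], OPTIONS) yields the 
three options plus ["in.xml"] as the only non-option argument, so the output 
file can never be confused with an input file.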

> >   * Listing URIs of all external entities referenced (defined
> >     would be okay, too, but only as an option)
>
>Doesn't do this yet, but if you post a feature request on the 4Suite SF
>feature request tracker, I can try to add it soon.

If you want to list only the entities that are actually referenced (which I 
think is the most reasonable behavior), then ENTITY and ENTITIES-type 
attributes make this slightly more complicated (though it shouldn't be much 
trouble, if you have a parsed representation of the DTD lying around).  For 
output, the primary behavior should be to resolve the entities to their 
ultimate SYSTEM IDs; however, it might be a nice feature to have the option 
of only listing the PUBLIC IDs, or whichever ID is listed in the ENTITY 
definition.
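
As a rough idea of how the referenced-only listing could work, here is a 
sketch using Python's standard expat binding rather than the 4Suite parser. 
It assumes the declarations live in the internal DTD subset, reports SYSTEM 
IDs exactly as written (no resolution against a base URI or catalog), and 
never fetches the entities.  The function name is made up for the example.

```python
import xml.parsers.expat


def list_external_entities(data):
    """Return (declared, referenced).

    declared   -- dict: general entity name -> SYSTEM ID (external only)
    referenced -- set of SYSTEM IDs actually used, either by an entity
                  reference in content or by an ENTITY/ENTITIES attribute
    """
    declared = {}
    entity_attrs = set()  # (element, attribute) pairs typed ENTITY/ENTITIES
    referenced = set()

    def entity_decl(name, is_param, value, base, system_id, public_id, notation):
        if system_id is not None and not is_param:
            declared.setdefault(name, system_id)  # first declaration wins

    def attlist_decl(elname, attname, type_, default, required):
        if type_ in ("ENTITY", "ENTITIES"):
            entity_attrs.add((elname, attname))

    def external_ref(context, base, system_id, public_id):
        referenced.add(system_id)
        return 1  # continue parsing without loading the entity

    def start_element(name, attrs):
        for attname, value in attrs.items():
            if (name, attname) in entity_attrs:
                for entity_name in value.split():
                    if entity_name in declared:
                        referenced.add(declared[entity_name])

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = entity_decl
    parser.AttlistDeclHandler = attlist_decl
    parser.ExternalEntityRefHandler = external_ref
    parser.StartElementHandler = start_element
    parser.Parse(data, True)
    return declared, referenced
```

Listing 'declared' instead of 'referenced' would give the optional 
"all defined entities" mode; reporting PUBLIC IDs would mean keeping the 
public_id argument as well.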

>You're right that these features are very handy, which is why I added 4xml :-)

Wow - this would be VERY cool!  Thanks for the reply & cooperation, and I'd 
be glad to do whatever testing or make any other contributions I can!!

I'm out of time, just now, but I'll check out the CVS 4xml ASAP!

Matt Gruenke
