[XML-SIG] Big Bug? (was: Pretty-printing DOM trees)

Sun, 24 Jan 1999 04:41:14 -0800 (PST)

On Sun, 24 Jan 1999, Christian Tismer wrote:
> Well, I agree. It should not encourage bad authoring.
> But I, as a complete newbie to a SIG that is evolving quickly,
> was struggling with a lot of code, many parsers, and so on.
> I think others will get into at least as much trouble as I did.

Well, that was simply because the errors weren't reported properly. That
can be fixed.
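
To make "reported properly" concrete, here is a minimal sketch, assuming
Python's standard xml.parsers.expat module. The helper name
describe_xml_error is hypothetical and is not part of the XML package
discussed in this thread. The parser stays strict, but the caller is told
where and why the document was rejected:

```python
import xml.parsers.expat


def describe_xml_error(text):
    """Return None if text is well-formed XML, else a readable error message.

    Hypothetical helper for illustration only.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        # Translate expat's numeric error code into a short description.
        reason = xml.parsers.expat.ErrorString(err.code)
        lines = text.splitlines()
        # Show the offending source line when expat's line number is usable.
        if 0 < err.lineno <= len(lines):
            source = lines[err.lineno - 1]
        else:
            source = ""
        return "line %d, column %d: %s\n    %s" % (
            err.lineno, err.offset, reason, source)
    return None


if __name__ == "__main__":
    print(describe_xml_error("<doc><item>text</doc>"))
```

Nothing is repaired here. The message only gives the author of the file
enough detail to fix it.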

> Furthermore, the file which I wanted to inspect wasn't mine.
> What should I do if I'm confronted with foreign XML files
> which have some flaws, and the parser doesn't make it through
> them? The argument is fine for me, but in this case I have
> no chance.

Push back against where the file came from. What if somebody sent you a
bad executable? Do you try to correct it? What if they send a bad MSFT
Word file? Do you try to correct it? Makefiles with spaces instead of
tabs? crontab files with a missing column? etc. etc.

Well, the same goes for XML. If it is bad, then you ask for a correct one.
Why should XML be any different from the multitude of documents that you
deal with every day?

> For my custom work, it would be best to have a parser which
> *does* complain about an error, but also repairs easy cases
> like this. This gives me a chance to work with the file,
> inspect it and complain to my customer.
> This is easy after all since I now know enough of
> the XML package and can help myself.

By default, it should not correct it. That simply continues to encourage
poor XML authoring. As a programmer, if you want to try to auto-correct,
then okay, but I would not recommend it.

> The remaining question is: How should faulty XML be handled
> at all? There are enough examples where you cannot simply
> reject the document. You need to read it.
> Does it make sense to think of a "correcting"
> parser which turns a bad document into something well-formed
> which can be inspected with an XML browser, together with
> some error-annotation tags?

No. No. No. No....

HTML is a huge mess because people started writing parsers that were
flexible and would correct things for you. Go try to write an HTML parser
that works against all the stuff out on the Internet. It is frightening
how difficult that is. There is just so much crap out there because people
said, "well, we can just correct that for them." Mismatched tags. Missing
quotes. Illegal characters. Missing close brackets. Simply crap.

With XML, the designers said, "No way. The document has to be correct, or
it gets rejected. Tough shit for the authors of bad documents."
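
As an illustration of that rule, here is a small sketch, assuming Python's
standard xml.parsers.expat module. The is_well_formed helper and the sample
strings are made up for this example. Each kind of HTML-style sloppiness
listed above is rejected outright, with no attempt at repair:

```python
import xml.parsers.expat


def is_well_formed(text):
    """Return True only if text parses as well-formed XML (hypothetical helper)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # True marks this as the final chunk, so unclosed tokens are errors.
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Made-up samples covering the kinds of errors common in HTML.
samples = {
    "mismatched tags": "<p><b>bold</p></b>",
    "missing quotes": "<a href=index.html>home</a>",
    "illegal character": "<p>a & b</p>",
    "missing close bracket": "<p>text</p",
    "well-formed": '<p><b>bold</b> <a href="index.html">home</a></p>',
}

if __name__ == "__main__":
    for label, text in samples.items():
        verdict = "accepted" if is_well_formed(text) else "rejected"
        print("%-22s %s" % (label, verdict))
```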

Yes, I'm rather fascist on this one :-). I simply cannot condone or
recommend *any* allowance of flexibility in parsers. That will just lead
us back to the horrible situation that we are in now with HTML.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/