XML Considered Harmful

Wed Sep 22 10:40:48 EDT 2021

On 21/09/2021 16.21, Pete Forman wrote:
> "Michael F. Stemper" <michael.stemper at gmail.com> writes:
>> On 21/09/2021 13.49, alister wrote:
>>> On Tue, 21 Sep 2021 13:12:10 -0500, Michael F. Stemper wrote:
>> It's my own research, so I can give myself the data in any format that I
>> like.
>>
>>> as far as I can see the main issue with XML is bloat, it tries to do
>>> too many things & is a very verbose format, often the quantity of
>>> mark-up can easily exceed the data contained within it. other formats
>>> such as JSON & CSV have far less overhead, although again not always
>>> suitable.
>>
>> I've heard of JSON, but never done anything with it.
> 
> Then you should certainly try to get a basic understanding of it. One
> thing JSON shares with XML is that it is best left to machines to
> produce and consume. Because both can be viewed in a text editor there
> is a common misconception that they are easy to edit. Not so, commas are
> a common bugbear in JSON and non-trivial edits in (XML unaware) text
> editors are tricky.

Okay, after playing around with the example in Lubanovic's book[1]
I've managed to create a dict of dicts of dicts and write it to a
json file. It seems to me that this is how json handles hierarchical
data. Is that understanding correct?

Is this then the process that I would use to create a *.json file
to provide data to my various programs? Copy and paste the current
hard-coded assignment statements into the REPL, use json.dump(dict, fp)
to write it to a file, and then read the file into each program
with json.load(fp)? (Actually, I'd write a function to do that,
just as I would with XML.)
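
A minimal sketch of that round trip, for illustration only. The
generator name, fuel values, curve points, file name, and the helper
names save_data/load_data are all hypothetical; only json.dump and
json.load come from the discussion above.

```python
import json
import os
import tempfile

# Hypothetical data: every name and number here is illustrative only.
generators = {
    "Unit1": {
        "fuel": {"name": "coal", "uom": "ton",
                 "heat_content": 24.0, "price": 50.0},
        # Each curve is a list of [x, y] pairs. Lists are used rather
        # than tuples because JSON arrays always load back as lists.
        "curves": {"base": [[100, 9.5], [200, 10.2], [300, 11.4]]},
    }
}


def save_data(data, path):
    """Write the nested dict to a JSON file."""
    with open(path, "w") as fp:
        json.dump(data, fp, indent=2)


def load_data(path):
    """Read the nested dict back from a JSON file."""
    with open(path) as fp:
        return json.load(fp)


path = os.path.join(tempfile.gettempdir(), "generators.json")
save_data(generators, path)
restored = load_data(path)
```

Each program would then call something like load_data() once at
startup instead of carrying its own hard-coded assignments.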

> Consider what overhead you should worry about. If you are concerned
> about file sizes then XML, JSON and CSV should all compress to a similar
> size.

Not a concern at all for my current application.

>> How does CSV handle hierarchical data? For instance, I have
>> generators[1], each of which has a name, a fuel and one or more
>> incremental heat rate curves. Each fuel has a name, UOM, heat content,
>> and price. Each incremental cost curve has a name, and a series of
>> ordered pairs (representing a piecewise linear curve).
>>
>> Can CSV files model this sort of situation?
> 
> The short answer is no. CSV files represent spreadsheet row-column
> values with nothing fancier such as formulas or other redirections.

Okay, that was what I suspected.

> CSV is quite good as a lowest common denominator exchange format. I say
> quite because I would characterize it by 8 attributes and you need to
> pick a dialect such as MS Excel which sets out what those are. XML and
> JSON are controlled much better. You can easily verify that you conform
> to those and guarantee that *any* conformant parser can read your
> content. XML is more powerful in that respect than JSON in that you can
> define and enforce schemas. In your case the fuel name, UOM, etc. can be
> validated with standard tools.

Yeah, validating against a DTD is pretty easy, since lxml.etree does all
of the work.

> In JSON all that checking is entirely handled by the consuming
> program(s).

Well, the consumer's (almost) always going to need to do *some*
validation. For instance, as far as I can tell, a DTD can't specify
that there must be at least two of a particular item.

The designers of DTD seem to have taken the advice of MacLennan[2]:
   "The only reasonable numbers are zero, one, or infinity."

Which is great until you need to make sure that you have enough
points to define at least one line segment.
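
A sketch of what that consumer-side check might look like, whichever
file format the points arrive in. The function name check_curve is
hypothetical, and the requirement that x values strictly increase is
an assumption about how the ordered pairs are meant to be used; it is
not stated above.

```python
def check_curve(points):
    """Return points unchanged if they define at least one line segment.

    Raises ValueError otherwise. `points` is a sequence of (x, y) pairs.
    """
    if len(points) < 2:
        raise ValueError(
            "a piecewise linear curve needs at least 2 points, got %d"
            % len(points)
        )
    # Assumption: the pairs are ordered by strictly increasing x.
    xs = [x for x, _ in points]
    if any(b <= a for a, b in zip(xs, xs[1:])):
        raise ValueError("x values must be strictly increasing")
    return points


curve = check_curve([(100, 9.5), (200, 10.2)])
```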

>>> As in all such cases it is a matter of choosing the most appropriate tool
>>> for the job in hand.
>>
>> Naturally. That's what I'm exploring.
> 
> You might also like to consider HDF5. It is targeted at large volumes of
> scientific data and its capabilities are well above what you need.

Yeah, I won't be looking at more than five or ten generators at most. A
small number is enough to confirm or refute the behavior that I'm
testing.

[1] _Introducing Python: Modern Computing in Simple Packages_,
Second Release, (c) 2015, Bill Lubanovic, O'Reilly Media, Inc.
[2] _Principles of Programming Languages: Design, Evaluation,
and Implementation_, Second Edition, (c) 1987, Bruce J. MacLennan,
Holt, Rinehart, & Winston
-- 
Michael F. Stemper
No animals were harmed in the composition of this message.