XML Considered Harmful

Sat Sep 25 17:39:29 EDT 2021

Michael,

I don't care what you choose. Whatever works is fine for an internal use.

But is the data scheme you share representative of your actual application?

From what I see below, unless the number of "point" variables is not always
exactly four, the application might be handled well by any format that
handles rectangular data, perhaps even CSV.

I mean, anything like a data.frame can contain data columns like
p1,p2,p3,p4 and a categorical one like IHRcurve_name.
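
As a rough sketch of what I mean, assuming every curve really does have
exactly four points, the curves quoted below could sit in one row each. The
column names (P1..P4, IHR1..IHR4) and the read_curves() helper are made up
for illustration, not anything from your application:

```python
import csv
import io

# Hypothetical flat layout: one row per curve, with the four points
# spread across fixed columns (P1..P4 and IHR1..IHR4).
CSV_TEXT = """generator,IHRcurve_name,P1,IHR1,P2,IHR2,P3,IHR3,P4,IHR4
Skunk Creek 1,normal,63,8.513,105,8.907,241,9.411,455,10.202
Skunk Creek 1,constrained,63,8.514,103,9.022,223,9.511,415,10.102
"""


def read_curves(text):
    """Return {(generator, curve_name): [(P, IHR), ...]} from CSV text."""
    curves = {}
    for row in csv.DictReader(io.StringIO(text)):
        # Rebuild the list of points from the fixed set of columns.
        points = [
            (float(row[f"P{i}"]), float(row[f"IHR{i}"]))
            for i in range(1, 5)
        ]
        curves[(row["generator"], row["IHRcurve_name"])] = points
    return curves


curves = read_curves(CSV_TEXT)
```

This only works while the number of points is fixed. Once a curve can have
three or seven points, the rectangular layout starts to fight you.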

Or do you need more variability, such as an undetermined number of
similar units, in ways that might require more flexibility or be done more
efficiently another way?

MOST of the discussion I am seeing here seems peripheral to getting you what
you need for your situation, and much of what is suggested may take a while
to learn to use properly. Are you planning on worrying about how to ship your
data encrypted, for example? Any file format you use for storage can
presumably be encrypted, sent, and decrypted if that matters.

So, yes, from an abstract standpoint we can discuss the merits of various
approaches. If it matters that humans can deal with your data in a file or
that it be able to be imported into a program like EXCEL, those are
considerations. But if not, there are quite a few relatively binary formats
where your program can save a snapshot of the data into a file and read it
back in next time. I often do that in another language that lets me share
variables, including nested components such as the complex structures that
come out of a statistical analysis or the components needed to make one or
more graphs later. If you write the program that creates the darn things as
well as the one that later reads them back in, you can do what you want.
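
In Python, one way to do that kind of snapshot is the standard pickle
module. This is only a sketch: the nested dictionary layout, the file name
and the two helper functions are my own assumptions, not something from your
program.

```python
import os
import pickle
import tempfile

# Hypothetical in-memory structure: generator -> curve name -> (P, IHR) points.
generators = {
    "Skunk Creek 1": {
        "normal": [(63, 8.513), (105, 8.907), (241, 9.411), (455, 10.202)],
        "constrained": [(63, 8.514), (103, 9.022), (223, 9.511), (415, 10.102)],
    }
}


def save_snapshot(obj, path):
    """Write obj to path in pickle's binary format."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_snapshot(path):
    """Read back an object written by save_snapshot()."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip through a file in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "generators.pickle")
save_snapshot(generators, path)
restored = load_snapshot(path)
```

The file is not meant to be read or edited by a human, and you should only
unpickle files you created yourself, since loading a pickle can run
arbitrary code.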

Or, did I miss something and others have already produced the data using
other tools, in which case you have to read it in at least once?

-----Original Message-----
From: Python-list <python-list-bounces+avigross=verizon.net at python.org> On
Behalf Of Michael F. Stemper
Sent: Saturday, September 25, 2021 4:20 PM
To: python-list at python.org
Subject: Re: XML Considered Harmful

On 21/09/2021 13.12, Michael F. Stemper wrote:

> If XML is not the way to package data, what is the recommended 
> approach?

Well, there have been a lot of ideas put forth on this thread, many more
than I expected. I'd like to thank everyone who took the time to contribute.

Most of the reasons given for avoiding XML appear to be along the lines of
"XML has all of these different options that it supports."

However, it seems that I could ignore 99% of those things and just use a
teeny subset of its capabilities. For instance, if I modeled a fuel like
this:

   <Fuel name="Montana Sub-Bituminous">
     <uom>ton</uom>
     <price>21.96</price>
     <heat_content>18.2</heat_content>
   </Fuel>

and a generating unit like this:

   <Generator name="Skunk Creek 1">
     <IHRcurve name="normal">
       <point P="63" IHR="8.513"/>
       <point P="105" IHR="8.907"/>
       <point P="241" IHR="9.411"/>
       <point P="455" IHR="10.202"/>
     </IHRcurve>
     <IHRcurve name="constrained">
       <point P="63" IHR="8.514"/>
       <point P="103" IHR="9.022"/>
       <point P="223" IHR="9.511"/>
       <point P="415" IHR="10.102"/>
     </IHRcurve>
   </Generator>

why would the fact that I could have chosen, instead, to model the unit of
measure as an attribute of the fuel, or its name as a sub-element matter?
Once the modeling decision has been made, all of the decisions that might
have been would seem to be irrelevant.
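
For what it's worth, here is a minimal sketch of reading the generator above
with nothing but elements and attributes, using the standard library's
xml.etree.ElementTree. The load_generator() name and the shape of the
returned dictionary are just assumptions for illustration:

```python
import xml.etree.ElementTree as ET

GENERATOR_XML = """<Generator name="Skunk Creek 1">
  <IHRcurve name="normal">
    <point P="63" IHR="8.513"/>
    <point P="105" IHR="8.907"/>
    <point P="241" IHR="9.411"/>
    <point P="455" IHR="10.202"/>
  </IHRcurve>
  <IHRcurve name="constrained">
    <point P="63" IHR="8.514"/>
    <point P="103" IHR="9.022"/>
    <point P="223" IHR="9.511"/>
    <point P="415" IHR="10.102"/>
  </IHRcurve>
</Generator>"""


def load_generator(xml_text):
    """Return (generator name, {curve name: [(P, IHR), ...]})."""
    root = ET.fromstring(xml_text)
    curves = {}
    for curve in root.findall("IHRcurve"):
        # Each <point> carries its values as attributes, so convert them here.
        curves[curve.get("name")] = [
            (float(pt.get("P")), float(pt.get("IHR")))
            for pt in curve.findall("point")
        ]
    return root.get("name"), curves


name, curves = load_generator(GENERATOR_XML)
```

Nothing here touches namespaces, DTDs, entities or any of the other options;
the number of points per curve is also free to vary.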

Some years back, IEC's TC57 came up with CIM[1]. This nailed down a lot of
decisions. The fact that other decisions could have been made doesn't seem
to keep utilities from going forward with it as an enterprise-wide data
model.

My current interests are not anywhere so expansive, but it seems that the
situations are at least similar:
1. Look at an endless range of options for a data model.
2. Pick one.
3. Run with it.

To clearly state my (revised) question:

   Why does the existence of XML's many options cause a problem
   for my use case?

Other reactions:

Somebody pointed out that some approaches would require that I climb a
learning curve. That's appreciated, although learning new things is always
good.

NestedText looks cool, and a lot like YAML. Having not gotten around to
playing with YAML yet, I was surprised to learn that it tries to guess data
types. This sounds as if it could lead to the same type of problems that led
to the names of some genes being turned into dates.

It was suggested that I use an RDBMS, such as sqlite3, for the input data.
I've used sqlite3 for real-time data exchange between concurrently-running
programs. However, I don't see syntax like:

sqlite> INSERT INTO Fuels
    ...> (name,uom,price,heat_content)
    ...> VALUES ("Montana Sub-Bituminous", "ton", 21.96, 13.65);

as being nearly as readable as the XML that I've sketched above.
Yeah, I could write a program to do this, but that doesn't really change
anything, since I'd still need to get the data into the program.

(Changing a value would be even worse, requiring the dreaded UPDATE
statement, instead of five seconds in vi.)
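
For comparison, a sketch of the same insert and a price change done from
Python's sqlite3 module rather than the sqlite shell. The table definition
and the new price of 22.50 are invented for the example; only the inserted
values come from the statement above:

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# Hypothetical schema for the Fuels table.
conn.execute(
    "CREATE TABLE Fuels ("
    "name TEXT PRIMARY KEY, uom TEXT, price REAL, heat_content REAL)"
)

# Same row as the INSERT above, passed as parameters instead of literals.
conn.execute(
    "INSERT INTO Fuels (name, uom, price, heat_content) VALUES (?, ?, ?, ?)",
    ("Montana Sub-Bituminous", "ton", 21.96, 13.65),
)

# Changing a value takes UPDATE ... SET ... WHERE (22.50 is a made-up price).
conn.execute(
    "UPDATE Fuels SET price = ? WHERE name = ?",
    (22.50, "Montana Sub-Bituminous"),
)
conn.commit()

price = conn.execute(
    "SELECT price FROM Fuels WHERE name = ?",
    ("Montana Sub-Bituminous",),
).fetchone()[0]
```

It works, but it does not make the data itself any easier to read or edit
by hand.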

Many of the problems listed for CSV, which come from its lack of
standardization, seem similar to those given for XML. "Commas or tabs?" "How
are new-lines represented?" If I were to use CSV, I'd be able to just pick
answers. However, fitting hierarchical data into rows/columns just seems
wrong, so I doubt that I'll end up going that way.

As far as disambiguating authors, I believe that most journals are now
expecting an ORCID[2] (which doesn't help with papers published before that
came around).

As far as use of XML to store program state, I wouldn't ever consider that.
As noted above, I've used an RDBMS to do so.
It handles all of the concurrency issues for me. The current use case is
specifically for raw, static input.

Fascinating to find out that XML was originally designed to mark up text,
especially legal text.

It was nice to be reminded of what Matt Parker looked like when he had hair.

[1] <https://en.wikipedia.org/wiki/Common_Information_Model_(electricity)>
[2] <https://orcid.org/>
--
Michael F. Stemper
Psalm 82:3-4