XML Considered Harmful

Michael F. Stemper michael.stemper at gmail.com
Sat Sep 25 16:20:19 EDT 2021


On 21/09/2021 13.12, Michael F. Stemper wrote:

> If XML is not the way to package data, what is the recommended
> approach?

Well, there have been a lot of ideas put forth on this thread,
many more than I expected. I'd like to thank everyone who
took the time to contribute.

Most of the reasons given for avoiding XML appear to be along
the lines of "XML has all of these different options that it
supports."

However, it seems that I could ignore 99% of those things and
just use a teeny subset of its capabilities. For instance, if
I modeled a fuel like this:

   <Fuel name="Montana Sub-Bituminous">
     <uom>ton</uom>
     <price>21.96</price>
     <heat_content>18.2</heat_content>
   </Fuel>

and a generating unit like this:

   <Generator name="Skunk Creek 1">
     <IHRcurve name="normal">
       <point P="63" IHR="8.513"/>
       <point P="105" IHR="8.907"/>
       <point P="241" IHR="9.411"/>
       <point P="455" IHR="10.202"/>
     </IHRcurve>
     <IHRcurve name="constrained">
       <point P="63" IHR="8.514"/>
       <point P="103" IHR="9.022"/>
       <point P="223" IHR="9.511"/>
       <point P="415" IHR="10.102"/>
     </IHRcurve>
   </Generator>

why would the fact that I could have chosen, instead, to model
the unit of measure as an attribute of the fuel, or its name
as a sub-element matter? Once the modeling decision has been
made, all of the decisions that might have been would seem to
be irrelevant.
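
To illustrate how small the subset in use really is, here is a minimal sketch of reading the generator model with Python's standard xml.etree.ElementTree. The function name load_generator and the shape of the returned data are assumptions made for illustration; the sample is abbreviated to a single curve.

```python
import xml.etree.ElementTree as ET

# The generator sketched above, abbreviated to one curve.
GENERATOR_XML = """
<Generator name="Skunk Creek 1">
  <IHRcurve name="normal">
    <point P="63" IHR="8.513"/>
    <point P="105" IHR="8.907"/>
    <point P="241" IHR="9.411"/>
    <point P="455" IHR="10.202"/>
  </IHRcurve>
</Generator>
"""


def load_generator(xml_text):
    """Return (name, {curve_name: [(P, IHR), ...]}) for one <Generator>.

    Hypothetical helper: it relies only on the elements and attributes
    chosen in the sketch and ignores everything else XML could express.
    """
    root = ET.fromstring(xml_text)
    curves = {}
    for curve in root.findall("IHRcurve"):
        curves[curve.get("name")] = [
            (float(pt.get("P")), float(pt.get("IHR")))
            for pt in curve.findall("point")
        ]
    return root.get("name"), curves


name, curves = load_generator(GENERATOR_XML)
```

Only elements, attributes, and three lookup calls are involved; namespaces, DTDs, entities, and mixed content never come up.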

Some years back, IEC's TC57 came up with CIM[1]. This nailed down
a lot of decisions. The fact that other decisions could have been
made doesn't seem to keep utilities from going forward with it as
an enterprise-wide data model.

My current interests are nowhere near so expansive, but it seems
that the situations are at least similar:
1. Look at an endless range of options for a data model.
2. Pick one.
3. Run with it.

To clearly state my (revised) question:

   Why does the existence of XML's many options cause a problem
   for my use case?


Other reactions:

Somebody pointed out that some approaches would require that I
climb a learning curve. That's appreciated, although learning
new things is always good.

NestedText looks cool, and a lot like YAML. Having not gotten
around to playing with YAML yet, I was surprised to learn that it
tries to guess data types. This sounds as if it could lead to the
same kind of problem that led spreadsheet software to turn the
names of some genes into dates.

It was suggested that I use an RDBMS, such as sqlite3, for the
input data. I've used sqlite3 for real-time data exchange between
concurrently-running programs. However, I don't see syntax like:

sqlite> INSERT INTO Fuels
    ...> (name,uom,price,heat_content)
    ...> VALUES ('Montana Sub-Bituminous', 'ton', 21.96, 13.65);

as being nearly as readable as the XML that I've sketched above.
Yeah, I could write a program to do this, but that doesn't really
change anything, since I'd still need to get the data into the
program.

(Changing a value would be even worse, requiring the dreaded
UPDATE ... SET statement, instead of five seconds in vi.)
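
For comparison, a minimal sketch of the same insert and a price change done through Python's standard sqlite3 module. The table schema and the new price of 22.50 are assumptions made for illustration, not part of any real data set.

```python
import sqlite3

FUEL = "Montana Sub-Bituminous"

# In-memory database; the Fuels schema here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Fuels ("
    "name TEXT PRIMARY KEY, uom TEXT, price REAL, heat_content REAL)"
)

# Same row as the interactive INSERT, using bound parameters.
conn.execute(
    "INSERT INTO Fuels (name, uom, price, heat_content) VALUES (?, ?, ?, ?)",
    (FUEL, "ton", 21.96, 13.65),
)

# Changing one value takes an UPDATE ... SET ... WHERE statement.
conn.execute("UPDATE Fuels SET price = ? WHERE name = ?", (22.50, FUEL))
conn.commit()

price = conn.execute(
    "SELECT price FROM Fuels WHERE name = ?", (FUEL,)
).fetchone()[0]
```

This works, but the data still has to be typed somewhere before it reaches the program, which is the point made above.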

Many of the problems listed for CSV, which come from its lack of
standardization, seem similar to those given for XML. "Commas
or tabs?" "How are new-lines represented?" If I were to use CSV,
I'd be able to just pick answers. However, fitting hierarchical
data into rows/columns just seems wrong, so I doubt that I'll
end up going that way.
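
A rough sketch of what that flattening might look like, using Python's standard csv module. The column names and the abbreviated point lists are assumptions made for illustration.

```python
import csv
import io

# Two curves for one generator, abbreviated to two points each.
curves = {
    "normal": [(63, 8.513), (105, 8.907)],
    "constrained": [(63, 8.514), (103, 9.022)],
}

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["generator", "curve", "P", "IHR"])  # hypothetical header

# Every point has to repeat its generator and curve name, because a
# flat table has no way to nest points inside curves inside a unit.
for curve_name, points in curves.items():
    for p, ihr in points:
        writer.writerow(["Skunk Creek 1", curve_name, p, ihr])

flat = buf.getvalue()
```

The hierarchy survives only as repeated values in the first two columns, which the reader of the file has to regroup.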

As far as disambiguating authors, I believe that most journals
now expect an ORCID[2] (which doesn't help with papers
published before ORCID came around).

As far as use of XML to store program state, I wouldn't ever
consider that. As noted above, I've used an RDBMS to do so.
It handles all of the concurrency issues for me. The current use
case is specifically for raw, static input.

Fascinating to find out that XML was originally designed to
mark up text, especially legal text.

It was nice to be reminded of what Matt Parker looked like when
he had hair.


[1] <https://en.wikipedia.org/wiki/Common_Information_Model_(electricity)>
[2] <https://orcid.org/>
-- 
Michael F. Stemper
Psalm 82:3-4