XML Considered Harmful

Avi Gross avigross at verizon.net
Thu Sep 23 17:26:44 EDT 2021


Can we agree that there are far more general ways to store data than
anything currently in common use, and that CSV and cousins like TSV are,
in a sense, a subset of the others? There are trees, arbitrary graphs, and
many complex data structures that exist as in-memory objects while a
program is running. Many are not trivial to store.

But some are, if all you have are table-like constructs such as matrices
and data.frames.

I mean any rectangular data with umpteen rows and N columns can trivially
be stored in many other formats, and that holds even when some columns are
allowed to have NA values. The other format would simply have major
categories (one per row) that contain one component per column, and a
missing component represents an NA. Is there any reason JSON or XML cannot
include the contents of any CSV with headers and without loss of info?
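
A minimal sketch of that direction in Python, using only the standard
library. The sample data and the csv_to_records name are made up for
illustration, and it assumes an empty field is how the CSV spells NA (so a
genuinely empty string and an NA are not distinguished). Every value comes
out as a string, since CSV itself carries no types.

```python
import csv
import io
import json

# Hypothetical sample data; an empty field stands in for NA.
CSV_TEXT = """title,author,year
Dune,Frank Herbert,1965
Beowulf,,
"""

def csv_to_records(text):
    """Turn CSV text with a header row into a list of dicts.

    A column whose field is empty is simply left out of that record,
    so a missing key plays the role of NA.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [{k: v for k, v in row.items() if v != ""} for row in reader]

records = csv_to_records(CSV_TEXT)
print(json.dumps(records, indent=2))
```

The same list of dicts could be written out as XML elements instead; JSON
is used here only because json.dumps makes the result easy to see.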

Going the other way is harder. Note that a data.frame type of structure
often imposes restrictions on a CSV and requires everything in a column to
be of the same type, or coercible to a common type. (Well, not always, as
with list columns in R.) But given some arbitrary structure in XML, can you
look at all possible labels and, if it is not too complex, make a CSV with
one or more columns for every possible need? It can be a problem if, say, a
record for an Author allows multiple actual co-authors. Normal books may
let you get by with multiple columns (mostly containing an NA) with names
like author1, author2, author3, ...
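
As a rough sketch of the author1, author2, ... approach, assuming the
nested data has already been parsed into Python dicts (the books list and
the flatten_to_csv name are hypothetical), the number of author columns has
to be worked out from the widest record before anything is written:

```python
import csv
import io

# Hypothetical records, as they might come out of an XML or JSON parse.
books = [
    {"title": "Good Omens", "authors": ["Terry Pratchett", "Neil Gaiman"]},
    {"title": "Dune", "authors": ["Frank Herbert"]},
]

def flatten_to_csv(records):
    """Write records as CSV text with columns author1..authorN."""
    # The column count is dictated by the record with the most authors.
    width = max(len(r["authors"]) for r in records)
    fields = ["title"] + [f"author{i}" for i in range(1, width + 1)]
    out = io.StringIO()
    # restval fills in any author column a record does not use.
    writer = csv.DictWriter(out, fieldnames=fields, restval="NA",
                            lineterminator="\n")
    writer.writeheader()
    for r in records:
        row = {"title": r["title"]}
        for i, name in enumerate(r["authors"], start=1):
            row[f"author{i}"] = name
        writer.writerow(row)
    return out.getvalue()

csv_text = flatten_to_csv(books)
print(csv_text)
```

Adding one record with three authors changes the header for every row,
which is the maintenance cost of this layout.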

But scientific papers seemingly allow oodles of authors, and any time you
update the data, you may need yet another column. And, of course,
processing data where many columns have the same meaning is a bit of a
pain. Data structures can also be nested multiple levels deep, and at some
point CSV is not a reasonable fit unless you play database games and make
multiple tables you can store, retrieve, and combine for complex queries,
as in many relational database systems. Yes, each such table can be a CSV.
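
A small sketch of the multiple-table idea, with made-up data and
hypothetical names throughout (PAPERS_CSV, AUTHORS_CSV, read_table,
titles_by_author). One table holds the papers, a second holds one row per
(paper, author) pair, and a lookup on paper_id stands in for a relational
join:

```python
import csv
import io

# Hypothetical data: two tables, each storable as its own CSV file.
PAPERS_CSV = "paper_id,title\n1,Paper A\n2,Paper B\n"
AUTHORS_CSV = "paper_id,author\n1,Ada\n1,Grace\n1,Alan\n2,Ada\n"

def read_table(text):
    """Read CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def titles_by_author(papers, authors, name):
    """Join the two tables on paper_id; return the titles for one author."""
    title_of = {p["paper_id"]: p["title"] for p in papers}
    return [title_of[a["paper_id"]] for a in authors if a["author"] == name]

papers = read_table(PAPERS_CSV)
authors = read_table(AUTHORS_CSV)
print(titles_by_author(papers, authors, "Ada"))
```

A paper with any number of authors just adds rows to the second table, so
no new columns are ever needed.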

But if you give someone a hammer, they tend to stop using thumbtacks or
other tools. The real question is what kind of data makes good sense for an
application. If a nice rectangular format works, great. Even if not, the
Author problem above can fairly easily be handled by making the author
column something like a character string you compose as "Last1, First1;
Last2, First2; Last3, First3" and that fits fine in a CSV but can be taken
apart in your software if looking for any book by a particular author. Not
optimal, but a workaround I am sure is used.
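
That workaround might look something like the sketch below. The
pack_authors and unpack_authors names are hypothetical, and it assumes no
name contains "; " and no last name contains ", ". The csv module quotes
the packed field because it contains commas, so it survives a round trip.

```python
import csv
import io

def pack_authors(authors):
    """Compose (last, first) pairs into one 'Last, First; Last, First' string."""
    return "; ".join(f"{last}, {first}" for last, first in authors)

def unpack_authors(field):
    """Take the composed string apart again into (last, first) pairs."""
    if not field:
        return []
    return [tuple(part.split(", ", 1)) for part in field.split("; ")]

authors = [("Pratchett", "Terry"), ("Gaiman", "Neil")]
packed = pack_authors(authors)

# Round trip through CSV: one row with a title and the packed author field.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(["Good Omens", packed])
row = next(csv.reader(io.StringIO(buf.getvalue())))

print(buf.getvalue())
print(unpack_authors(row[1]))
```

Searching for any book by a particular author then means unpacking the
field for each row, which is the "not optimal" part.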

But using the most abstract and complex storage method is very often
overkill and, unless you are very good at it, may well be a fairly slow and
even error-prone way to solve a problem.

-----Original Message-----
From: Python-list <python-list-bounces+avigross=verizon.net at python.org> On
Behalf Of Chris Angelico
Sent: Thursday, September 23, 2021 9:27 AM
To: Python <python-list at python.org>
Subject: Re: XML Considered Harmful

On Thu, Sep 23, 2021 at 10:55 PM Mats Wichmann <mats at wichmann.us> wrote:
>
> On 9/22/21 10:31, Dennis Lee Bieber wrote:
>
> >       If you control both the data generation and the data 
> > consumption, finding some format  ...
>
> This is really the key.  I rant at people seeming to believe that csv 
> is THE data interchange format, and it's about as bad as it gets at 
> that, if you have a choice.  xml is noisy but at least (potentially) 
> self-documenting, and ought to be able to recover from certain errors.
> The problem with csv is that a substantial chunk of the world seems to 
> live inside Excel, and so data is commonly both generated in csv so it 
> can be imported into excel and generated in csv as a result of 
> exporting from excel, so the parts often are *not* in your control.
>
> Sigh.

The only people who think that CSV is *the* format are people who habitually
live in spreadsheets. People who move data around the internet, from program
to program, are much more likely to assume that JSON is the sole format. Of
course, there is no single ultimate data interchange format, but JSON is a
lot closer to one than CSV is.

(Or to be more precise: any such thing as a "single ultimate data
interchange format" will be so generic that it isn't enough to define
everything. For instance, "a stream of bytes" is a universal data
interchange format, but that's not ultimately a very useful claim.)

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list


