XML Considered Harmful

Peter J. Holzer hjp-python at hjp.at
Fri Sep 24 14:59:23 EDT 2021


On 2021-09-21 13:12:10 -0500, Michael F. Stemper wrote:
> I read this page right when I was about to write an XML parser
> to get data into the code for a research project I'm working on.
> It seems to me that XML is the right approach for this sort of
> thing, especially since the data is hierarchical in nature.
> 
> Does the advice on that page mean that I should find some other
> way to get data into my programs, or does it refer to some kind
> of misuse/abuse of XML for something that it wasn't designed
> for?
> 
> If XML is not the way to package data, what is the recommended
> approach?

There are a gazillion formats and depending on your needs one of them
might be perfect. Or you may have to define you own bespoke format (I
mean, nobody (except Matt Parker) tries to represent images or videos as
CSVs: There's PNG and JPEG and WEBP and H.264 and AV1 and whatever for
that).

Of the three formats discussed here my take is:

CSV: Good for tabular data of a single data type (strings). As soon as
there's a second data type (numbers, dates, ...) you leave standard
territory and are into "private agreements".

JSON: Has a few primitive data types (bool, number, string) and a two
compound types (list, dict(string -> any)). Still missing many
frequently used data types (e.g. dates) and has no standard way to
denote composite types. But its simple and if it's sufficient for your
needs, use it.

XML: Originally invented for text markup, and that shows. Can represent
different types (via tags), can define those types (via DTD and/or
schemas), can identify schemas in a globally-unique way and you can mix
them all in a single document (and there are tools available to validate
your files). But those features make it very complex (you almost
certainly don't want to write your own parser) and you really have to
understand the data model (especiall namespaces) to use it.


You can of course represent any data in any format if you jump through
enough hoops, but the real question is "does the data I have fit
naturally within the data model of the format I'm trying to use". If it
doesn't, look for something else. For me, CSV, JSON and XML form a
hierarchy where each can naturally represent all the data of its
predecessors, but not vice versa.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <https://mail.python.org/pipermail/python-list/attachments/20210924/07563bdf/attachment.sig>


More information about the Python-list mailing list