XML Considered Harmful

Peter J. Holzer hjp-python at hjp.at
Sat Sep 25 06:46:55 EDT 2021


On 2021-09-24 23:32:47 -0000, Jon Ribbens via Python-list wrote:
> On 2021-09-24, Chris Angelico <rosuav at gmail.com> wrote:
> > On Sat, Sep 25, 2021 at 8:53 AM dn via Python-list
> ><python-list at python.org> wrote:
> >> On 25/09/2021 06.59, Peter J. Holzer wrote:
> >> > CSV: Good for tabular data of a single data type (strings). As soon as
> >> > there's a second data type (numbers, dates, ...) you leave standard
> >> > territory and are into "private agreements".
> 
> CSV is not good for strings, as there is no one specification of how to
> encode things like newlines and commas within the strings, so you may
> find that your CSV data transfer fails or even silently corrupts data.

Those two cases are actually pretty straightforward: Just enclose the
field in quotes.

Handling quotes is less standardized. I think doubling quotes is much more
common than an escape character, but I've certainly seen both.

But if you get down to it, the problems with CSV start at a much lower
level:

1) The encoding is not defined. These days UTF-8 (with our without BOM)
    is pretty common, but I still regularly get files in Windows-1252
    encoding and occasionally something else.

2) The record separator isn't defined. CRLF is most common, followed by
   LF. But just recently I got a file with CR (Does Eurostat still use
   some Macs with MacOS 9?)

3) The field separator isn't defined. Officially the format is known as
   "comma separated values", but in my neck of the woods it's actually
   semicolon-separated in the vast majority of cases.

So even for the most simple files there are three parameters the sender
and the receiver have to agree on.


> >> > JSON: Has a few primitive data types (bool, number, string) and a two
> >> > compound types (list, dict(string -> any)). Still missing many
> >> > frequently used data types (e.g. dates) and has no standard way to
> >> > denote composite types. But its simple and if it's sufficient for your
> >> > needs, use it.
> 
> JSON Schema provides a way to denote composite types.

I probably wasn't clear what I meant. In XML, every element has a tag,
which is basically its type. So by looking at an XML file (without
reference to a schema) you can tell what each element is. And a
validator can say something like "expected a 'product' or 'service'
element here but found a 'person'".

In JSON everything is just an object or a list. You may guess that an
object with a field "product_id" is a product, but is one with "name":
"Billy" a person or a piece of furniture?

I'm not familiar with JSON schema (I know that it exists and I've read a
tutorial or two but I've never used it in a real project), but as far as
I know it doesn't change that. It describes the structure of a JSON
document but it doesn't add type information to that document. So a
validator can at best guess what the malformed thing it just found was
supposed to be.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <https://mail.python.org/pipermail/python-list/attachments/20210925/97103cbd/attachment.sig>


More information about the Python-list mailing list