XML Considered Harmful

Avi Gross avigross at verizon.net
Thu Sep 23 17:59:41 EDT 2021


What you are describing, Stefan, is what I meant by emulating a relational database with tables.

And, FYI, there is no guarantee that two authors with the same name will not be assumed to be the same person.

Besides the lack of any one official CSV format, there are oodles of features I have seen that are normally external to the CSV. For example, I have often read in data from a CSV or similar file where you could tell the software: to treat a blank or 999 as NA; what marks a line in the file as a comment to be ignored; whether the separator is a single space or any run of whitespace; what quote character lets you, say, hide a comma inside a field; how to handle escapes; whether to skip blank lines; and more.
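
As a rough illustration of how such options live outside the file, here is a minimal sketch using only Python's standard csv module. The function name read_rows, its parameter names, and the sample data are all made up for this example; real readers expose similar knobs under their own names.

```python
import csv
import io

def read_rows(text, na_values=("", "999"), comment="#",
              delimiter=",", quotechar='"'):
    """Yield rows as lists, replacing NA markers with None.

    Every option here is supplied by the caller; nothing in the
    file itself says which values mean NA or what a comment is.
    """
    # Drop blank lines and comment lines before the csv module sees them.
    # (Simplification: this would misfire on a quoted field that spans
    # several physical lines.)
    lines = (
        line for line in io.StringIO(text)
        if line.strip() and not line.lstrip().startswith(comment)
    )
    reader = csv.reader(lines, delimiter=delimiter, quotechar=quotechar,
                        skipinitialspace=True)
    for row in reader:
        yield [None if field.strip() in na_values else field
               for field in row]

# Hypothetical sample: a comment, a blank line, a quoted comma,
# a 999 standing in for NA, and an empty trailing field.
sample = '''# hypothetical sample data
name, age, city

"Smith, Jane", 42, Boston
Bob, 999,
'''

rows = list(read_rows(sample))
for row in rows:
    print(row)
```

Change any of the defaults and the same bytes yield different data, which is the point: the meaning sits in the call, not in the file.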

Now a really good design might place some metadata into the file that can be used to set defaults for things like that or incorporate them into the format unambiguously. It might calculate the likely data type for various fields and store that in the metadata. So even if you stored rectangular data in a CSV file, perhaps the early lines would be in some format that can be read as comments and supply some info like the above.
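
To make that idea concrete, here is a small sketch of what such a self-describing file could look like. The "#key=value" header convention and the keys delimiter, na and types are invented for this example; I am not aware of a CSV variant that standardizes exactly this.

```python
import csv
import io

def read_with_metadata(text, prefix="#"):
    """Read a CSV whose leading '#key=value' lines carry its own defaults.

    The header convention is hypothetical, made up for illustration.
    """
    meta, body = {}, []
    for line in io.StringIO(text):
        if line.startswith(prefix):
            key, sep, value = line[len(prefix):].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
        else:
            body.append(line)

    na = meta.get("na", "")
    known = {"int": int, "float": float, "str": str}
    converters = [known[t.strip()] for t in meta["types"].split(",")]

    reader = csv.reader(body, delimiter=meta.get("delimiter", ","))
    header = next(reader)
    rows = [
        [None if field == na else conv(field)
         for field, conv in zip(row, converters)]
        for row in reader if row  # skip blank lines
    ]
    return meta, header, rows

# Hypothetical file: the first three lines describe the rest.
sample = """#delimiter=;
#na=999
#types=str,int,float
name;age;score
Ann;31;4.5
Bob;999;3.0
"""

meta, header, rows = read_with_metadata(sample)
print(header)
print(rows)
```

A reader that does not know the convention can still treat the "#" lines as comments and get the rectangular data, just without the types and the NA marker.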

Are any of the CSV variants more like that?

-----Original Message-----
From: Python-list <python-list-bounces+avigross=verizon.net at python.org> On Behalf Of Stefan Ram
Sent: Thursday, September 23, 2021 5:43 PM
To: python-list at python.org
Subject: Re: XML Considered Harmful

"Avi Gross" <avigross at verizon.net> writes:
>But scientific papers seemingly allow oodles of authors and any time 
>you update the data, you may need yet another column.

  You can use three CSV files: papers, persons, and authors:

  papers.csv

1, "Is the accelerated expansion evidence of a change of signature?"

  persons.csv

1, Marc Mars

  authors.csv

1, 1

  I.e., paper 1 is authored by person 1.

  Now, when we learn that José M. M. Senovilla is also a
  co-author of "Is the accelerated expansion evidence of a
  forthcoming change of signature?", we only have to add
  new rows, no new columns.

  papers.csv

1, "Is the accelerated expansion evidence of a change of signature?"

  persons.csv

1, "Marc Mars"
2, "José M. M. Senovilla"

  authors.csv

1, 1
1, 2
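
  A minimal sketch of reading these three tables back and joining
  them, using Python's csv module. The file contents are inlined as
  strings so the example is self-contained; the helper name load and
  the variable names are made up for this illustration.

```python
import csv
import io

# Contents of the three files, inlined for a self-contained example.
papers_csv = '1, "Is the accelerated expansion evidence of a change of signature?"\n'
persons_csv = '1, "Marc Mars"\n2, "José M. M. Senovilla"\n'
authors_csv = '1, 1\n1, 2\n'

def load(text):
    # skipinitialspace lets the quote after ", " be seen as a quote.
    return list(csv.reader(io.StringIO(text), skipinitialspace=True))

papers = {int(pid): title for pid, title in load(papers_csv)}
persons = {int(pid): name for pid, name in load(persons_csv)}

# authors.csv is the link table: one row per (paper, person) pair.
authors_of = {}
for paper_id, person_id in load(authors_csv):
    authors_of.setdefault(int(paper_id), []).append(persons[int(person_id)])

for paper_id, names in authors_of.items():
    print(papers[paper_id], "by", ", ".join(names))
```

  Since authors.csv refers to persons by number rather than by name,
  two different people who share a name would simply get two rows in
  persons.csv with different numbers.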

  The real problem with CSV is that there is no CSV.

  This is not a specific data language with a specific
  specification. Instead it is a vague designation for
  a plethora of CSV dialects, which usually do not even
  have a specification. Compare this with XML. XML has
  a sole specification managed by the W3C.
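
  A small sketch of how much the dialect matters, using Python's csv
  module: the same line is read under two settings of one real reader
  option, skipinitialspace. The sample line is made up for this
  example.

```python
import csv
import io

# Hypothetical line: a space after the comma, then a quoted field
# that itself contains a comma.
line = '1, "Mars, Marc"\n'

# Default dialect: the space makes the quote an ordinary character,
# so the embedded comma splits the field.
strict = next(csv.reader(io.StringIO(line)))

# Same bytes, one option changed: the space is skipped and the
# quotes protect the comma.
lenient = next(csv.reader(io.StringIO(line), skipinitialspace=True))

print(len(strict), strict)
print(len(lenient), lenient)
```

  Neither reading is wrong; each is right for a different dialect,
  and nothing in the line says which dialect it was written in.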


--
https://mail.python.org/mailman/listinfo/python-list


