[Python-ideas] Stdlib YAML evolution (Was: PEP 426, YAML in the stdlib and implementation discovery)

Wed Nov 13 18:02:47 CET 2013

On Mon, Jun 3, 2013 at 2:53 AM, Andrew Barnert <abarnert at yahoo.com> wrote:
> From: anatoly techtonik <techtonik at gmail.com>
> Sent: Sunday, June 2, 2013 11:23 AM
>
>
>>FWIW, I am +1 on for the ability to read YAML based configs Python
>>without dependencies, but waiting for several years is hard.
>
>
> With all due respect, I don't think you've read even a one-sentence description of YAML, so your entire post is nonsense.

I'll try to clarify my post so that it is clear to you. Please ask if
something is still unclear.

You're right: I don't read specifications before using things.
What do I personally need from YAML?
These are examples of files I use daily:

http://tmuxp.readthedocs.org/en/latest/examples.html
http://code.google.com/p/rietveld/source/browse/app.yaml
https://github.com/agschwender/pilbox/blob/master/provisioning/playbook.yml
http://pastebin.com/RG7g260k (OpenXcom save format)

> The first sentence of the abstract says, "YAML… is a…data serialization language designed around the common native data types of agile programming languages." So, your idea that we shouldn't use it for serialization, and shouldn't map it to native Python data types, is ridiculous.

I don't really care about the abstract. I am a complaining user, not
the smart guy who wrote the spec. So my thinking is the following:

1. None of the examples above is a persistence format for serialized
   native programming language data types. They are just nested
   mappings and lists: strictly a two-dimensional tree data structure,
   even the OpenXcom one. It is YAML, or, as I said, a subset of YAML,
   and that's why I deliberately called this format "yamlish".

2. Regardless of any desire to use this proposal as an opportunity to
   see the full YAML 1.2 spec implemented in the Python stdlib, I am
   going to resist. I need to work with a *safe data format* that is
   "human friendly", and I put *safe format* over *serialization
   format*.
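
To make point 1 concrete, here is a rough Python sketch of what I mean by
"just nested mappings and lists". The helper name `is_plain_tree` and the
sample config are made up for illustration (the keys only loosely follow
the tmuxp examples); this is an assumption about how to phrase the idea,
not an existing API.

```python
# Hypothetical helper: checks that a value is a plain tree of
# mappings, lists and scalars, with no custom objects inside.
SCALARS = (str, int, float, bool, type(None))


def is_plain_tree(value):
    """Return True if value is built only from dicts, lists and scalars."""
    if isinstance(value, SCALARS):
        return True
    if isinstance(value, list):
        return all(is_plain_tree(item) for item in value)
    if isinstance(value, dict):
        return all(
            isinstance(key, str) and is_plain_tree(child)
            for key, child in value.items()
        )
    return False  # sets, class instances, functions etc. are not "yamlish"


# Illustrative data only; shaped like a small session config.
config = {
    "session_name": "demo",
    "windows": [
        {"window_name": "editor", "panes": ["vim", "git status"]},
    ],
}
```

Anything that passes such a check can be written out and read back without
the parser ever having to construct an arbitrary object.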

> You specifically suggest mapping YAML to XML so we can treat it as a structured document. From the "Relation to XML" section: "YAML is primarily a data serialization language. XML was… designed to support structured documentation."

Where? Oh, do you mean this one:

"The ideal output for the first version should be generic
tree structure with defined names for YAML elements. The tree that can be
represented as XML where these names are tags. "

It is not about a "structured document", it is about a "structured data format".

A "tree that can be represented as XML" is not an "XML tree". XML here is just
an example of a structured nested format that everybody is aware of. I want
to say that this "tree structure" should be plain, and that a 1:1 mapping to
XML is a necessary and sufficient requirement.
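
As a sketch of that 1:1 mapping, assuming the parsed result is a plain tree
of dicts, lists and scalars: the tag names `mapping`, `entry`, `sequence`
and `scalar` and the function name `to_xml` are my own placeholders, not
names taken from the YAML spec or from any library.

```python
import xml.etree.ElementTree as ET


def to_xml(value):
    """Convert a plain dict/list/scalar tree into an ElementTree element."""
    if isinstance(value, dict):
        elem = ET.Element("mapping")
        for key, child in value.items():
            # Each key becomes an <entry key="..."> wrapping its value.
            entry = ET.SubElement(elem, "entry", key=str(key))
            entry.append(to_xml(child))
        return elem
    if isinstance(value, list):
        elem = ET.Element("sequence")
        for child in value:
            elem.append(to_xml(child))
        return elem
    # Everything else is treated as a scalar and stored as text.
    elem = ET.Element("scalar")
    elem.text = str(value)
    return elem


xml_text = ET.tostring(to_xml({"panes": ["vim"]}), encoding="unicode")
```

XML is used here only as a familiar notation for the tree; nothing in the
sketch requires an XML library at parse time.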

> You suggest that we shouldn't build all of YAML, just some bare-minimum subset that's good enough to get started. JSON is already _more_ than a bare-minimum subset of YAML, so we're already done.

I didn't know that JSON is not compatible with YAML. Still, I am not
sure I understand how your argument that "JSON is not YAML" makes us
"done" with a minimal implementation of YAML.

The module name, "yamlish", defines its purpose as something that my
poor language skills can verbalize as "provide support for parsing and
writing files in formats that are subsets of YAML, used to store
generic, user-editable, non-Python-specific declarative data such as
configurations, save files, settings etc.". Because I am not a CS
major, I can't describe exactly what the examples I provided have in
common, or how they differ from ordinary programming language objects
serialized into YAML. I feel that these examples are "yamlish", and I
would very much appreciate it if somebody could come up with a proper
*definition* of the characteristics of the simple data formats (which
are still YAML) that give this feeling.

Such a definition will greatly help to keep this moving in the right direction.

> But you'd also like some data-driven way to extend this. YAML has already designed exactly that. Once you have the core schema, you can add new types, and the syntax for those types is data-driven (although the semantics are really only defined in hand-wavy English and probably require code to implement, but I'm not sure how you expect your proposal to be any different, unless you're proposing something like XML Schema). So, either the necessary subset of YAML you want is the entire spec, or you want to do an equal amount of work building something just as complex but not actually YAML.

No, it is not data-driven support for extension in the "yamlish" format.
It is a data-driven process of writing a parser for "yamlish": you get
one example, parse it, get the output, write a test, get another
example, parse it, get the output, run the previous tests. The "yamlish"
format is only for common, human-understandable data files.
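
Roughly, that process could look like the sketch below. The toy `parse`
function, the `EXAMPLES` table and `failing_examples` are all hypothetical
names invented for this illustration; the parser handles only flat
"key: value" lines and stands in for whatever the real one would be.

```python
def parse(text):
    """Toy parser: handles only flat 'key: value' lines so far."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        key, _, value = line.partition(":")
        result[key.strip()] = value.strip()
    return result


# Every real-world sample that has been parsed once stays here
# as a permanent regression case: (input text, expected output).
EXAMPLES = [
    ("name: demo\n", {"name": "demo"}),
    ("name: demo\n\nport: 8080\n", {"name": "demo", "port": "8080"}),
]


def failing_examples(parser, examples):
    """Return the input texts whose parsed output differs from the expected one."""
    return [text for text, expected in examples if parser(text) != expected]


broken = failing_examples(parse, EXAMPLES)
```

A new format is accepted only if, after the parser is extended to handle
it, `failing_examples` still returns an empty list for all earlier samples.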

Perhaps expanding the idea of a "yamlish" format with the development
process and with details of my "own data transformation theory" was
not a good idea, but it was my only chance to find the motivation to
write the stuff down. =) Sorry for the overload, and let me clarify
things a little. I proposed a process for extending the "yamlish"
parser to parse more backward-compatible "yamlish" data formats. There
is no mechanism to support conflicting formats, or formats that change
the output for existing input. That's it. There is no additional API
for full YAML, so there is no complexity from maintaining and
supporting extra features or the full YAML spec.

The `datatrans` framework I was speaking about is a possible
implementation of a lib to transform 2D structures between different
formats. You know, the data transformation process is all the same at
some level. One level above, I can even say that everything we do in CS
is just data transformation. It is not related to the "yamlish" format
definition. The only important thing is that `datatrans` enables many
input and many output formats that can be represented as a 2D annotated
(or generic) tree.

> The idea of building a useful subset of YAML isn't a bad one. But the way to do that is to go through the features of YAML that JSON doesn't have, and decide which ones you want. For example, YAML with the core schema, but no aliases, no plain strings, and no explicit tags is basically JSON with indented block structure, raw strings, and useful synonyms for key constants (so you can write True instead of true). You could even carefully import a few useful definitions from the type library as long as they're unambiguous (e.g., timestamp). That gives you most of the advantages of YAML that don't bring any safety risks, and its output would be interpretable as full YAML, and it might be a little easier to implement than the full spec. But that has very little to do with your proposal. In particular, leaving out the data-driven features of YAML is what makes it safe and simple.

Now I feel that we are basically thinking about the same things -
simplicity and safety.

I didn't read the spec, so I don't know what is in the core YAML schema;
here you know much better than I do what needs to be filtered out. My
thought was to use examples to see what should be filtered out, because
iterating over the spec will bring in many more "useful" features that
forward-thinking people might want, but which may be harmful to keeping
this small and simple.

I really like YAML's brevity compared to JSON and other structured data
formats (the tmuxp examples page is a good one). Support for an indented
data format is also natural for an indented language. But it is hard to
get the format right and not to spoil it with overengineering.

About safety.

I believe that these "data-driven features of YAML" are the point of
confusion. I recall that the YAML spec provides some declarative
mechanism for extensions. That is not what I mean. My data-driven
approach is just "don't design anything upfront, use existing widely
used data examples as the spec of the data that needs to be parsed".
And yes, I don't need this YAML extensibility feature, which I too
believe makes YAML unsafe.

I need YAML as a format for indented data in a text file. Nothing more.
YAML without the "extra processing" that leads to potential hacks and
execution of unwanted code. I just want to make sure that the data
format is safe. Currently, the Python stdlib lacks a safe serialization
format - the docs are bleeding red with warnings without specifying any
alternatives. I like to call it "yamlish", because if it is named YAML,
people will demand dynamism, OOPy "constructor/destructor" tricks, and
sooner or later the module users will be pwned, as happened with other
serialization modules before. Therefore I don't want "serialization as
a feature", but I don't mind "serialization as a side effect" if it is
compatible with a good intuitive API AND improves the speed without
sacrificing _clarity_ and safety.

_clarity_ here is the understanding that "there is no way that the
'yamlish' format can be unsafe", at all times.
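
For illustration only, here is a minimal sketch of a parser with that
property. It is an assumption about what "yamlish" could mean, not an
agreed design: `parse_yamlish` and `_parse_block` are hypothetical names,
only block mappings and lists of scalars are handled (no lists of
mappings, no flow style, no anchors), and every scalar stays a string.

```python
def parse_yamlish(text):
    """Parse a tiny indented subset into dicts, lists and strings only."""
    lines = []
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        indent = len(raw) - len(raw.lstrip(" "))
        lines.append((indent, stripped))
    if not lines:
        return None
    value, pos = _parse_block(lines, 0, lines[0][0])
    if pos != len(lines):
        raise ValueError("unexpected indentation at: %r" % lines[pos][1])
    return value


def _parse_block(lines, pos, indent):
    """Parse consecutive lines at exactly `indent`; return (value, next_pos)."""
    is_list = lines[pos][1].startswith("- ")
    result = [] if is_list else {}
    while pos < len(lines) and lines[pos][0] == indent:
        content = lines[pos][1]
        if is_list:
            if not content.startswith("- "):
                raise ValueError("expected list item: %r" % content)
            result.append(content[2:].strip())
            pos += 1
            continue
        key, sep, rest = content.partition(":")
        if not sep or content.startswith("- "):
            raise ValueError("expected 'key: value': %r" % content)
        key, rest = key.strip(), rest.strip()
        pos += 1
        if rest:
            result[key] = rest  # scalars are never interpreted
        elif pos < len(lines) and lines[pos][0] > indent:
            # A bare "key:" followed by deeper lines opens a nested block.
            result[key], pos = _parse_block(lines, pos, lines[pos][0])
        else:
            result[key] = ""
    return result, pos


sample = "session: demo\nwindows:\n  - editor\n  - logs\noptions:\n  shell: bash\n"
parsed = parse_yamlish(sample)
```

Because the code only ever creates `dict`, `list` and `str` values, a tag
such as `!!python/object/apply:os.system` comes back as inert text; there
is no code path that could construct or call anything.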

> Meanwhile, I think what you actually want is XSLT processors to convert YAML to and from XML. Fortunately, the YAML community is already working on that at http://www.yaml.org/xml. Then you don't need any new Python code at all; just convert your YAML to XML and use whichever XML library (in the stdlib or not), and you're done.

XSLT, that declarative Turing-complete language. I am fed up with it.
Complexity and performance ruin the beautiful theory. I think that
Turing-completeness is a trap - solving its gestalts gives a good
feeling when you learn it, but it has nothing to do with real-world
problems. XSLT processors hog memory AND are slow at the same time.
Debugging XSLT is impossible, because the process is obscure. I guess
that it is also easily exploitable for DoS. XSLT? Not anymore, thanks.

XML has only one advantage over all other formats - auto-discoverable validation
schemas. That's why it is still so popular.

FWIW.
Right now Python doesn't have any safe native data type for structured
data - only linked objects and references. Some time ago I tried to
introduce a solution for handling structured data by proposing 2D
(two-dimensional) terminology with a generic tree as the base type. But
the post became too complicated and lacked pictures, and I was unable to
keep the discussion going.

I don't want this idea to be laid to rest in the mailing list archives,
so if you know how to write such a minimal (and safe, and fast) parser
in Python (that is also maintainable), please tell me.

If an additional parser language is inevitable, maybe somebody knows of
a comparison site that does for parser languages what
http://todomvc.com/ does for MV* frameworks.