[Python-ideas] PEP 426, YAML in the stdlib and implementation discovery

Stefan Drees stefan at drees.name
Wed Jun 5 08:25:50 CEST 2013


On 04.06.13 21:57, Vinay Sajip wrote:
> Philipp A. <flying-sheep at ...> writes:
>
>> PyYAML might not implement YAML 1.2 fully on paper, but the most useful
>> part of 1.2 (parsing arbitrary JSON) works flawlessly.
>
> Does it? What about this issue?
>
> https://bitbucket.org/xi/pyyaml/issue/11/valid-json-not-being-loaded ...

if "TL;DR":
     summary()

Parsing arbitrary JSON is not guaranteed by "the spec" [1] (version 1.2, 
section 1.3, third paragraph). There I read wording such as """YAML can 
therefore be viewed as a natural superset of JSON, offering improved 
human readability and a more complete information model. This is also 
the case in practice; every JSON file is also a valid YAML file.[...]""" 
The spec even states that the only issue might be """JSON's RFC4627 
requires that mappings keys merely “SHOULD” be unique, while YAML 
insists they “MUST” be. Technically, YAML therefore complies with the 
JSON spec, choosing to treat duplicates as an error. In practice, since 
JSON is silent on the semantics of such duplicates, the only portable 
JSON files are those with unique keys, which are therefore valid YAML 
files. """ (4th paragraph ibid).

So the first sentence is loose enough that even Perl "can [...] be 
viewed" as Python :-) and the second (like the 4th paragraph) is in 
error: JSON allows a backslash inside a double-quoted string value and 
associates no meaning with it, but YAML (read on) does!

So "arbitrary JSON" would have to cover the "Russian doll" style of 
escaping data during serialization for some end target (like the use 
case in the ticket, where some client likes slashes to be preceded by a 
backslash), and such schemes are brittle at best.

Here the YAML spec is very clear, as it uses the C escape characters.

Cf. section 2.3: "The double-quoted style provides escape sequences." 
Escape sequences in YAML are explained in section 5.7, which starts 
with """ All non-printable characters must be escaped. YAML 
escape sequences use the “\” notation common to most modern computer 
languages. Each escape sequence must be parsed into the appropriate 
Unicode character. The original escape sequence is a presentation detail 
and must not be used to convey content information.

Note that escape sequences are only interpreted in double-quoted 
scalars. In all other scalar styles, the “\” character has no special 
meaning and non-printable characters are not available. """

and continues with """ YAML escape sequences are a superset of C’s 
escape sequences:"""
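
To make "each escape sequence must be parsed into the appropriate 
Unicode character" concrete, here is a minimal sketch of such a lookup. 
It is only an illustration: the table is a hypothetical, partial set of 
C-style escapes with no entry for "/" (which is what the argument in 
this post assumes), the names C_STYLE_ESCAPES and decode_double_quoted 
are made up, and it is not how PyYAML or libyaml are implemented.

```python
# Hypothetical, partial escape table (C-style escapes only).
# This is an assumption for illustration, not the table of any real parser.
C_STYLE_ESCAPES = {
    "0": "\0", "a": "\a", "b": "\b", "t": "\t", "n": "\n",
    "v": "\v", "f": "\f", "r": "\r", '"': '"', "\\": "\\",
}


def decode_double_quoted(body, escapes=C_STYLE_ESCAPES):
    """Decode the inside of a double-quoted scalar.

    Every backslash must introduce an escape found in `escapes`;
    anything else is rejected instead of being passed through.
    """
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of scalar")
        key = body[i + 1]
        if key not in escapes:
            raise ValueError("unknown escape sequence '\\%s'" % key)
        out.append(escapes[key])
        i += 2
    return "".join(out)
```

With this table, decode_double_quoted(r"hi\tthere") yields a string 
containing a tab, while decode_double_quoted(r"hi\/there") raises 
ValueError. Passing a table that also maps "/" to "/" makes the second 
call succeed, so whether the input is accepted depends entirely on the 
escape table.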

As the JSON of the ticket is {"key": "hi\/there"}, this will not work in 
YAML as specified (in the relevant escape sequence section), as "\/" 
will not find a target Unicode character to replace it.
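
The ticket's document can be tried directly. The sketch below feeds it 
to Python's standard json module and, only if PyYAML happens to be 
installed, to yaml.safe_load as well. The helper names are made up for 
this example, and no particular PyYAML outcome is assumed, since that 
depends on the installed version.

```python
import json

# The document from the ticket, as a raw string so the backslash
# reaches the parser unchanged.
TICKET_JSON = r'{"key": "hi\/there"}'


def load_as_json(text):
    """Parse text with the standard-library JSON decoder."""
    return json.loads(text)


def load_as_yaml(text):
    """Try the same text with PyYAML, if it is installed.

    Returns a (status, detail) pair. The status is one of
    "unavailable", "loaded" or "error"; which one you get depends
    on the environment and is not assumed here.
    """
    try:
        import yaml  # third-party and optional
    except ImportError:
        return ("unavailable", None)
    try:
        return ("loaded", yaml.safe_load(text))
    except yaml.YAMLError as exc:
        return ("error", str(exc))


if __name__ == "__main__":
    print(load_as_json(TICKET_JSON))
    print(load_as_yaml(TICKET_JSON))
```

The json module accepts the document and returns {'key': 'hi/there'}; 
the second line of output shows how the local YAML parser, if any, 
treats the same bytes.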

This is not a PyYAML or libyaml problem. Consider the following C program:

main(){char a[] = "\/";}

Compiling it does not go through cleanly, as the compiler flags the 
problem: unknown escape sequence '\/'

This is where PyYAML (and libyaml) is correct in throwing an error, as 
the spec mandates escape sequences (and their interpretation).
The above-mentioned 3rd paragraph, which claims that the JSON - YAML 
relation is automatic as long as the keys are unique, is wrong and 
should be corrected, in a version 1.3 or as errata to 1.2 (I would 
prefer the former, as this is IMO a nasty and irritating inconsistency).

References:

[1]: http://www.yaml.org/spec/1.2/spec.html


def summary():
     """If the post is too long, summarize."""
     print('YAML v1.2 is inconsistent and')
     # Double quotes keep the apostrophe legal; "\\/" prints a literal \/
     print("can't parse \\/ in a double quoted string")

All the best,
Stefan.


