[Python-bugs-list] [Bug #123634] Pickle broken on Unicode strings

noreply@sourceforge.net noreply@sourceforge.net
Mon, 18 Dec 2000 18:10:58 -0800


Bug #123634, was updated on 2000-Nov-27 14:03
Here is a current snapshot of the bug.

Project: Python
Category: Unicode
Status: Closed
Resolution: Fixed
Bug Group: None
Priority: 5
Submitted by: tlau
Assigned to: gvanrossum
Summary: Pickle broken on Unicode strings

Details: Two one-liners that produce incorrect output:

>>> cPickle.loads(cPickle.dumps(u''))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
cPickle.UnpicklingError: pickle data was truncated
>>> cPickle.loads(cPickle.dumps(u'\u03b1 alpha\n\u03b2 beta'))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
cPickle.UnpicklingError: invalid load key, '\'.

Unlike regular strings, Unicode strings are written to the pickled representation without escaping; they should be escaped.  The latter bug occurs in both pickle and cPickle; the former affects only cPickle.

Follow-Ups:

Date: 2000-Dec-18 18:10
By: gvanrossum

Comment:
Fixed in both pickle.py (rev. 1.41) and cPickle.c (rev. 2.54).

I've also checked in tests for these and similar endcases.
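
A regression check for the two reported inputs might look like the sketch below. It is not the test code that was checked in. It is written for a current Python 3 interpreter, where `pickle` covers what `cPickle` did and protocol 0 is the text format. The helper name `roundtrip_text` and the extra sample strings are illustrative assumptions.

```python
import pickle


def roundtrip_text(value):
    """Pickle with the text protocol (0) and load the result back."""
    return pickle.loads(pickle.dumps(value, protocol=0))


# The two strings from the report, plus a few similar edge cases (assumed).
samples = ["", "\u03b1 alpha\n\u03b2 beta", "\\", "\\u03b1", "a\nb\\n"]

for s in samples:
    assert roundtrip_text(s) == s, repr(s)
```

On an interpreter that contains the fix, the loop completes silently; a failure reports the offending string.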
-------------------------------------------------------

Date: 2000-Nov-27 14:36
By: tlau

Comment:
One more comment: binary-format pickles are not affected, only text-format pickles.  Thus the part of my patch that applies to the binary section of the save_unicode function should not be applied.
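
Why the binary format is unaffected can be seen from how the string is framed: the UTF-8 bytes are preceded by an explicit length, so a newline in the payload is never read as a token boundary. A minimal sketch, assuming a current Python 3 interpreter where protocol 1 is the old binary format; the variable names are illustrative.

```python
import pickle
import struct

text = "\u03b1 alpha\n\u03b2 beta"
payload = text.encode("utf-8")

# BINUNICODE opcode, 4-byte little-endian length, then the UTF-8 bytes.
expected_prefix = pickle.BINUNICODE + struct.pack("<I", len(payload)) + payload

binary_pickle = pickle.dumps(text, protocol=1)

# The pickle starts with the length-prefixed frame, raw newline included.
starts_with_frame = binary_pickle.startswith(expected_prefix)
has_raw_newline = b"\n" in payload
restored = pickle.loads(binary_pickle)
```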
-------------------------------------------------------

Date: 2000-Nov-27 14:35
By: lemburg

Comment:
Some background (no time to fix this myself):

When I added the Unicode handlers, I wanted to avoid the
problems that the string dump mechanism has with
quoted strings. The encodings used either carry length information
(in binary mode: UTF-8) or do not include the \n character
(in ascii mode: raw-unicode-escape encoding). 

Unfortunately, the raw-unicode-escape codec does not
escape the newline character which is used by pickle
to break the input into tokens.... 

Proposed fix: change the encoding to "unicode-escape"
which doesn't have this problem. This will break code,
but only code that is already broken :-/
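
The difference between the two codecs can be checked directly. The sketch below uses a current Python 3 interpreter rather than Python 2.0, and the variable names are illustrative; it only compares the encoded bytes and does not touch pickle itself.

```python
text = "\u03b1 alpha\n\u03b2 beta"

raw = text.encode("raw-unicode-escape")
escaped = text.encode("unicode-escape")

# raw-unicode-escape leaves the newline as a literal byte, so a
# line-oriented reader sees two separate lines ...
raw_lines = raw.split(b"\n")

# ... while unicode-escape writes it as a backslash followed by "n",
# keeping the whole string on one line.
escaped_lines = escaped.split(b"\n")

print(raw)
print(escaped)
```

Here `raw_lines` has two elements and `escaped_lines` has one, which is why a reader that splits on newlines stops too early with the first encoding.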

-------------------------------------------------------

Date: 2000-Nov-27 14:20
By: tlau

Comment:
Here's my proposed patch to Lib/pickle.py (cPickle should be changed similarly):

--- /scratch/tlau/Python-2.0/Lib/pickle.py      Mon Oct 16 14:49:51 2000
+++ pickle.py   Mon Nov 27 14:07:01 2000
@@ -286,9 +286,9 @@
             encoding = object.encode('utf-8')
             l = len(encoding)
             s = mdumps(l)[1:]
-            self.write(BINUNICODE + s + encoding)
+            self.write(BINUNICODE + `s` + encoding)
         else:
-            self.write(UNICODE + object.encode('raw-unicode-escape') + '\n')
+            self.write(UNICODE + `object.encode('raw-unicode-escape')` + '\n')
 
         memo_len = len(memo)
         self.write(self.put(memo_len))
@@ -627,7 +627,12 @@
     dispatch[BINSTRING] = load_binstring
 
     def load_unicode(self):
-        self.append(unicode(self.readline()[:-1],'raw-unicode-escape'))
+        rep = self.readline()[:-1]
+        if not self._is_string_secure(rep):
+            raise ValueError, "insecure string pickle"
+        rep = eval(rep,
+                   {'__builtins__': {}}) # Let's be careful
+        self.append(unicode(rep, 'raw-unicode-escape'))
     dispatch[UNICODE] = load_unicode
 
     def load_binunicode(self):

-------------------------------------------------------

Date: 2000-Nov-27 14:14
By: gvanrossum

Comment:
Jim, do you have time to look into this?
-------------------------------------------------------

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=123634&group_id=5470