[Python-bugs-list] [Bug #123634] Pickle broken on Unicode strings
noreply@sourceforge.net
Mon, 18 Dec 2000 18:10:58 -0800
Bug #123634, was updated on 2000-Nov-27 14:03
Here is a current snapshot of the bug.
Project: Python
Category: Unicode
Status: Closed
Resolution: Fixed
Bug Group: None
Priority: 5
Submitted by: tlau
Assigned to : gvanrossum
Summary: Pickle broken on Unicode strings
Details: Two one-liners that produce incorrect output:
>>> cPickle.loads(cPickle.dumps(u''))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
cPickle.UnpicklingError: pickle data was truncated
>>> cPickle.loads(cPickle.dumps(u'\u03b1 alpha\n\u03b2 beta'))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
cPickle.UnpicklingError: invalid load key, '\'.
In the text pickle format, Unicode strings are written without escaping, unlike regular strings, which are escaped. They should be escaped as well. The second failure occurs in both pickle and cPickle; the first occurs only in cPickle.
Follow-Ups:
Date: 2000-Dec-18 18:10
By: gvanrossum
Comment:
Fixed in both pickle.py (rev. 1.41) and cPickle.c (rev. 2.54).
I've also checked in tests for these and similar endcases.
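A round-trip check for the reported cases might look like the sketch below. It is not the test code that was checked in; it is written for a modern Python 3 `pickle` module (an assumption, since the report concerns Python 2.0), and the extra sample strings are illustrative.

```python
import pickle

# The two strings from the report, plus a few similar edge cases
# (the extra samples are illustrative, not taken from the checked-in tests).
samples = ['', '\u03b1 alpha\n\u03b2 beta', '\\', '\n', '\\u000a', 'a\\nb']


def roundtrips(value, protocol):
    """Return True if value survives dumps() followed by loads()."""
    return pickle.loads(pickle.dumps(value, protocol)) == value


# Protocol 0 is the text format; protocol 1 is the original binary format.
failures = [(s, p) for s in samples for p in (0, 1) if not roundtrips(s, p)]
print(failures)
```

An empty list means every sample came back unchanged in both formats.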
-------------------------------------------------------
Date: 2000-Nov-27 14:36
By: tlau
Comment:
One more comment: binary-format pickles are not affected, only text-format pickles. Thus the part of my patch that applies to the binary section of the save_unicode function should not be applied.
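The difference between the two formats can be inspected with the standard `pickletools` module. The sketch below assumes a modern Python 3 interpreter rather than the Python 2.0 of the report, and `first_opcode` is a hypothetical helper name.

```python
import pickle
import pickletools
import struct

text = '\u03b1 alpha\n\u03b2 beta'


def first_opcode(data):
    """Return the name of the first opcode in a pickle byte string."""
    opcode, arg, pos = next(pickletools.genops(data))
    return opcode.name


text_pickle = pickle.dumps(text, protocol=0)
binary_pickle = pickle.dumps(text, protocol=1)

print(first_opcode(text_pickle))    # newline-terminated text opcode
print(first_opcode(binary_pickle))  # length-prefixed binary opcode

# In the binary form a 4-byte little-endian length follows the opcode byte,
# so the reader never has to scan the payload for a terminator.
(length,) = struct.unpack('<I', binary_pickle[1:5])
print(length == len(text.encode('utf-8')))
```

Because the binary payload is framed by its length, an embedded newline or backslash in the string cannot be mistaken for pickle syntax.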
-------------------------------------------------------
Date: 2000-Nov-27 14:35
By: lemburg
Comment:
Some background (no time to fix this myself):
When I added the Unicode handlers, I wanted to avoid the
problems that the string dump mechanism has with
quoted strings. The encodings used either carry length information
(in binary mode: UTF-8) or do not include the \n character
(in ascii mode: raw-unicode-escape encoding).
Unfortunately, the raw-unicode-escape codec does not
escape the newline character which is used by pickle
to break the input into tokens....
Proposed fix: change the encoding to "unicode-escape"
which doesn't have this problem. This will break code,
but only code that is already broken :-/
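The codec behaviour described above can be checked directly. This is a minimal sketch in modern Python 3 (an assumption; the report concerns Python 2.0), and `read_token` is a hypothetical stand-in for pickle's line-oriented reader, not its actual code.

```python
text = '\u03b1 alpha\n\u03b2 beta'

raw = text.encode('raw-unicode-escape')  # newline stays a literal byte
esc = text.encode('unicode-escape')      # newline becomes backslash + 'n'


def read_token(payload):
    """Mimic a line-oriented reader: take everything up to the first newline."""
    return (payload + b'\n').split(b'\n', 1)[0]


# With raw-unicode-escape the token is cut at the embedded newline.
print(read_token(raw).decode('raw-unicode-escape') == text)

# With unicode-escape the whole string fits on one line and decodes back.
print(read_token(esc).decode('unicode-escape') == text)
```

The first comparison fails because only the part before the embedded newline is read; the second succeeds.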
-------------------------------------------------------
Date: 2000-Nov-27 14:20
By: tlau
Comment:
Here's my proposed patch to Lib/pickle.py (cPickle should be changed similarly):
--- /scratch/tlau/Python-2.0/Lib/pickle.py Mon Oct 16 14:49:51 2000
+++ pickle.py Mon Nov 27 14:07:01 2000
@@ -286,9 +286,9 @@
             encoding = object.encode('utf-8')
             l = len(encoding)
             s = mdumps(l)[1:]
-            self.write(BINUNICODE + s + encoding)
+            self.write(BINUNICODE + `s` + encoding)
         else:
-            self.write(UNICODE + object.encode('raw-unicode-escape') + '\n')
+            self.write(UNICODE + `object.encode('raw-unicode-escape')` + '\n')
 
         memo_len = len(memo)
         self.write(self.put(memo_len))
@@ -627,7 +627,12 @@
     dispatch[BINSTRING] = load_binstring
 
     def load_unicode(self):
-        self.append(unicode(self.readline()[:-1],'raw-unicode-escape'))
+        rep = self.readline()[:-1]
+        if not self._is_string_secure(rep):
+            raise ValueError, "insecure string pickle"
+        rep = eval(rep,
+                   {'__builtins__': {}}) # Let's be careful
+        self.append(unicode(rep, 'raw-unicode-escape'))
     dispatch[UNICODE] = load_unicode
 
     def load_binunicode(self):
-------------------------------------------------------
Date: 2000-Nov-27 14:14
By: gvanrossum
Comment:
Jim, do you have time to look into this?
-------------------------------------------------------
For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=123634&group_id=5470