2to3, str, and basestring

Sat Sep 7 15:31:07 EDT 2019

2to3 converts syntactically valid 2.x code to syntactically valid 3.x 
code.  It cannot, however, guarantee semantic correctness.  A particular 
problem is that str is semantically ambiguous in 2.x, as it is used both 
for text encoded as bytes and for binary data.

To resolve the ambiguity for conversions to 3.x, 2.6 introduced 'bytes' 
as a synonym for 'str'. The intention is that one use 'bytes' to create 
or refer to 2.x bytes that should remain bytes in 3.x and use 'str' to 
create or refer to 2.x text bytes that should become or will be unicode 
in 3.x.  3.x and hence 2to3 *assume* that one is using 'bytes' and 'str' 
this way, so that 'unicode' becomes an unneeded synonym for 'str' and 
2to3 changes 'unicode' to 'str'.  If one does not use 'str' and 'bytes' 
as intended, 2to3 may produce semantically different code.

2.3 introduced the abstract superclass 'basestring', which can be viewed 
as Union(unicode, str).  "isinstance(value, basestring)" is defined as 
"isinstance(value, (unicode, str))".  I believe the intended meaning was 
'text, whether unicode or encoded bytes'.  Certainly, any code following
   if isinstance(value, basestring):
would likely only make sense if that were true.

In any case, after 2.6, one should only use 'basestring' when the 'str' 
part has its restricted meaning of 'unicode in 3.x'.  "(unicode, bytes)" 
is semantically different from "basestring" and "(unicode, str)" when 
used in isinstance.  2to3 converts them to "(str, bytes)", 'str', and 
'(str, str)' respectively (the last being the same as 'str' when used in 
isinstance).  If one uses 'basestring' when one means '(unicode, bytes)', 
2to3 may produce semantically different code.
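
A small 3.x check of the three converted forms may make the difference 
concrete.  This is only a sketch; the sample values 'text' and 'data' are 
made up for illustration.

```python
# Python 3: what the converted isinstance checks accept.
text = "spam"    # 3.x str (unicode text)
data = b"spam"   # 3.x bytes (binary data or byte-encoded text)

# 'basestring' and '(unicode, str)' both end up as a str-only check,
# so bytes no longer pass.
assert isinstance(text, str)
assert not isinstance(data, str)
assert isinstance(data, (str, str)) == isinstance(data, str)

# '(unicode, bytes)' becomes '(str, bytes)', which still accepts bytes.
assert isinstance(text, (str, bytes))
assert isinstance(data, (str, bytes))
```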

Example based on https://bugs.python.org/issue38003:

if isinstance(value, basestring):
     if not isinstance(value, unicode):
         value = value.decode(encoding)
     process_text(value)
else:
     process_nontext(value)

2to3 produces

if isinstance(value, str):
     if not isinstance(value, str):
         value = value.decode(encoding)
     process_text(value)
else:
     process_nontext(value)

If, in 3.x, value is always unicode, then the inner conditional is dead 
and can be removed.  But if, in 3.x, value might be byte-encoded text, 
it will not be decoded and the code is wrong.  Fixes:

1. Instead of decoding value after the check, do it before the check.  I 
think this is best for new code.

if isinstance(value, bytes):
     value = value.decode(encoding)
...
if isinstance(value, unicode):
     process_text(value)
else:
     process_nontext(value)

2. Replace 'basestring' with '(unicode, bytes)'.  This is easier with 
existing code.

if isinstance(value, (unicode, bytes)):
     if not isinstance(value, unicode):
         value = value.decode(encoding)
     process_text(value)
else:
     process_nontext(value)

I believe, but have not tested, that 2to3 produces correct 3.x code from 
either 1 or 2, as its 'unicode' to 'str' replacement should be sufficient 
in both cases.

3. Edit Lib/lib2to3/fixes/fix_basestring.py to replace 'basestring' with 
'(str, bytes)' instead of 'str'.  This should be straightforward if one 
understands lib2to3's parse tree format.
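
To compare the outcomes under 3.x, here is a sketch of the code that the 
default conversion yields next to what fixes 1 and 2 (or 3) should yield.  
The bodies of process_text and process_nontext and the 'utf-8' encoding 
are stand-ins invented for the demonstration; they are not from the 
original report.

```python
# Python 3 sketch; helper bodies and the encoding are hypothetical.
encoding = "utf-8"

def process_text(value):
    return ("text", value)

def process_nontext(value):
    return ("nontext", value)

def converted_default(value):
    # What 2to3 produces from the 'basestring' original.
    if isinstance(value, str):
        if not isinstance(value, str):   # never true: dead code
            value = value.decode(encoding)
        return process_text(value)
    else:
        return process_nontext(value)

def converted_fix1(value):
    # Fix 1 after 'unicode' -> 'str': decode before the check.
    if isinstance(value, bytes):
        value = value.decode(encoding)
    if isinstance(value, str):
        return process_text(value)
    else:
        return process_nontext(value)

def converted_fixed(value):
    # Fix 2 or 3: 'basestring' ends up as '(str, bytes)'.
    if isinstance(value, (str, bytes)):
        if not isinstance(value, str):
            value = value.decode(encoding)
        return process_text(value)
    else:
        return process_nontext(value)

encoded = "caf\u00e9".encode(encoding)

# Byte-encoded text is misrouted by the default conversion ...
assert converted_default(encoded) == ("nontext", encoded)
# ... but is decoded and handled as text with either fix.
assert converted_fix1(encoded) == ("text", "caf\u00e9")
assert converted_fixed(encoded) == ("text", "caf\u00e9")
# Unicode text and non-string values are handled alike by all three.
assert converted_default("abc") == converted_fixed("abc") == ("text", "abc")
assert converted_default(42) == converted_fix1(42) == ("nontext", 42)
```

Note that both fixes treat every bytes value as encoded text, so they only 
help when bytes reaching this code really are text.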

Note that 2to3 is not meant for 2&3 code using exception tricks and 
six/future imports.  Turning 2&3 code into idiomatic 3-only code is a 
separate subject.

-- 
Terry Jan Reedy