[Tutor] Weird Unicode encode/decode errors in Python 2

Steven D'Aprano steve at pearwood.info
Sat Dec 8 19:02:47 EST 2018


This is not a request for help, but a demonstration of what can go wrong 
with text processing in Python 2.

Following up on the "Special characters" thread, one of the design flaws 
of Python 2 is that byte strings and text strings offer BOTH decode and 
encode methods, even though only one is meaningful in each case.[1]

- text strings are ENCODED to bytes;
- byte strings are DECODED to text strings.

One of the symptoms of getting it wrong is when you take a Unicode text 
string and encode/decode it but get an error from the *opposite* 
operation:


py> u'ä'.decode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in 
position 0: ordinal not in range(128)


Look at what happens: I try to DECODE a string, but get an ENCODE error. 
And even though I specified Latin 1 as the codec, Python uses ASCII. 
What is going on here?

Behind the scenes, the interpreter takes my text u'ä' (a Unicode string) 
and attempts to *encode* it to bytes first, using the default ASCII 
codec. That fails. Had it succeeded, it would have then attempted to 
*decode* those bytes using Latin 1.
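
As a rough illustration, the two hidden steps can be written out by hand. 
This is only a sketch in Python 3 (where the implicit step no longer 
exists), and py2_unicode_decode is a hypothetical helper name, not a real 
API:

```python
# Python 3 sketch of the two hidden steps behind u'ä'.decode('latin1')
# in Python 2. py2_unicode_decode is a hypothetical helper, not a real API.

def py2_unicode_decode(text, codec, default='ascii'):
    raw = text.encode(default)   # step 1: implicit encode, default codec
    return raw.decode(codec)     # step 2: the decode that was asked for

try:
    py2_unicode_decode('\xe4', 'latin1')
    outcome = 'no error'
except UnicodeEncodeError as err:
    # Step 1 fails, so the Latin 1 decode in step 2 never runs.
    outcome = '%s from the %r codec' % (type(err).__name__, err.encoding)

print(outcome)
```

The exception comes from the ASCII codec in step 1, which is why the 
codec named in the call never appears in the error message.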

Similarly:

py> b = u'ä'.encode('latin1')
py> print repr(b)
'\xe4'
py> b.encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: 
ordinal not in range(128)


The error here is that I tried to encode a bunch of bytes, instead of 
decoding them. But the insidious thing about this error is that if you 
are working with pure ASCII, it seems to work:

py> 'ascii'.encode('utf-16')
'\xff\xfea\x00s\x00c\x00i\x00i\x00'

That is, it *seems* to work because there's no error, but the result is 
pretty much meaningless: I *intended* to get a UTF-16 Unicode string, 
but instead I ended up with bytes just like I started with.
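
The same kind of sketch covers the reverse mistake. Again this is 
Python 3 code that spells out the implicit step by hand, and 
py2_bytes_encode is a hypothetical helper name used only for 
illustration:

```python
# Python 3 sketch of what Python 2 does when .encode() is called on a
# byte string. py2_bytes_encode is a hypothetical helper, not a real API.

def py2_bytes_encode(data, codec, default='ascii'):
    text = data.decode(default)  # step 1: implicit decode, default codec
    return text.encode(codec)    # step 2: the encode that was asked for

# Pure ASCII bytes survive step 1, so the mistake goes unnoticed.
utf16 = py2_bytes_encode(b'ascii', 'utf-16')

# A single non-ASCII byte makes step 1 fail with a *decode* error.
try:
    py2_bytes_encode(b'\xe4', 'latin1')
    failure = None
except UnicodeDecodeError as err:
    failure = (type(err).__name__, err.encoding)

print(repr(utf16))
print(failure)
```

Whether the call fails or not depends entirely on the data, which is 
what makes this so easy to miss in testing.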

Python 3 fixes this bug magnet by removing the decode method from 
Unicode text strings, and the encode method from byte-strings.
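
A short check, intended to be run under Python 3, shows the effect: the 
wrong-direction call now fails immediately with AttributeError, whatever 
the data happens to be. The variable names are just for illustration:

```python
# Python 3: text has only .encode(), bytes has only .decode().
text = '\xe4'                    # a text string: LATIN SMALL LETTER A WITH DIAERESIS
data = text.encode('latin1')     # text -> bytes, the meaningful direction

str_has_decode = hasattr(text, 'decode')
bytes_has_encode = hasattr(data, 'encode')

try:
    text.decode('latin1')        # the mistake from the first example
    mistake = 'silently accepted'
except AttributeError:
    mistake = 'rejected with AttributeError'

print(str_has_decode, bytes_has_encode, mistake)
```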



[1] Technically this is not so, as there are codecs which can be used to 
convert bytes to bytes, or text to text. But in the vast majority of 
common cases, codecs are used to convert bytes to text and vice versa. 
For the rare exception, we can use the "codecs" module.
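
For example, a minimal Python 3 sketch using two of the standard 
library's non-text codecs, "hex_codec" (bytes to bytes) and "rot_13" 
(text to text), through the module-level functions:

```python
import codecs

# bytes -> bytes: hexadecimal representation of the raw bytes
hexed = codecs.encode(b'hello', 'hex_codec')
unhexed = codecs.decode(hexed, 'hex_codec')

# text -> text: the ROT13 letter substitution
rotated = codecs.encode('hello', 'rot_13')

print(hexed, unhexed, rotated)
```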


-- 
Steve

