Python 3.0 and repr

Mark Tolonen M8R-yfto6h at mailinator.com
Sun Sep 28 13:13:31 EDT 2008


I don't understand the behavior of the interpreter in Python 3.0.  I am 
working at a command prompt in Windows (US English), which has a terminal 
encoding of cp437.

In Python 2.5:

    Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> x=u'\u5000'
    >>> x
    u'\u5000'

In Python 3.0:

    Python 3.0rc1 (r30rc1:66507, Sep 18 2008, 14:47:08) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> x='\u5000'
    >>> x
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "c:\dev\python30\lib\io.py", line 1486, in write
        b = encoder.encode(s)
      File "c:\dev\python30\lib\encodings\cp437.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u5000' in position 1: character maps to <undefined>

Where I would have expected

    >>> x
    '\u5000'
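
The traceback above points at the console write (io.py) rather than at repr() itself. A minimal sketch, assuming only that the interactive prompt encodes the repr text with the terminal codec (cp437 here) before writing it, reproduces the same error on any machine:

```python
x = '\u5000'

# repr() itself succeeds and returns a three-character str:
# opening quote, U+5000, closing quote.
text = repr(x)

# Encoding that text with the terminal codec is the step that fails.
failure = None
try:
    text.encode('cp437')
except UnicodeEncodeError as exc:
    failure = exc

print(ascii(text))   # ASCII-only view of the repr text
print(failure)       # the 'charmap' error, raised without any terminal involved
```

The position reported in the error is the index of U+5000 within the repr text, just after the opening quote.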

Shouldn't a repr() of x work regardless of output encoding?  Another test:

    Python 3.0rc1 (r30rc1:66507, Sep 18 2008, 14:47:08) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> bytes(range(256)).decode('cp437')
    '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7fÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒáíóúñÑªº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■\xa0'
    >>> bytes(range(256)).decode('cp437')[255]
    '\xa0'

Characters that cannot be displayed in cp437, such as 0x00-0x1F, 0x7F, and 
0xA0, are escaped.  Even if I decode a byte with the wrong codec, the 
resulting character is displayed as long as it exists in cp437:

    >>> bytes(range(256)).decode('latin-1')[255]
    'ÿ'
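
The set of escaped characters appears to track str.isprintable() rather than the terminal encoding. The sketch below rests on that assumption about current Python 3 behavior, which is not stated anywhere in this post; `escaped_by_repr` is a hypothetical helper name used only for illustration.

```python
def escaped_by_repr(ch):
    """Return True if repr() shows ch as an escape rather than as itself."""
    return repr(ch) != "'" + ch + "'"

# Control characters, DEL, no-break space, and three printable characters.
samples = ['\x00', '\x1f', '\x7f', '\xa0', '\xff', '\xfe', '\u25a0']

for ch in samples:
    # Columns: ASCII-only name of the character, escaped by repr()?, printable?
    print(ascii(ch), escaped_by_repr(ch), ch.isprintable())
```

Under that assumption, U+00FE is printable, so repr() leaves it unescaped and it reaches the console codec as-is.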

However, a character that cp437 does not support, again produced by decoding 
with the wrong codec, raises an exception instead of being escaped:

    >>> bytes(range(256)).decode('cp437')[254]
    '■'
    >>> bytes(range(256)).decode('latin-1')[254]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "c:\dev\python30\lib\io.py", line 1486, in write
        b = encoder.encode(s)
      File "c:\dev\python30\lib\encodings\cp437.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\xfe' in position 1: character maps to <undefined>

Why not display '\xfe' here?  This inconsistency seems to make it difficult 
to write things like doctests that don't depend on the tester's terminal.  
It also makes it difficult to inspect variables without calling hex(ord(n)) 
on each character.  Maybe repr() should always display the ASCII 
representation, with escapes for all other characters, especially 
considering the "repr() should produce output suitable for eval() when 
possible" rule.
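
For terminal-independent output, two existing tools may help: Python 3's built-in ascii(), which escapes every non-ASCII character, and the 'backslashreplace' error handler, which escapes only what a given codec cannot encode. The sketch below is one possible workaround, not a statement of how the interpreter behaves; `console_safe` is a hypothetical helper name, and cp437 is assumed as the terminal encoding.

```python
def console_safe(value, encoding='cp437'):
    """Return repr(value) with characters the codec cannot encode escaped."""
    data = repr(value).encode(encoding, errors='backslashreplace')
    return data.decode(encoding)

x = '\u5000'

print(ascii(x))              # ASCII-only representation of x
print(eval(ascii(x)) == x)   # the ASCII form round-trips through eval()
print(console_safe(x))       # escaped, because cp437 cannot encode U+5000
print(console_safe('\xff'))  # left as-is, because cp437 can encode U+00FF
```

ascii() gives the same text on every terminal, which suits doctests; the backslashreplace variant keeps displayable characters readable at the cost of depending on the chosen encoding.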

What are others' opinions?  Any insight into this design decision?

-Mark




