[Python-ideas] Add str.bmp() to only expand non-BMP chars, for tkinter use

Terry Reedy tjreedy at udel.edu
Mon Mar 16 00:09:39 CET 2015


Python 3.x comes with the builtin ascii(obj), which returns a string 
representation of obj that contains only ASCII characters.

 >>> s = 'a\xaa\ua000\U0001a000'
 >>> len(s)
4
 >>> sa = ascii(s)
 >>> sa, len(sa)
("'a\\xaa\\ua000\\U0001a000'", 23)

This allows any string to be printed on even a minimal ASCII terminal.

Python also comes with the tkinter interface to tk.  Tk widgets are not 
limited to ASCII but support the full BMP (Basic Multilingual Plane, 
U+0000 through U+FFFF) subset of Unicode.  (This is better than Windows 
consoles, which are limited by codepages.)  Thus, for use with tkinter, 
ascii() has two faults: it adds a quote at the beginning and end of the 
string (like repr), and it expands too much, escaping BMP chars that Tk 
can display.

I looked at repr, which expands less, but it seems to be buggy in that 
it is not consistent in its handling of non-BMP chars.

 >>> s1 = 'a\xaa\ua000\U00011000'
 >>> sa = 'a\xaa\ua000\U0001a000'
 >>> s1r = repr(s1); len(s1r)
6
 >>> sar = repr(sa); len(sar)
15

'\U0001a000' gets expanded, and can be printed.
'\U00011000' does not, and cannot be consistently printed.

 >>> s1r  # only works at >>> prompt
"'a\xaa\ua000\U00011000'"
 >>> print(s1r)  # required in programs
Traceback (most recent call last):
   File "<pyshell#43>", line 1, in <module>
     print(s1r)
   File "C:\Programs\Python34\lib\idlelib\PyShell.py", line 1347, in write
     return self.shell.write(s, self.tags)
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 
4-4: Non-BMP character not supported in Tk

Printing s1 or sa directly, by either means, gives the same error. 
(Since '>>> expr'  is supposed to be the same as 'print(expr)' the above 
difference puzzles me.)

Even if repr always worked as it does for '\U0001a000', there would 
still be the problem of the added quotes.  I therefore propose the 
addition of a new str method, such as 's.bmp()', that returns s with all 
non-BMP chars, and only such chars, expanded.  Since strings (in 
CPython) are internally marked by 'kind', the method would just return s 
when no expansion is needed.  I presume it could otherwise re-use the 
expansion code already in repr.
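
A rough pure-Python sketch of the behaviour described above, for 
illustration only.  The function name bmp() and the \UXXXXXXXX output 
format are assumptions modelled on what repr produces for '\U0001a000'; 
the proposal itself is a str method that would check the internal 
'kind' and reuse repr's expansion code, which this sketch does not do.

```python
def bmp(s):
    """Return s with each non-BMP character replaced by a \\UXXXXXXXX escape.

    Hypothetical stand-in for the proposed str.bmp(); not an existing API.
    """
    # Fast path, standing in for a check of the internal 'kind':
    # if no code point is above U+FFFF, return the very same object.
    if not s or max(s) <= '\uffff':
        return s
    # Otherwise expand only the characters above U+FFFF.
    return ''.join(
        c if c <= '\uffff' else '\\U%08x' % ord(c)
        for c in s
    )


s = 'a\xaa\ua000\U0001a000'
sb = bmp(s)
print(len(s), len(sb))  # 4 13
```

Unlike ascii() and repr(), this adds no quotes and leaves '\xaa' and 
'\ua000' alone.  Note that the result is not reversible in general: a 
string that already contains a literal backslash-U sequence is passed 
through unchanged.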

Aside from tkinter programmers in general, this issue bites Idle in at 
least two ways.  Internally, filenames can contain non-BMP chars and 
Idle displays them in 3 places.
See http://bugs.python.org/issue23672.
Externally, Idle users sometimes want to print strings with non-BMP 
chars.  I believe the automatic use of .bmp() with console prints could 
be user selectable.  There have been issues about this on both our 
tracker and StackOverflow.
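
One way the user-selectable console behaviour might look, as a sketch 
under stated assumptions: BmpWriter and its 'expand' flag are 
hypothetical names, an io.StringIO stands in for Idle's Tk-backed 
output stream, and the escaping is done with a regular expression 
rather than the proposed str method.

```python
import io
import re

# Matches any single character outside the BMP (above U+FFFF).
_NON_BMP = re.compile('[\U00010000-\U0010ffff]')


class BmpWriter:
    """Hypothetical stream wrapper that escapes non-BMP chars on write."""

    def __init__(self, stream, expand=True):
        self.stream = stream
        self.expand = expand  # the user-selectable switch

    def write(self, text):
        if self.expand:
            # Replace each non-BMP char with its \UXXXXXXXX escape.
            text = _NON_BMP.sub(lambda m: '\\U%08x' % ord(m.group()), text)
        return self.stream.write(text)

    def flush(self):
        self.stream.flush()


out = io.StringIO()  # stands in for a stream that cannot take non-BMP chars
print('a\U00011000', file=BmpWriter(out))
print(out.getvalue())
```

With expand=False the text is passed through untouched, so the current 
behaviour would remain available.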

I believe that the use of non-BMP chars is becoming more common and can 
no longer be simply dismissed as too rare to worry about.  Telling 
Windows users that they are better off with Idle than if they use Python 
directly, with the Windows console, does not solve the inability to 
print any Python string.  This proposal would.

-- 
Terry Jan Reedy


