[Python-ideas] Add str.bmp() to only expand non-BMP chars, for tkinter use

Andrew Barnert abarnert at yahoo.com
Mon Mar 16 06:45:19 CET 2015


It seems like what you really want here is a new codec, like unicode-escape but only escaping non-BMP characters, not a new repr-like function.
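
A rough sketch of what I mean, not a worked-out design. The helper name escape_non_bmp and the codec name 'bmpescape' are made up for illustration; nothing like them exists in the stdlib. Because the result is str rather than bytes, it has to go through codecs.encode() instead of str.encode(). Note that it leaves existing backslashes alone, so the output is for display only and cannot be reliably reversed.

```python
import codecs
import re

# Matches any code point outside the Basic Multilingual Plane.
_NON_BMP = re.compile('[\U00010000-\U0010ffff]')

def escape_non_bmp(s):
    """Return s with each non-BMP character replaced by a backslash-U escape."""
    return _NON_BMP.sub(lambda m: '\\U%08x' % ord(m.group()), s)

def _encode(s, errors='strict'):
    # Codec encoders return (output, number of input items consumed).
    return escape_non_bmp(s), len(s)

def _search(name):
    # 'bmpescape' is a hypothetical codec name used only in this sketch.
    if name == 'bmpescape':
        return codecs.CodecInfo(encode=_encode, decode=None, name='bmpescape')
    return None

codecs.register(_search)

s = 'a\xaa\ua000\U0001a000'
escaped = codecs.encode(s, 'bmpescape')
```

Here escaped keeps the three BMP characters as they are and expands only the last one, with no added quotes.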

On Mar 15, 2015, at 4:09 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> 
> 3.x comes with builtin ascii(obj) to return a string representation of obj that only has ascii characters.
> 
> >>> s = 'a\xaa\ua000\U0001a000'
> >>> len(s)
> 4
> >>> sa = ascii(s)
> >>> sa, len(sa)
> ("'a\\xaa\\ua000\\U0001a000'", 23)
> 
> This allows any string to be printed on even a minimal ascii terminal.
> 
> Python also comes with the tkinter interface to tk.  Tk widgets are not limited to ascii but support the full BMP subset of unicode.  (This is better than Windows consoles, which are limited by codepages.)  Thus, for use with tkinter, ascii() has two faults: it adds a quote at the beginning and end of the string (like repr), and it expands too much.
> 
> I looked at repr, which expands less, but it seems to be buggy in that it is not consistent in its handling of non-BMP chars.
> 
> >>> s1 = 'a\xaa\ua000\U00011000'
> >>> sa = 'a\xaa\ua000\U0001a000'
> >>> s1r = repr(s1); len(s1r)
> 6
> >>> sar = repr(sa); len(sar)
> 15
> 
> '\U0001a000' gets expanded, and can be printed.
> '\U00011000' does not, and cannot be consistently printed.
> 
> >>> s1r  # only works at >>> prompt
> "'a\xaa\ua000\U00011000'"
> >>> print(s1r)  # required in programs
> Traceback (most recent call last):
>  File "<pyshell#43>", line 1, in <module>
>    print(s1r)
>  File "C:\Programs\Python34\lib\idlelib\PyShell.py", line 1347, in write
>    return self.shell.write(s, self.tags)
> UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 4-4: Non-BMP character not supported in Tk
> 
> Printing s1 or sa directly, by either means, gives the same error. (Since '>>> expr' is supposed to be the same as 'print(expr)', the above difference puzzles me.)
> 
> Even if repr always worked as it does for '\U0001a000', there would still be the problem of the added quotes.  I therefore propose the addition of a new str method, such as 's.bmp()', that returns s with all non-BMP chars, and only such chars, expanded.  Since strings (in CPython) are internally marked by 'kind', the method would just return s when no expansion is needed.  I presume it could otherwise re-use the expansion code already in repr.
> 
> Aside from tkinter programmers in general, this issue bites Idle in at least two ways.  Internally, filenames can contain non-BMP chars and Idle displays them in 3 places.
> See http://bugs.python.org/issue23672.
> Externally, Idle users sometimes want to print strings with non-BMP chars.  I believe the automatic use of .bmp() with console prints could be user selectable.  There have been issues about this on both our tracker and StackOverflow.
> 
> I believe that the use of non-BMP chars is becoming more common and can no longer be simply dismissed as too rare to worry about.  Telling Windows users that they are better off with Idle than if they used Python directly, with the Windows console, does not solve the inability to print any Python string.  This proposal would.
> 
> -- 
> Terry Jan Reedy
> 

