Making IDLE3 ignore non-BMP characters instead of throwing an exception?

eryk sun eryksun at gmail.com
Mon Oct 17 14:20:28 EDT 2016


On Mon, Oct 17, 2016 at 2:20 PM, Adam Funk <a24061 at ducksburg.com> wrote:
> I'm using IDLE 3 (with python 3.5.2) to work interactively with
> Twitter data, which of course contains emojis.  Whenever the running
> program tries to print the text of a tweet with an emoji, it barfs
> this & stops running:
>
>   UnicodeEncodeError: 'UCS-2' codec can't encode characters in
>   position 102-102: Non-BMP character not supported in Tk
>
> Is there any way to set IDLE to ignore these characters (either drop
> them or replace them with something else) instead of throwing the
> exception?
>
> If not, what's the best way to strip them out of the string before
> printing?

You can patch print() to transcode non-BMP characters as surrogate
pairs. For example:

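    import builtins

    def print_ucs2(*args, print=builtins.print, **kwds):
        # The original print is bound as a default, so the patch can't recurse.
        args2 = []
        for a in args:
            a = str(a)
            # Skip empty strings: max('') raises ValueError.
            if a and max(a) > '\uffff':
                # Split the UTF-16 encoding into 2-byte code units, so each
                # non-BMP character becomes a surrogate pair.
                b = a.encode('utf-16le', 'surrogatepass')
                chars = [b[i:i+2].decode('utf-16le', 'surrogatepass')
                         for i in range(0, len(b), 2)]
                a = ''.join(chars)
            args2.append(a)
        print(*args2, **kwds)

    builtins._print = builtins.print  # keep the original reachable
    builtins.print = print_ucs2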
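
To make the transcoding step concrete, here is a minimal sketch of the
same mapping done arithmetically for a single character. The helper
name to_surrogate_pair is only for illustration; it isn't part of the
patch above.

```python
def to_surrogate_pair(ch):
    """Return the UTF-16 surrogate pair for one non-BMP character."""
    cp = ord(ch) - 0x10000             # 20-bit offset above the BMP
    high = 0xD800 + (cp >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (cp & 0x3FF)        # bottom 10 bits -> low surrogate
    return chr(high) + chr(low)

pair = to_surrogate_pair('\U0001f44c')
print(ascii(pair))
```

The result is a 2-character str made of lone surrogates, which is what
the encode/decode round trip in print_ucs2 produces for each non-BMP
character.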

On Windows this should allow printing non-BMP characters such as
emojis (e.g. U+0001F44C). On Linux it prints a non-BMP character as a
pair of empty boxes. If you're not using Windows you can modify this
to print something else for non-BMP characters, such as a replacement
character or \U literals.
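
As a rough sketch of that last suggestion, something like the
following could replace the non-BMP characters before printing. The
names replace_non_bmp and escape_non_bmp are made up for this example,
and I'm assuming a plain regex substitution is good enough here.

```python
import re

# Matches any single code point outside the Basic Multilingual Plane.
_NON_BMP = re.compile('[\U00010000-\U0010ffff]')

def replace_non_bmp(text, replacement='\ufffd'):
    """Swap each non-BMP character for a replacement string."""
    return _NON_BMP.sub(replacement, text)

def escape_non_bmp(text):
    """Show each non-BMP character as a \\UXXXXXXXX literal."""
    return _NON_BMP.sub(
        lambda m: '\\U{:08x}'.format(ord(m.group())), text)

tweet = 'ok \U0001f44c'
print(replace_non_bmp(tweet))
print(escape_non_bmp(tweet))
```

Passing replacement='' to replace_non_bmp drops the characters
entirely, which covers the "strip them out" case.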


