Performance of int/long in Python 3

Mon Apr 1 08:15:53 EDT 2013

In article <515941d8$0$29967$c3e8da3$5496439d at news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:

> [...]
> >> OK, that leads to the next question.  Is there any way I can (in Python
> >> 2.7) detect when a string is not entirely in the BMP?  If I could find
> >> all the non-BMP characters, I could replace them with U+FFFD
> >> (REPLACEMENT CHARACTER) and life would be good (enough).
> 
> Of course you can do this, but you should not. If your input data 
> includes character C, you should deal with character C and not just throw 
> it away unnecessarily. That would be rude, and in Python 3.3 it should be 
> unnecessary.

The import job isn't done yet, but so far we've processed 116 million 
records and had to clean up four of them.  I can live with that.  
Sometimes practicality trumps correctness.

It turns out, the problem is that the version of MySQL we're using 
doesn't support non-BMP characters.  Newer versions do (but you have to 
declare the column to use the utf8mb4 character set).  I could upgrade 
to a newer MySQL version, but it's just not worth it.

Actually, I did try spinning up a 5.5 instance (one of the nice things 
about being in the cloud) and experimented with that, but couldn't get it 
to work there either.  I'll admit that I didn't invest a huge amount of 
effort to make that work before just writing this:

    def bmp_filter(self, s):
        """Filter a unicode string to remove all non-BMP (basic
         multilingual plane) characters.  All such characters are
         replaced with U+FFFD (Unicode REPLACEMENT CHARACTER).

         """
        if all(ord(c) <= 0xffff for c in s):
            return s
        else:
            self.logger.warning("making %r BMP-clean", s)
            bmp_chars = [(c if ord(c) <= 0xffff else u'\ufffd')
                         for c in s]
            return ''.join(bmp_chars)
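
For comparison, the same filtering can probably be done in a single regex pass. This is only a sketch, not the code running in the import job: it assumes Python 3.3+ (or a wide build of 2.7 with the literals adjusted), where a non-BMP character is one code point rather than a surrogate pair, and the names `NON_BMP` and `bmp_clean` are made up for illustration.

```python
import re

# Hypothetical name: matches any single code point outside the BMP
# (U+10000 through U+10FFFF). Assumes one code point per character,
# which holds on Python 3.3+.
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')


def bmp_clean(s, replacement='\ufffd'):
    """Return s with every non-BMP character replaced.

    The default replacement is U+FFFD (REPLACEMENT CHARACTER),
    matching what bmp_filter() above does.
    """
    return NON_BMP.sub(replacement, s)
```

Something like `bmp_clean(u'abc')` should hand back the string unchanged, while a string containing, say, an emoji would come back with U+FFFD in its place. Logging is left out here; the method above logs the original string before cleaning it.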