Performance of int/long in Python 3
Roy Smith
roy at panix.com
Mon Apr 1 08:15:53 EDT 2013
In article <515941d8$0$29967$c3e8da3$5496439d at news.astraweb.com>,
Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:
> [...]
> >> OK, that leads to the next question. Is there any way I can (in Python
> >> 2.7) detect when a string is not entirely in the BMP? If I could find
> >> all the non-BMP characters, I could replace them with U+FFFD
> >> (REPLACEMENT CHARACTER) and life would be good (enough).
>
> Of course you can do this, but you should not. If your input data
> includes character C, you should deal with character C and not just throw
> it away unnecessarily. That would be rude, and in Python 3.3 it should be
> unnecessary.
The import job isn't done yet, but so far we've processed 116 million
records and had to clean up four of them. I can live with that.
Sometimes practicality trumps correctness.
It turns out, the problem is that the version of MySQL we're using
doesn't support non-BMP characters. Newer versions do (but you have to
declare the column to use the utf8mb4 character set). I could upgrade
to a newer MySQL version, but it's just not worth it.
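For anyone wondering why the character set matters: as far as I understand it, MySQL's older utf8 character set stores at most three bytes per character, and anything outside the BMP needs four bytes in UTF-8. A quick sketch to see the difference (the helper name utf8_len is just something made up for illustration):

```python
def utf8_len(ch):
    """Return the number of bytes ch occupies when encoded as UTF-8."""
    return len(ch.encode('utf-8'))

bmp_char = u'\u20ac'         # EURO SIGN, inside the BMP
astral_char = u'\U0001F600'  # a code point outside the BMP

print(utf8_len(bmp_char))     # fits in a 3-byte-per-character column
print(utf8_len(astral_char))  # needs 4 bytes, hence utf8mb4
```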
Actually, I did try spinning up a MySQL 5.5 instance (one of the nice things
about being in the cloud) and experimented with that, but couldn't get it
to work there either. I'll admit that I didn't invest a huge amount of
effort to make that work before just writing this:

    def bmp_filter(self, s):
        """Filter a unicode string to remove all non-BMP (basic
        multilingual plane) characters.  All such characters are
        replaced with U+FFFD (Unicode REPLACEMENT CHARACTER).
        """
        if all(ord(c) <= 0xffff for c in s):
            return s
        else:
            self.logger.warning("making %r BMP-clean", s)
            bmp_chars = [(c if ord(c) <= 0xffff else u'\ufffd') for c in s]
            return ''.join(bmp_chars)
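
The same idea can be sketched as a standalone function with a regular expression. This is only an illustration, not what the import job runs: the names _NON_BMP and bmp_clean are made up here, and it assumes Python 3.3+ or a wide build of Python 2.7. On a narrow 2.7 build, non-BMP characters show up as surrogate pairs, so neither this pattern nor a per-character ord() test would see them as single characters.

```python
import re

# Matches any single code point above U+FFFF, i.e. outside the BMP.
# Assumes strings are indexed by code point (Python 3.3+ or a wide 2.7 build).
_NON_BMP = re.compile(u'[\U00010000-\U0010ffff]')

def bmp_clean(s):
    """Replace every non-BMP character in s with U+FFFD."""
    return _NON_BMP.sub(u'\ufffd', s)

# A string that is already BMP-only comes back unchanged.
print(bmp_clean(u'plain text') == u'plain text')
# One non-BMP character becomes one REPLACEMENT CHARACTER.
print(bmp_clean(u'a\U0001F600b') == u'a\ufffdb')
```

Unlike the method above, this version does not log anything, so you would lose the record of which strings were altered.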