Performance of int/long in Python 3

MRAB python at mrabarnett.plus.com
Mon Apr 1 13:53:44 EDT 2013


On 01/04/2013 18:07, Steven D'Aprano wrote:
> On Mon, 01 Apr 2013 08:15:53 -0400, Roy Smith wrote:
>
>> In article <515941d8$0$29967$c3e8da3$5496439d at news.astraweb.com>,
>>  Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:
>>
>>> [...]
>>> >> OK, that leads to the next question.  Is there any way I can (in
>>> >> Python 2.7) detect when a string is not entirely in the BMP?  If I
>>> >> could find all the non-BMP characters, I could replace them with
>>> >> U+FFFD (REPLACEMENT CHARACTER) and life would be good (enough).
>>>
>>> Of course you can do this, but you should not. If your input data
>>> includes character C, you should deal with character C and not just
>>> throw it away unnecessarily. That would be rude, and in Python 3.3 it
>>> should be unnecessary.
>>
>> The import job isn't done yet, but so far we've processed 116 million
>> records and had to clean up four of them.  I can live with that.
>> Sometimes practicality trumps correctness.
>
> Well, true. It has to be said that few programming languages (and
> databases) make it easy to do the right thing. On the other hand, you're
> a programmer. Your job is to write correct code, not easy code.
>
>
>> It turns out, the problem is that the version of MySQL we're using
>
> Well there you go. Why don't you use a real database?
>
> http://www.postgresql.org/docs/9.2/static/multibyte.html
>
> :-)
>
> Postgresql has supported non-broken UTF-8 since at least version 8.1.
>
>
>> doesn't support non-BMP characters.  Newer versions do (but you have to
>> declare the column to use the utf8mb4 character set).  I could upgrade
>> to a newer MySQL version, but it's just not worth it.
>
> My brain just broke. So-called "UTF-8" in MySQL only includes up to a
> maximum of three-byte characters. There has *never* been a time where
> UTF-8 excluded four-byte characters. What were the developers thinking,
> arbitrarily cutting out support for 50% of UTF-8?
>
[snip]
50%? The BMP is only one of the 17 planes, so wouldn't the excluded
share be closer to 94% (16 of 17)?
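
As a quick check of that figure, here is a minimal sketch of the arithmetic. It assumes the share is measured over the whole Unicode code space (17 planes of 0x10000 code points each, surrogates included), not over assigned characters; the variable names are only illustrative.

```python
# Unicode code space: 17 planes of 65,536 code points each.
PLANES = 17
PLANE_SIZE = 0x10000

total = PLANES * PLANE_SIZE      # U+0000 .. U+10FFFF
non_bmp = total - PLANE_SIZE     # everything above U+FFFF (4-byte UTF-8)

share = non_bmp / total          # equals 16/17
print(format(share, '.1%'))      # 94.1%
```

Counted by characters actually assigned, or by characters that turn up in real data, the proportion would look quite different.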

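For the earlier question about finding non-BMP characters and swapping in U+FFFD, something along these lines might do. This is only a sketch and assumes Python 3.3 or later, where every element of a str is a full code point; the helper names has_non_bmp and replace_non_bmp are made up for illustration. On a narrow Python 2.7 build, non-BMP characters are stored as surrogate pairs, so the pattern would presumably need to match a high surrogate followed by a low surrogate instead.

```python
import re

# Anything above U+FFFF lies outside the Basic Multilingual Plane.
_NON_BMP = re.compile('[\U00010000-\U0010FFFF]')


def has_non_bmp(text):
    """Return True if text contains at least one non-BMP character."""
    return _NON_BMP.search(text) is not None


def replace_non_bmp(text, replacement='\ufffd'):
    """Replace each non-BMP character with U+FFFD (or another string)."""
    return _NON_BMP.sub(replacement, text)


sample = 'caf\xe9 \U0001F600 ok'
print(has_non_bmp(sample))            # True
print(ascii(replace_non_bmp(sample))) # 'caf\xe9 \ufffd ok'
```

Characters inside the BMP, such as the e-acute above, pass through untouched; only code points that need four bytes in UTF-8 are replaced.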


