Performance of int/long in Python 3

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Apr 1 04:14:16 EDT 2013


On Sun, 31 Mar 2013 22:33:45 -0700, rusi wrote:

> On Mar 31, 5:55 pm, Mark Lawrence <breamore... at yahoo.co.uk> wrote:
> 
> <snipped jmf's broken-record whine>
> 
>> I'm feeling very sorry for this horse, it's been flogged so often it's
>> down to bare bones.
> 
> While I am now joining the camp of those fed up with jmf's whining, I do
> wonder if we are shooting the messenger…

No. The trouble is that the messenger is shouting that the Unicode world 
is ending on December 21st 2012, and hasn't noticed that was over three 
months ago and the world didn't end.

 
[...]
>> OK, that leads to the next question.  Is there anyway I can (in Python
>> 2.7) detect when a string is not entirely in the BMP?  If I could find
>> all the non-BMP characters, I could replace them with U+FFFD
>> (REPLACEMENT CHARACTER) and life would be good (enough).

Of course you can do this, but you should not. If your input data 
includes character C, you should deal with character C, not just throw 
it away. That would be rude, and in Python 3.3 it is unnecessary.

Although, since the person you are quoting is stuck in Python 2.7, it may 
be less bad than having to deal with potentially broken Unicode strings.
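For what it's worth, here is one way to do it in 2.7 (a sketch only; 
the names `NON_BMP` and `to_bmp` are mine, not anything standard). On a 
narrow build, non-BMP characters are stored as surrogate pairs, so the 
pattern has to match those as well as single astral code points (the 
wide-build and 3.x case):

```python
# -*- coding: utf-8 -*-
import re

# First alternative: a surrogate pair (narrow build).
# Second alternative: a single astral code point (wide build / 3.x).
# Either way, one match is one logical character.
NON_BMP = re.compile(u'[\ud800-\udbff][\udc00-\udfff]|[^\u0000-\uffff]')

def to_bmp(text):
    """Replace every non-BMP character with U+FFFD REPLACEMENT CHARACTER."""
    return NON_BMP.sub(u'\ufffd', text)
```

Whether throwing the data away is actually acceptable is, of course, 
still the caller's problem.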


> Steven's:
>> But it means that if you're one of the 99.9% of users who mostly use
>> characters in the BMP, …

Yes. "Mostly" does not mean exclusively, and given (say) a billion 
computer users, that leaves about a million users who have significant 
need for non-BMP characters.

If you don't agree with my estimate, feel free to invent your own :-)


> And from http://www.tlg.uci.edu/~opoudjis/unicode/unicode_astral.html
>> The informal name for the supplementary planes of Unicode is "astral
>> planes", since (especially in the late '90s) their use seemed to be as
>> remote as the theosophical "great beyond". …

That was nearly two decades ago. Two decades ago, the idea that the 
entire computing world could standardize on a single character set, 
instead of having to deal with dozens of different "code pages", seemed 
as likely as people landing on the Moon seemed in 1940.

Today, the entire computing world has standardized on such a system. 
"Code pages" (encodings) are now mostly needed only for legacy data and 
shitty applications, yet most implementations still don't support the 
entire Unicode range. A couple of programming languages, including Pike 
and Python, support Unicode fully and correctly. Pike has never had 
Python's high profile, but now that Python can support the entire 
Unicode range without broken surrogate handling, maybe users of other 
languages will start to demand the same.
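To see what "fully and correctly" buys you, this is the sort of thing 
that simply works in 3.3 (and always worked on wide builds), where a 
narrow build would report a length of 2 and hand you half a character 
when you index:

```python
s = '\U0001F40D'          # U+1F40D SNAKE, outside the BMP
assert len(s) == 1        # one code point, not a surrogate pair
assert ord(s) == 0x1F40D
assert s[0] == s          # indexing never splits a character in half
```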


> So I really wonder: Is python losing more by supporting SMP with
> performance hit on BMP?

No.

As many people have demonstrated, both with code snippets and whole-
program benchmarks, Python 3.3 is *as fast* or *faster* than Python 3.2 
narrow builds. In practice, Python 3.3 saves enough memory by using 
sensible string implementations that real world software is faster in 
Python 3.3 than in 3.2.
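You can see the mechanism with `sys.getsizeof` (exact byte counts vary 
between versions, so treat the numbers as illustrative): 3.3 stores 
each string with 1, 2 or 4 bytes per character depending on its widest 
character, so pure-ASCII text pays nothing for astral support.

```python
import sys

ascii_s  = 'a' * 1000           # 1 byte per character
bmp_s    = '\u4e2d' * 1000      # 2 bytes per character
astral_s = '\U0001F40D' * 1000  # 4 bytes per character

# Each step up in character width roughly doubles the storage.
print(sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s))
# True
```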


> The problem as I see it is that a choice that is sufficiently skew is no
> more a straightforward choice. An example will illustrate:
> 
> I can choose to drive or not -- a choice. Statistics tell me that on
> average there are 3 fatalities every day; I am very concerned that I
> could get killed so I choose not to drive. Which neglects that there are
> a couple of million safe-drives at the same time as the '3 fatalities'

Clear as mud. What does this have to do with supporting Unicode?




-- 
Steven


