Re: Stack Overflow moderator “animuson”

wxjmfauth at gmail.com
Fri Jul 19 14:54:56 EDT 2013


On Wednesday, 10 July 2013 at 11:00:23 UTC+2, Steven D'Aprano wrote:
> On Wed, 10 Jul 2013 07:55:05 +0000, Mats Peterson wrote:
> 
> > A moderator who calls himself “animuson” on Stack Overflow doesn’t want
> > to face the truth. He has deleted all my postings regarding Python
> > regular expression matching being extremely slow compared to Perl.
> 
> That's by design. We don't want to make the same mistake as Perl, where
> every problem is solved by a regular expression:
> 
> http://neilk.net/blog/2000/06/01/abigails-regex-to-test-for-prime-numbers/
> 
> so we deliberately make regexes as slow as possible so that programmers
> will look for a better way to solve their problem. If you check the
> source code for the re engine, you'll find that for certain regexes, it
> busy-waits for anything up to 30 seconds at a time, deliberately wasting
> cycles.
> 
> The same with Unicode. We hate French people, you see, and so in an
> effort to drive everyone back to ASCII-only text, Python 3.3 introduces
> some memory optimizations that ensure that Unicode strings work
> correctly and are up to four times smaller than they used to be. You
> should get together with jmfauth, who has discovered our dastardly plot
> and keeps posting benchmarks showing how, on carefully contrived
> micro-benchmarks using a beta version of Python 3.3, non-ASCII string
> operations can be marginally slower than in 3.2.
> 
> > Additionally my account has been suspended for 7 days. Such a dickwad.
> 
> I cannot imagine why he would have done that.
> 
> -- 
> Steven
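
[The regex Steven links to is Abigail's well-known unary primality test. For the curious, a minimal self-contained Python version (a sketch for illustration; the original was written for Perl, but the same backtracking trick works in Python's re):]

```python
import re

def is_prime(n):
    """Abigail's regex primality test: write n in unary and let
    the regex engine's backtracking search for a divisor."""
    # ^1?$        matches 0 and 1 (not prime)
    # ^(11+?)\1+$ matches any unary number that splits into >= 2
    #             equal groups of >= 2, i.e. any composite number
    return re.match(r'^1?$|^(11+?)\1+$', '1' * n) is None

print([p for p in range(30) if is_prime(p)])
# -> [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

It is elegant and, true to Steven's joke, spectacularly slow for large n, since the engine tries every candidate divisor by backtracking.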

This Flexible String Representation is a dream case study.
Attempting to optimize only a subset of characters is nonsense.

If you are a non-ASCII user, such a mechanism is irrelevant,
because by definition you do not need it. Not only is it useless,
it is penalizing, by the mere fact of its existence. [*]

Conversely (or identically), if you are an ASCII user, the situation
is the same: it is irrelevant, useless, and penalizing.

Practically, and today, all the coding schemes we have
(including the endorsed Unicode UTF transformation formats)
work with a single set of encoded code points. To take the
problem from the other side: it is precisely because one can
only work properly with a single set of code points that so
many coding schemes exist!

Question: does this FSR use three internal coding schemes
because it splits Unicode into three groups, or does it split
Unicode into three subsets so that it can have the joy of
using three coding schemes?
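
For what it is worth, the three-way split can be observed directly (a minimal sketch; the exact byte counts are CPython- and platform-specific, but the relative jump should be visible on any PEP 393 build, i.e. CPython 3.3+):

```python
import sys

# Under PEP 393, a string is stored in 1, 2 or 4 bytes per character,
# chosen by the *widest* code point the string contains.
ascii_only = 'a' * 1000
with_euro  = 'a' * 999 + '\u20ac'      # one U+20AC forces 2 bytes/char
with_emoji = 'a' * 999 + '\U0001F600'  # one astral char forces 4 bytes/char

print(sys.getsizeof(ascii_only))
print(sys.getsizeof(with_euro))   # roughly twice the payload
print(sys.getsizeof(with_emoji))  # roughly four times the payload
```

A single wide character is enough to change the representation of the whole string, which is exactly the behavior the benchmarks in this thread exercise.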

About the "micro-benchmarks": what is there to say? They appear
practically every time you use non-ASCII text.

And do not forget memory. The €uro just became expensive.

>>> import sys
>>> sys.getsizeof('$')
26
>>> sys.getsizeof('€')
40

I do not know. When a €uro character needs 14 bytes more than
a dollar, I belong to those who think there is a problem
somewhere.
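
Part of that difference is the fixed per-object overhead, which differs between the representations; the marginal cost per character can be isolated by differencing two string lengths (a sketch, assuming a PEP 393 CPython — `bytes_per_char` is a hypothetical helper for illustration):

```python
import sys

def bytes_per_char(ch, n1=100, n2=200):
    """Marginal storage cost of one character, with the fixed
    per-object header cancelled out by differencing."""
    return (sys.getsizeof(ch * n2) - sys.getsizeof(ch * n1)) // (n2 - n1)

print(bytes_per_char('$'))           # 1 on CPython: Latin-1 storage
print(bytes_per_char('\u20ac'))      # 2 on CPython: BMP, non-Latin-1
print(bytes_per_char('\U0001F600'))  # 4 on CPython: astral plane
```

So per character the €uro costs twice the dollar, not 14 bytes more; the constant extra bytes are header overhead that amortizes away over longer strings.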

This FSR is a royal gift for those who wish to teach Unicode
and the coding of characters.

jmf



