Custom alphabetical sort

wxjmfauth at gmail.com wxjmfauth at gmail.com
Fri Dec 28 05:27:32 EST 2012


On Friday, December 28, 2012 at 00:17:53 UTC+1, Ian wrote:
> On Thu, Dec 27, 2012 at 3:17 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> >> PS Py 3.3 warranty: ~30% slower than Py 3.2
> >
> > Do you have any actual timing data to back up that claim?
> > If so, please give specifics, including build, os, system, timing code, and
> > result.
> 
> There was another thread about this one a while back.  Using IDLE on Windows XP:
> 
> >>> import timeit, locale
> >>> li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']
> >>> locale.setlocale(locale.LC_ALL, 'French_France')
> 'French_France.1252'
> 
> >>> # Python 3.2
> >>> min(timeit.repeat("sorted(li, key=locale.strxfrm)", "import locale; from __main__ import li", number=100000))
> 1.1581226105552531
> 
> >>> # Python 3.3.0
> >>> min(timeit.repeat("sorted(li, key=locale.strxfrm)", "import locale; from __main__ import li", number=100000))
> 1.4595282361305697
> 
> 1.460 / 1.158 = 1.261
> 
> >>> li = li * 100
> >>> import random
> >>> random.shuffle(li)
> 
> >>> # Python 3.2
> >>> min(timeit.repeat("sorted(li, key=locale.strxfrm)", "import locale; from __main__ import li", number=1000))
> 1.233450899485831
> 
> >>> # Python 3.3.0
> >>> min(timeit.repeat("sorted(li, key=locale.strxfrm)", "import locale; from __main__ import li", number=1000))
> 1.5793845307155152
> 
> 1.579 / 1.233 = 1.281
> 
> So about 26% slower for sorting a short list of French words and about
> 28% slower for a longer list.  Replacing the strings with ASCII and
> removing the 'key' argument gives a comparable result for the long
> list but more like a 40% slowdown for the short list.

----

Not directly related to this thread; just for information.

My sorting algorithm does a little more than a plain
"locale.strxfrm". locale.strxfrm happens to work correctly on
the list I gave as an example, but it fails in many other cases.
One of the stumbling blocks is "œ", which must be collated as
"oe". This is not the place to discuss such linguistic details.
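As a small illustration of the "œ" point (my own sketch, not jmf's code): plain code-point order puts "nœud" after "noir", because "œ" (U+0153) compares greater than "i", whereas expanding the ligature to "oe" before comparing restores dictionary order:

```python
# Code-point order vs. ligature-expanded order for "œ".
words = ["noir", "nœud"]

# Default sort compares raw code points: "œ" > "i".
print(sorted(words))  # ['noir', 'nœud']

# Expanding "œ" to "oe" in the key gives dictionary order.
print(sorted(words, key=lambda w: w.replace("œ", "oe")))  # ['nœud', 'noir']
```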

My algorithm uses neither unicodedata nor Unicode normalization.
It is mainly a large number of character and substring
substitutions used to build the primary keys.
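To make the idea concrete, here is a minimal sketch of such a substitution-based primary key. The substitution table is hypothetical and far from complete; the real algorithm described above is more elaborate (and would also need secondary keys to break ties between words whose primary keys collide):

```python
# Hypothetical, minimal substitution table: ligatures are expanded
# and accented letters are mapped to their base letters. No
# unicodedata or normalization involved.
SUBSTITUTIONS = {
    "œ": "oe", "æ": "ae",
    "é": "e", "è": "e", "ê": "e", "ë": "e",
    "à": "a", "â": "a", "î": "i", "ï": "i",
    "ô": "o", "ù": "u", "û": "u", "ç": "c",
}

def primary_key(word):
    """Build a primary comparison key by plain char/substring substitution."""
    key = word.lower()  # input is lowercased first, so the table needs only lowercase entries
    for src, dst in SUBSTITUTIONS.items():
        key = key.replace(src, dst)
    return key

li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']
print(sorted(li, key=primary_key))
# ['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir', 'noirâtre']
```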

jmf


