Flexible string representation, unicode, typography, ...

Sun Sep 2 14:58:08 EDT 2012

Le dimanche 2 septembre 2012 11:07:35 UTC+2, Ian a écrit :
> On Sun, Sep 2, 2012 at 1:36 AM,  <wxjmfauth at gmail.com> wrote:
> 
> > I still remember my thoughts when I read the PEP 393
> 
> > discussion: "this is not logical", "they do no understand
> 
> > typography", "atomic character ???", ...
> 
> 
> 
> That would indicate one of two possibilities.  Either:
> 
> 
> 
> 1) Everybody in the PEP 393 discussion except for you is clueless
> 
> about how to implement a Unicode type; or
> 
> 
> 
> 2) You are clueless about how to implement a Unicode type.
> 
> 
> 
> Taking into account Occam's razor, and also that you seem to be unable
> 
> or unwilling to offer a solid rationale for those thoughts, I have to
> 
> say that I'm currently leaning toward the second possibility.
> 
> 
> 
> 
> 
> > Real world exemples.
> 
> >
> 
> >>>> import libfrancais
> 
> >>>> li = ['noël', 'noir', 'nœud', 'noduleux', \
> 
> > ...     'noétique', 'noèse', 'noirâtre']
> 
> >>>> r = libfrancais.sortfr(li)
> 
> >>>> r
> 
> > ['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',
> 
> > 'noirâtre']
> 
> 
> 
> libfrancais does not appear to be publicly available.  It's not listed
> 
> in PyPI, and googling for "python libfrancais" turns up nothing
> 
> relevant.
> 
> 
> 
> Rewriting the example to use locale.strcoll instead:
> 
> 
> 
> >>> li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']
> 
> >>> import locale
> 
> >>> locale.setlocale(locale.LC_ALL, 'French_France')
> 
> 'French_France.1252'
> 
> >>> import functools
> 
> >>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
> 
> ['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir', 'noirâtre']
> 
> 
> 
> # Python 3.2
> 
> >>> import timeit
> 
> >>> timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))", "import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)
> 
> [0.5544277025009592, 0.5370117249557325, 0.5551836677925053]
> 
> 
> 
> # Python 3.3
> 
> >>> import timeit
> 
> >>> timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))", "import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)
> 
> [0.1421166788364303, 0.12389078130001963, 0.13184190553613462]
> 

> 
> As you can see, Python 3.3 is about 77% faster than Python 3.2 on this
> 
> example.  If this was intended to show that the Python 3.3 Unicode
> 
> representation is a regression over the Python 3.2 implementation,
> 
> then it's a complete failure as an example.

- Unfortunately, I got opposite and even much worst results on my win box,
considering
- libfrancais is one of my module and it does a little bit more than
the std sorting tools. 

My rationale: very simple.

1) I never heard about something better than sticking with one
of the Unicode coding scheme. (genreral theory)
2) I am not at all convinced by the "new" Py 3.3 algorithm. I'm not the
only one guy, who noticed problems. Arguing, "it is fast enough", is not
a correct answer.

jmf