Undeterministic strxfrm?

Chris Mellon arkanes at gmail.com
Tue Sep 4 16:49:20 EDT 2007


On 9/4/07, Tuomas <tuomas.vesterinen at pp.inet.fi> wrote:
> Gabriel Genellina wrote:
> > On Tue, 04 Sep 2007 07:34:54 -0300, Tuomas
> > <tuomas.vesterinen at pp.inet.fi> wrote:
> >
> >> Python 2.4.3 (#3, Jun  4 2006, 09:19:30)
> >> [GCC 4.0.0 20050519 (Red Hat 4.0.0-8)] on linux2
> >> Type "help", "copyright", "credits" or "license" for more information.
> >>  >>> import locale
> >>  >>> def key(s):
> >> ...     locale.setlocale(locale.LC_COLLATE, 'en_US.utf8')
> >> ...     return locale.strxfrm(s.encode('utf8'))
> >> ...
> >>  >>> first=key(u'maupassant guy')
> >>  >>> first==key(u'maupassant guy')
> >> False
> >>  >>> first
> >> '\x18\x0c \x1b\x0c\x1e\x1e\x0c\x19\x1f\x12 $\x01\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x01\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x01\xf5\xb79'
> >>
> >>  >>> key(u'maupassant guy')
> >> '\x18\x0c \x1b\x0c\x1e\x1e\x0c\x19\x1f\x12 $\x01\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x01\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x01\xb5'
> >>
> >>  >>>
> >>
> >> Maybe this is enough for a sort order, but I need to be able to detect
> >> equal strings too. Any hints or explanations?
> >
> >
> > I can't use your same locale, but with my own locale settings, I get
> > consistent results:
> >
> > Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
> > Type "help", "copyright", "credits" or "license" for more information.
> > py> import locale
> > py> locale.setlocale(locale.LC_COLLATE, 'Spanish_Argentina')
> > 'Spanish_Argentina.1252'
> > py> def key(s):
> > ...   return locale.strxfrm(s.encode('utf8'))
> > ...
>
> Because I am writing a multi-language application, I need to place the
> locale setting inside the key function. I am actually implementing a
> binary search over a locale-sorted list of strings, so I need strxfrm to
> return stable results even if another locale has been active in the
> meantime. Could repeated calls to setlocale cause problems?
>
> > py> first=key(u'maupassant guy')
> > py> print repr(first)
> > '\x0eQ\x0e\x02\x0e\x9f\x0e~\x0e\x02\x0e\x91\x0e\x91\x0e\x02\x0ep\x0e\x99\x07\x02\x0e%\x0e\x9f\x0e\xa7\x01\x01\x01\x01'
> > py> print repr(key(u'maupassant guy'))
> > '\x0eQ\x0e\x02\x0e\x9f\x0e~\x0e\x02\x0e\x91\x0e\x91\x0e\x02\x0ep\x0e\x99\x07\x02\x0e%\x0e\x9f\x0e\xa7\x01\x01\x01\x01'
> > py> print first==key(u'maupassant guy')
> > True
> >
> > Same thing with Python 2.4.4
> >
>
> I get the same instability with my locale 'fi_FI.utf8' too, so I am
> wondering whether the source of the problem is the C library or the
> Python wrapper around it.

Looking at the Python source, the only possible error case I can see
is that the wrapper assumes the string returned by strxfrm will be
null-terminated. Your two results differ only in the bytes after the
final \x01, which would be consistent with reading past the end of
the transformed string.

It's not 100% clear from the documentation I have that the string is
guaranteed to be null-terminated, although it's implied, so this is only
a remote possibility. You might try calling the C library directly.
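
One way to call the C library directly is through ctypes. This is only a rough sketch, not what the locale module does internally. It assumes a POSIX system where ctypes can find the C library. The names c_strxfrm and key are made up for this example, and the default locale name is taken from your post. It asks strxfrm for the required length first, allocates one extra byte for the terminator, and then slices by the returned length rather than trusting a terminator.

```python
import ctypes
import ctypes.util
import locale

# Load the C library. If find_library() returns None, CDLL(None) exposes the
# symbols already loaded into the process (POSIX only; an assumption here).
_libc = ctypes.CDLL(ctypes.util.find_library("c"))
_libc.strxfrm.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t]
_libc.strxfrm.restype = ctypes.c_size_t


def c_strxfrm(text, encoding="utf-8"):
    """Hypothetical helper: transform text with the C library's strxfrm."""
    src = text.encode(encoding)
    # With a size of 0 the destination may be NULL; the return value is the
    # number of bytes needed, not counting the terminating null byte.
    needed = _libc.strxfrm(None, src, 0)
    buf = ctypes.create_string_buffer(needed + 1)
    written = _libc.strxfrm(buf, src, needed + 1)
    if written != needed:
        raise RuntimeError("strxfrm reported inconsistent lengths")
    # Slice by the reported length instead of relying on the terminator.
    return buf.raw[:written]


def key(text, locale_name="en_US.utf8"):
    """Hypothetical sort key: set the collation locale, then transform."""
    locale.setlocale(locale.LC_COLLATE, locale_name)
    return c_strxfrm(text)
```

If key('maupassant guy') compares equal across repeated calls here but not through locale.strxfrm, that would point at the wrapper rather than the C library.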

