Undeterministic strxfrm?
Tuomas
tuomas.vesterinen at pp.inet.fi
Tue Sep 4 15:54:57 EDT 2007
Gabriel Genellina wrote:
> On Tue, 04 Sep 2007 07:34:54 -0300, Tuomas
> <tuomas.vesterinen at pp.inet.fi> wrote:
>
>> Python 2.4.3 (#3, Jun 4 2006, 09:19:30)
>> [GCC 4.0.0 20050519 (Red Hat 4.0.0-8)] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import locale
>> >>> def key(s):
>> ...     locale.setlocale(locale.LC_COLLATE, 'en_US.utf8')
>> ...     return locale.strxfrm(s.encode('utf8'))
>> ...
>> >>> first=key(u'maupassant guy')
>> >>> first==key(u'maupassant guy')
>> False
>> >>> first
>> '\x18\x0c \x1b\x0c\x1e\x1e\x0c\x19\x1f\x12
>> $\x01\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x01\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x01\xf5\xb79'
>>
>> >>> key(u'maupassant guy')
>> '\x18\x0c \x1b\x0c\x1e\x1e\x0c\x19\x1f\x12
>> $\x01\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x01\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x01\xb5'
>>
>> >>>
>>
>> Maybe this is enough for a sort order, but I need to be able to detect
>> equal strings too. Any hints/explanations?
>
>
> I can't use your same locale, but with my own locale settings, I get
> consistent results:
>
> Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> py> import locale
> py> locale.setlocale(locale.LC_COLLATE, 'Spanish_Argentina')
> 'Spanish_Argentina.1252'
> py> def key(s):
> ...     return locale.strxfrm(s.encode('utf8'))
> ...
Because I am writing a multi-language application, I need to place the
locale setting inside the key function. Actually, I am implementing
binary search in a list of strings sorted by locale, and I need to be able
to count on stable results from strxfrm despite possibly visiting another
locale in the meantime. Could repeated calls to setlocale cause problems?
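
To make the question concrete, here is a minimal sketch of what I have in mind, written for Python 3 (where strxfrm takes a str, so no encode step is needed). The names collation, build_index and contains are hypothetical helpers of my own, not part of the locale module. The sketch switches LC_COLLATE only for the duration of each strxfrm call and restores the previous setting afterwards. Note that setlocale changes process-wide state and is not thread-safe on most systems, so this assumes a single thread.

```python
import bisect
import locale
from contextlib import contextmanager


@contextmanager
def collation(name):
    """Temporarily switch LC_COLLATE to `name`, then restore the old value."""
    saved = locale.setlocale(locale.LC_COLLATE)  # no second argument: query only
    locale.setlocale(locale.LC_COLLATE, name)
    try:
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


def build_index(words, locale_name):
    """Return (sort key, word) pairs ordered by the collation of `locale_name`."""
    with collation(locale_name):
        return sorted((locale.strxfrm(w), w) for w in words)


def contains(index, word, locale_name):
    """Binary-search `index` for an entry whose sort key equals that of `word`."""
    with collation(locale_name):
        wanted = locale.strxfrm(word)
    # (wanted,) sorts before every (wanted, w) pair, so bisect_left lands
    # on the first entry whose key is >= wanted.
    pos = bisect.bisect_left(index, (wanted,))
    return pos < len(index) and index[pos][0] == wanted
```

With the always-available 'C' locale, build_index(['pear', 'apple', 'fig'], 'C') followed by contains(index, 'fig', 'C') should find the entry; substitute 'en_US.utf8' or 'fi_FI.utf8' where those locales are installed. The equality test on the keys only works if strxfrm returns the same key for the same input every time, which is exactly what fails for me.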
> py> first=key(u'maupassant guy')
> py> print repr(first)
> '\x0eQ\x0e\x02\x0e\x9f\x0e~\x0e\x02\x0e\x91\x0e\x91\x0e\x02\x0ep\x0e\x99\x07\x02
>
> \x0e%\x0e\x9f\x0e\xa7\x01\x01\x01\x01'
> py> print repr(key(u'maupassant guy'))
> '\x0eQ\x0e\x02\x0e\x9f\x0e~\x0e\x02\x0e\x91\x0e\x91\x0e\x02\x0ep\x0e\x99\x07\x02
>
> \x0e%\x0e\x9f\x0e\xa7\x01\x01\x01\x01'
> py> print first==key(u'maupassant guy')
> True
>
> Same thing with Python 2.4.4
>
I get the same instability with my locale 'fi_FI.utf8' too, so I am
wondering whether the source of the problem is the C library or the Python
wrapper around it. The differences in strxfrm results for identical input
are always in the last few bytes of the result.
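
One possible workaround, sketched here for Python 3 and not tested against the Python 2.4.3 setup above, is to test for equality with locale.strcoll instead of comparing transformed keys, and to check separately whether strxfrm is repeatable for a given input. Both function names below are hypothetical. Note that strcoll returning 0 means "collates equal in the current locale", which in some locales may also hold for strings that are not byte-for-byte identical.

```python
import locale


def collate_equal(a, b):
    """True if `a` and `b` compare equal under the current LC_COLLATE."""
    return locale.strcoll(a, b) == 0


def strxfrm_is_stable(s, repeats=100):
    """True if repeated strxfrm calls on `s` all return the same key."""
    first = locale.strxfrm(s)
    return all(locale.strxfrm(s) == first for _ in range(repeats))
```

If strxfrm_is_stable('maupassant guy') returns False under a given locale, the keys cannot be trusted for equality tests there, while collate_equal does not depend on the transformed keys at all.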