Nondeterministic strxfrm?

Tuomas tuomas.vesterinen at pp.inet.fi
Tue Sep 4 15:54:57 EDT 2007


Gabriel Genellina wrote:
> On Tue, 04 Sep 2007 07:34:54 -0300, Tuomas 
> <tuomas.vesterinen at pp.inet.fi> wrote:
> 
>> Python 2.4.3 (#3, Jun  4 2006, 09:19:30)
>> [GCC 4.0.0 20050519 (Red Hat 4.0.0-8)] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>>  >>> import locale
>>  >>> def key(s):
>> ...     locale.setlocale(locale.LC_COLLATE, 'en_US.utf8')
>> ...     return locale.strxfrm(s.encode('utf8'))
>> ...
>>  >>> first=key(u'maupassant guy')
>>  >>> first==key(u'maupassant guy')
>> False
>>  >>> first
>> '\x18\x0c \x1b\x0c\x1e\x1e\x0c\x19\x1f\x12
>> $\x01\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x01\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x01\xf5\xb79' 
>>
>>  >>> key(u'maupassant guy')
>> '\x18\x0c \x1b\x0c\x1e\x1e\x0c\x19\x1f\x12
>> $\x01\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x01\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x01\xb5' 
>>
>>  >>>
>>
>> Maybe this is enough for a sort order, but I need to be able to catch
>> equals too. Any hints/explanations?
> 
> 
> I can't use your same locale, but with my own locale settings, I get  
> consistent results:
> 
> Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> py> import locale
> py> locale.setlocale(locale.LC_COLLATE, 'Spanish_Argentina')
> 'Spanish_Argentina.1252'
> py> def key(s):
> ...   return locale.strxfrm(s.encode('utf8'))
> ...

Because I am writing a multilingual application, I need to place the 
locale setting inside the key function. I am actually implementing 
binary search in a locale-sorted list of strings, and I need to be able to 
count on stable results from strxfrm even if the program has switched to 
another locale in the meantime. Could repeated calls to setlocale cause problems?
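
To make the intended use concrete, here is a minimal sketch of the lookup described above, written for Python 3 (where locale.strxfrm takes and returns str) rather than the 2.4 interpreter in the transcript. The helper names collation, build_index and find are hypothetical, and the "C" locale is used only so the sketch runs without extra locales installed; a name such as 'en_US.utf8' or 'fi_FI.utf8' could be substituted if it is available. Note that setlocale changes process-wide state, so this is an illustration under those assumptions, not a thread-safe design.

```python
import bisect
import locale
from contextlib import contextmanager


@contextmanager
def collation(name):
    """Temporarily switch LC_COLLATE to `name`, then restore the old setting."""
    previous = locale.setlocale(locale.LC_COLLATE)  # no second argument: query only
    locale.setlocale(locale.LC_COLLATE, name)
    try:
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


def build_index(strings, locale_name):
    """Return (keys, items), both ordered by the collation of `locale_name`."""
    with collation(locale_name):
        pairs = sorted((locale.strxfrm(s), s) for s in strings)
    keys = [k for k, _ in pairs]
    items = [s for _, s in pairs]
    return keys, items


def find(keys, needle, locale_name):
    """Binary search over precomputed keys; return the index, or -1 if absent."""
    with collation(locale_name):
        k = locale.strxfrm(needle)
    i = bisect.bisect_left(keys, k)
    if i < len(keys) and keys[i] == k:
        return i
    return -1


keys, items = build_index(
    ["maupassant guy", "zola emile", "balzac honore"], "C")
print(items)
print(find(keys, "maupassant guy", "C"))
```

The equality test in find only works if strxfrm returns the same key for the same input every time, which is exactly the property in question here.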

> py> first=key(u'maupassant guy')
> py> print repr(first)
> '\x0eQ\x0e\x02\x0e\x9f\x0e~\x0e\x02\x0e\x91\x0e\x91\x0e\x02\x0ep\x0e\x99\x07\x02 
> 
> \x0e%\x0e\x9f\x0e\xa7\x01\x01\x01\x01'
> py> print repr(key(u'maupassant guy'))
> '\x0eQ\x0e\x02\x0e\x9f\x0e~\x0e\x02\x0e\x91\x0e\x91\x0e\x02\x0ep\x0e\x99\x07\x02 
> 
> \x0e%\x0e\x9f\x0e\xa7\x01\x01\x01\x01'
> py> print first==key(u'maupassant guy')
> True
> 
> Same thing with Python 2.4.4
> 

I get the same instability with my locale 'fi_FI.utf8' too, so I am 
wondering whether the source of the problem is the C library or the Python 
wrapper around it. The differences in strxfrm results for identical input 
are always in the last few bytes of the result.
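
One way to narrow this down would be to call the C library's strxfrm directly and see whether its output is stable. The following is a rough sketch using ctypes, assuming a POSIX system where the C library can be loaded this way and Python 3 byte strings; c_strxfrm is a hypothetical helper name. It asks strxfrm for the required length first and then allocates one extra byte for the terminating NUL.

```python
import ctypes
import ctypes.util
import locale

# Load the C library (assumption: POSIX; if find_library returns None,
# CDLL(None) falls back to the symbols already loaded into the process).
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strxfrm.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t]
libc.strxfrm.restype = ctypes.c_size_t


def c_strxfrm(raw):
    """Call the C strxfrm on a bytes object and return the transformed bytes."""
    # With a size of 0 the destination may be NULL; the return value is the
    # length of the transformed string, not counting the terminating NUL.
    needed = libc.strxfrm(None, raw, 0)
    buf = ctypes.create_string_buffer(needed + 1)
    libc.strxfrm(buf, raw, needed + 1)
    return buf.value  # bytes up to (not including) the first NUL


# "C" is used so the sketch runs anywhere; substitute the locale under test.
locale.setlocale(locale.LC_COLLATE, "C")
first = c_strxfrm(b"maupassant guy")
second = c_strxfrm(b"maupassant guy")
print(first == second)
```

If the direct call is stable under the locale in question while the Python-level call is not, that would point at the wrapper rather than the C library.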





