[issue11303] b'x'.decode('latin1') is much slower than b'x'.decode('latin-1')

Thu Feb 24 17:31:35 CET 2011

Marc-Andre Lemburg <mal at egenix.com> added the comment:

STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner at haypocalc.com> added the comment:
> 
> I think that the normalization function in unicodeobject.c (only used for internal functions) can skip any character other than a-z, A-Z and 0-9. Something like:
> 
> >>> import re
> >>> def normalize(name): return re.sub("[^a-z0-9]", "", name.lower())
> ... 
> >>> normalize("UTF-8")
> 'utf8'
> >>> normalize("ISO-8859-1")
> 'iso88591'
> >>> normalize("latin1")
> 'latin1'
> 
> So ISO-8859-1, ISO8859-1, LATIN-1, latin1, UTF-8, utf8, etc. will be normalized to iso88591, latin1 and utf8.
> 
> I don't know any encoding name where a character outside a-z, A-Z, 0-9 means anything special. But I don't know all encoding names! :-)

I think that, rather than removing hyphens, spaces, etc., the
function should additionally:

 * add hyphens whenever they are missing and there's a switch
   from [a-z] to [0-9]

That way you end up with the correct names for the given set of
optimized encoding names.
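
A rough Python sketch of that idea, for illustration only. The name
normalize_with_hyphens is hypothetical, and the sketch assumes that
lowercasing is the only other normalization step; the real function
in unicodeobject.c is written in C and may do more.

```python
import re

def normalize_with_hyphens(name):
    # Hypothetical helper, not the code in unicodeobject.c.
    # Lowercase the name, then insert a hyphen at every position where
    # a letter is immediately followed by a digit. Existing hyphens are
    # left untouched, so "latin-1" stays "latin-1".
    return re.sub(r"(?<=[a-z])(?=[0-9])", "-", name.lower())

print(normalize_with_hyphens("latin1"))     # latin-1
print(normalize_with_hyphens("UTF8"))       # utf-8
print(normalize_with_hyphens("ISO8859-1"))  # iso-8859-1
```

Note that a name with no separators at all, such as "iso88591", would
only become "iso-88591" under this rule, so it would still miss the
optimized name.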

----------
title: b'x'.decode('latin1') is much slower than b'x'.decode('latin-1') -> b'x'.decode('latin1') is much slower	than	b'x'.decode('latin-1')

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue11303>
_______________________________________