[Python-Dev] Ill-defined encoding for CP875?

Tim Peters tim_one@email.msn.com
Sat, 12 May 2001 07:28:27 -0400


I have a way to make dict lookup a teensy bit cheaper(*) that significantly
reduces the number of collisions (which is much more valuable).

This caused a number of std tests to fail, because they were implicitly
relying on the order in which a dict's entries are materialized via .keys()
or .items().

Most of these were easy enough to fix.  The last failure remaining is
test_unicode, and I don't know how to fix it.  It's dying here:

    try:
        verify(unicode(s,encoding).encode(encoding) == s)
    except TestFailed:
        print '*** codec "%s" failed round-trip' % encoding
    except ValueError,why:
        print '*** codec for "%s" failed: %s' % (encoding, why)

when encoding == "cp875".  There's a bogus problem you have to worm around
first:  test_unicode neglected to import TestFailed, so it actually dies
with NameError while trying the "except TestFailed" clause after verify()
raises TestFailed.  Once that's repaired, it's complaining about failing the
round-trip encoding.
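
To see why the missing import only bites when verify() actually fails, here is a minimal sketch. The names `_Failure`, `verify` and `round_trip` are stand-ins invented for illustration, not the real test_support code; the point is only that the name in an except clause is looked up when an exception reaches it, not before.

```python
class _Failure(Exception):
    """Hypothetical stand-in for the exception that verify() raises."""


def verify(condition):
    # Stand-in for the test helper: raise when the condition is false.
    if not condition:
        raise _Failure("verify failed")


def round_trip(ok):
    try:
        verify(ok)
        return "passed"
    except TestFailed:  # deliberately never defined or imported here
        return "reported failure"


# While verify() succeeds, the except clause is never evaluated, so the
# missing name goes unnoticed.
first = round_trip(True)

# Once verify() raises, Python evaluates the name in the except clause
# and dies with NameError instead of reporting the failure.
try:
    round_trip(False)
    outcome = "no error"
except NameError:
    outcome = "NameError"
```

So the bug stays invisible for as long as every codec passes the round trip.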

The original character in s it's griping about is "?" (0x3f).  cp875.py has
this entry in its decoding_map dict:

	0x003f: 0x001a,	# SUBSTITUTE

But 0x1a is not a *unique* value in this dict.  There's also

	0x00dc: 0x001a,	# SUBSTITUTE
	0x00e1: 0x001a,	# SUBSTITUTE
	0x00ec: 0x001a,	# SUBSTITUTE
	0x00ed: 0x001a,	# SUBSTITUTE
	0x00fc: 0x001a,	# SUBSTITUTE
	0x00fd: 0x001a,	# SUBSTITUTE

Therefore what appears associated with 0x1a in the derived encoding_map
dict:

    encoding_map = {}
    for k,v in decoding_map.items():
        encoding_map[v] = k

may end up being any of the 7 decoding_map keys that map to 0x1a.  It just
so happened to map back to 0x3f before, but to 0xfd after the dict change,
so "?" doesn't survive the round trip anymore.
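
The order dependence can be reproduced without touching the dict implementation at all. The sketch below uses only the seven entries quoted above and feeds them to the same inversion loop in two different orders; the helper name `invert` is made up for the illustration.

```python
# The seven cp875 decoding_map entries quoted above that all decode
# to 0x1a (SUBSTITUTE).
decoding_map = {
    0x003f: 0x001a,
    0x00dc: 0x001a,
    0x00e1: 0x001a,
    0x00ec: 0x001a,
    0x00ed: 0x001a,
    0x00fc: 0x001a,
    0x00fd: 0x001a,
}


def invert(items):
    """Same loop as the codec uses: a later item overwrites an earlier one."""
    encoding_map = {}
    for k, v in items:
        encoding_map[v] = k
    return encoding_map


# Same data, two iteration orders, two different reverse mappings.
ascending = invert(sorted(decoding_map.items()))
descending = invert(sorted(decoding_map.items(), reverse=True))

# Whichever key is visited last is the one 0x1a maps back to.
last_when_ascending = ascending[0x1a]    # 0xfd
last_when_descending = descending[0x1a]  # 0x3f
```

Whichever key the loop happens to visit last wins, so any change in iteration order can change what 0x1a encodes to.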

My knowledge of encoding internals is exceeded only by my mastery of file
URLs under Windows <wink>, so I could sure use some help getting this
repaired.  I'd really like to check in the dict improvement (+ test
repairs), but won't do it so long as it makes a std test fail.  If, e.g.,
you're *relying* on "the first" of a set of ambiguous reverse mappings
winning the game, then iterating over decoding_map.items() in reverse sorted
order would do the trick reliably.  But I don't know whether the ambiguity
in cp875 is a bug or an undocumented feature ...
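
A sketch of what the reverse-sorted idea might look like, assuming the intent really is "lowest key wins". Both function names are hypothetical, and the 0x40 entry in the sample data is only there as an unambiguous contrast case. The second helper just lists which decoded values have more than one source, which may help in deciding whether the ambiguity is intended.

```python
def invert_lowest_key_wins(decoding_map):
    """Invert a decoding map so that the smallest key wins any tie.

    Keys are visited from largest to smallest, so the smallest key is
    assigned last and is the one that survives.
    """
    encoding_map = {}
    for k in sorted(decoding_map, reverse=True):
        encoding_map[decoding_map[k]] = k
    return encoding_map


def ambiguous_targets(decoding_map):
    """Return {decoded value: sorted source keys} for values hit more than once."""
    sources = {}
    for k, v in decoding_map.items():
        sources.setdefault(v, []).append(k)
    return {v: sorted(ks) for v, ks in sources.items() if len(ks) > 1}


# Illustrative data: the ambiguous entries quoted above plus one
# unambiguous entry for contrast.
sample = {
    0x003f: 0x001a,
    0x0040: 0x0020,
    0x00dc: 0x001a,
    0x00e1: 0x001a,
    0x00ec: 0x001a,
    0x00ed: 0x001a,
    0x00fc: 0x001a,
    0x00fd: 0x001a,
}

encoding_map = invert_lowest_key_wins(sample)
clashes = ambiguous_targets(sample)
```

With this ordering the result no longer depends on how the dict happens to lay out its entries, though whether 0x3f is the *right* winner is still the open question.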

7-bit-ascii-looks-better-every-day<wink>-ly y'rs  - tim


(*) Simply by taking the damn "~" off "~hash" -- I explained quite a while
ago why that can lead to a weak form of clustering "in theory", and
instrumenting the dict lookup code confirmed that it does hurt in real life.