[Python-Dev] Re: Ill-defined encoding for CP875?

M.-A. Lemburg mal@lemburg.com
Sun, 13 May 2001 19:20:01 +0200


Tim Peters wrote:
> 
> I have a way to make dict lookup a teensy bit cheaper(*) that significantly
> reduces the number of collisions (which is much more valuable).
> 
> This caused a number of std tests to fail, because they were implicitly
> relying on the order in which a dict's entries are materialized via .keys()
> or .items().
> 
> Most of these were easy enough to fix.  The last failure remaining is
> test_unicode, and I don't know how to fix it.  It's dying here:
> 
>     try:
>         verify(unicode(s,encoding).encode(encoding) == s)
>     except TestFailed:
>         print '*** codec "%s" failed round-trip' % encoding
>     except ValueError,why:
>         print '*** codec for "%s" failed: %s' % (encoding, why)
> 
> when encoding == "cp875".  There's a bogus problem you have to worm around
> first:  test_unicode neglected to import TestFailed, so it actually dies
> with NameError while trying the "except TestFailed" clause after verify()
> raises TestFailed.  Once that's repaired, it's complaining about failing the
> round-trip encoding.

Ooops; this must have been caused by the assert statement
removal in the test suite I hacked up some months ago. Funny that
it never showed up... the code seems to be very robust ;-)
 
> The original character in s it's griping about is "?" (0x3f).  cp875.py has
> this entry in its decoding_map dict:
> 
>         0x003f: 0x001a, # SUBSTITUTE
> 
> But 0x1a is not a *unique* value in this dict.  There's also
> 
>         0x00dc: 0x001a, # SUBSTITUTE
>         0x00e1: 0x001a, # SUBSTITUTE
>         0x00ec: 0x001a, # SUBSTITUTE
>         0x00ed: 0x001a, # SUBSTITUTE
>         0x00fc: 0x001a, # SUBSTITUTE
>         0x00fd: 0x001a, # SUBSTITUTE
> 
> Therefore what appears associated with 0x1a in the derived encoding_map
> dict:
> 
> encoding_map = {}
> for k,v in decoding_map.items():
>     encoding_map[v] = k
> 
> may end up being any of the 7 decoding_map keys that map to 0x1a.  It just
> so happened to map back to 0x3f before, but to 0xfd after the dict change,
> so "?" doesn't survive the round trip anymore.
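
A minimal sketch of the effect described above, written for current
Python rather than the 2.x code in the quote. It uses only the seven
decoding_map entries quoted here; the helper names invert() and
collisions() are made up for illustration and are not part of cp875.py.
Feeding the same pairs to the naive inversion loop in two different
orders yields two different "inverses":

```python
from collections import defaultdict

# The seven cp875 decoding_map entries quoted above (byte -> code point).
decoding_map = {
    0x003f: 0x001a,
    0x00dc: 0x001a,
    0x00e1: 0x001a,
    0x00ec: 0x001a,
    0x00ed: 0x001a,
    0x00fc: 0x001a,
    0x00fd: 0x001a,
}

def invert(pairs):
    """Naive inversion: a later (key, value) pair overwrites an earlier one."""
    encoding_map = {}
    for k, v in pairs:
        encoding_map[v] = k
    return encoding_map

def collisions(mapping):
    """Return {value: sorted keys} for every value reached from 2+ keys."""
    sources = defaultdict(list)
    for k, v in mapping.items():
        sources[v].append(k)
    return {v: sorted(ks) for v, ks in sources.items() if len(ks) > 1}

# Same pairs, two iteration orders, two different results for 0x1a.
ascending = invert(sorted(decoding_map.items()))
descending = invert(sorted(decoding_map.items(), reverse=True))

print(hex(ascending[0x1a]))    # 0xfd
print(hex(descending[0x1a]))   # 0x3f
print(collisions(decoding_map))
```

Whichever key is visited last wins, so any change to the dict's
iteration order can change which byte "?" is encoded back to.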

The "right" thing to do here is simply to remove cp875
from the round-trip test. It is not the only encoding
that fails this test, but that's not our fault: the codecs were
all generated from the original codec maps at the Unicode.org site.

If their mappings are broken, we can't do much about it... other
than to ignore the error or remove the codec altogether.
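
As a rough sketch of what "remove it from the test" could look like,
assuming current Python (bytes.decode / str.encode instead of the
unicode() call in the quoted test). The names round_trips(),
KNOWN_LOSSY and check() are hypothetical and do not come from
test_unicode:

```python
def round_trips(encoding, sample=bytes(range(256))):
    """True if decoding and then re-encoding `sample` gives it back unchanged."""
    try:
        return sample.decode(encoding).encode(encoding) == sample
    except UnicodeError:
        # Undefined positions in the codec count as a failed round trip.
        return False

# Codecs whose source mapping tables are known not to be invertible
# (assumption: maintained by hand, as suggested for cp875 above).
KNOWN_LOSSY = {"cp875"}

def check(encodings):
    """Return the encodings that fail the round trip, skipping KNOWN_LOSSY."""
    failures = []
    for enc in encodings:
        if enc in KNOWN_LOSSY:
            continue
        if not round_trips(enc):
            failures.append(enc)
    return failures

print(check(["latin-1", "cp875"]))
```

The skip set keeps the decision visible in one place, so a codec only
leaves the test when someone has looked at its mapping table.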

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/