[Python-Dev] RE: Ill-defined encoding for CP875?

M.-A. Lemburg mal@lemburg.com
Wed, 16 May 2001 11:29:49 +0200


Tim Peters wrote:
> 
> [MAL]
> > Round-tripping is obviously very important if you use Unicode
> > as basis for working on text.
> 
> Since I use 7-bit ASCII exclusively, I've been using
> 
>     encode = decode = lambda x: x
> 
> I haven't proved that's round-trippable, but haven't bumped into an exception
> yet.

For character map codecs the complete range(256) of possible
input characters should pass the round-trip test, that is

	encoded text -> Unicode -> encoded text

should result in the identity mapping for all c in map(chr,range(256)).
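
A minimal sketch of that round-trip check, assuming a modern Python 3 where encoded text is a bytes object. The helper name failing_round_trip_bytes is hypothetical and not part of codecs.py:

```python
def failing_round_trip_bytes(encoding):
    """Return the byte values that do not survive bytes -> str -> bytes."""
    failures = []
    for i in range(256):
        raw = bytes([i])
        try:
            # encoded text -> Unicode -> encoded text
            if raw.decode(encoding).encode(encoding) != raw:
                failures.append(i)
        except UnicodeError:
            # An undefined mapping in either direction also breaks the round-trip.
            failures.append(i)
    return failures
```

A codec passes the test when the returned list is empty, e.g. for "latin-1". Calling it with "cp875" lists the byte values affected by the ambiguity discussed here.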
 
> > I don't know about the reasoning behind making cp875 fail the
> > round-trip -- Unicode certainly provides means to make mappings
> > round-trip safe (e.g. by reverting to the private Unicode
> > char. point areas).
> 
> Then I ignorantly but confidently (indeed, with the cheery confidence only
> the truly ignorant can truly enjoy!) vote for your approach that maps the
> non-round-trippable cp875 code points to None.  Better safe than sorry, by
> default.  Else 6 of the 7 ambiguous chars will be silent surprises by
> default.

I will check in a patch which moves the building logic for encoding
maps to codecs.py. This will simplify the task of choosing the
"right" solution. Currently I'm in favour of:

def make_encoding_map(decoding_map):

    """ Creates an encoding map from a decoding map.

        If a target mapping in the decoding map occurs multiple
        times, then that target is mapped to None (undefined mapping),
        causing an exception when encountered by the charmap codec
        during translation.

        One example where this happens is cp875.py which decodes
        multiple characters to U+001A.

    """
    m = {}
    for k, v in decoding_map.items():
        if v not in m:
            # First time this target is seen: record its source code point.
            m[v] = k
        else:
            # Target seen before: the reverse mapping is ambiguous.
            m[v] = None
    return m
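
To illustrate the effect, here is a small sketch that runs the function on a toy decoding map. The function is restated in Python 3 syntax so the block runs on its own, and the map values are invented for illustration; they are not the real cp875 table:

```python
def make_encoding_map(decoding_map):
    """Invert a decoding map; ambiguous targets are mapped to None."""
    m = {}
    for k, v in decoding_map.items():
        if v not in m:
            m[v] = k
        else:
            m[v] = None
    return m

# Toy decoding map (invented values): three source bytes share one target.
toy_decoding_map = {
    0x41: 0x0041,
    0x42: 0x0042,
    0xF0: 0x001A,
    0xF1: 0x001A,
    0xF2: 0x001A,
}

toy_encoding_map = make_encoding_map(toy_decoding_map)
```

The unambiguous targets 0x0041 and 0x0042 keep their source bytes, while 0x001A ends up mapped to None, so encoding that character raises an error rather than silently picking one of the three candidates.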

Perhaps we should also have a codecs.finalize_decoding_map() API
in codecs.py which checks the decoding map and postprocesses
it in case it finds a problem ?!
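
The checking half of such an API could look roughly like the sketch below. This is only an assumption about what the check might do: the name find_ambiguous_targets is hypothetical, nothing like it exists in codecs.py, and the postprocessing step is left open:

```python
def find_ambiguous_targets(decoding_map):
    """Return {target: [sources]} for targets reached from several sources.

    Hypothetical helper: it only reports problems and does not modify
    the decoding map.
    """
    sources_by_target = {}
    for source, target in decoding_map.items():
        if target is None:
            # Undefined entries cannot be encoded anyway, so skip them.
            continue
        sources_by_target.setdefault(target, []).append(source)
    return {
        target: sorted(sources)
        for target, sources in sources_by_target.items()
        if len(sources) > 1
    }
```

An empty result would mean the decoding map can be inverted without losing information.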

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/