[IronPython] IronPython codec names not compatible with CPython

Mon Oct 9 04:23:30 CEST 2006

On 8/10/2006 12:54 PM, John Machin wrote:
> CPython recognises both 'gbk' and 'cp936' i.e. unicode('some string', 
> 'gbk') does what you'd expect.
> IronPython 1.0.1 recognises only 'cp936'.
> 
> CPython recognises 'mac_roman', 'mac_greek', etc.
> IronPython doesn't.
> 
> After a [rare] flash of inspiration, I tried 'cp10000', 'cp10006', etc 
> and IronPython recognises these, which CPython doesn't.
> 
> The "differences" document says: """
> IronPython's _codecs module implementation is incomplete.  There are 
> several replace_error/lookup_error handlers that IronPython does not 
> implement.
> """
> It is not apparent whether this is intended to mean that missing error 
> handlers are the *only* known deficiency.
> 
> IronPython Bug #3214 mentions "import encodings" as fixing a 
> LookupError. Well, you learn something new every day:
> 1. CPython permits one to import encodings, but it's not documented 
> AFAICT, and it's *not* necessary in order to use 'gbk', 'mac_roman', etc.
> 2. After import encodings, IronPython recognises 'mac_roman' and 
> 'mac_greek', but still not 'gbk'.
> 
> How much of the above is bug and how much is feature? What is this 
> mysterious encodings module anyway? Does this mean the CPython test 
> suite doesn't cover the above cases? Are the equivalences (mac_roman = 
> cp10000, etc.) correct and official? Should I just dump all of the above 
> into the IronPython Issue Tracker?
> 
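
For anyone who wants to repeat the name checks on their own interpreter, here is a minimal sketch. It only asks whether a name resolves, via codecs.lookup(); the helper name codec_is_known is made up for illustration, and the answers will differ between CPython, IronPython, and versions of each.

```python
import codecs

def codec_is_known(name):
    """Return True if this interpreter can resolve the codec name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True

# Names taken from the message above; results depend on the implementation.
for name in ("gbk", "cp936", "mac_roman", "mac_greek", "cp10000", "cp10006"):
    status = "recognised" if codec_is_known(name) else "LookupError"
    print("%-10s %s" % (name, status))
```
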

An update: I had appended
     sys.path.append(r"C:\python24\Lib")
to my IronPython site.py, so the encodings package I was importing 
came from CPython 2.4's library.

Removing that: IronPython doesn't have an encodings module ... so why 
does Bug #3214 say to import it?

Leaving it in:
unicode('\xf0', 'mac_roman') produces the wrong exception (I'd expect a 
UnicodeDecodeError):

     exceptions.SystemError: Object reference not set to an instance of an object.

unicode('\xf0', 'mac_roman', 'replace') produces the same exception.
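
For comparison, a rough sketch of the behaviour I would expect from a charmap codec when it meets an undefined byte. Assumptions: a recent CPython, and cp1252 byte 0x81 as a stand-in, because whether 0xf0 is defined in mac_roman depends on which version of the table you have. The helper decode_report is hypothetical.

```python
def decode_report(raw, encoding, errors="strict"):
    """Decode raw bytes; return the text, or the name of the exception raised."""
    try:
        return raw.decode(encoding, errors)
    except Exception as exc:
        return type(exc).__name__

# Byte 0x81 is undefined in CPython's cp1252 table (stand-in for mac_roman 0xf0).
print(decode_report(b"\x81", "cp1252"))                   # strict: exception name
print(repr(decode_report(b"\x81", "cp1252", "replace")))  # replace: U+FFFD
```

With the strict handler the result is the exception name; with 'replace' the undefined byte comes back as U+FFFD.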

And for the curious, the two encodings are not exactly identical:

0xdb: mac_roman u'\xa4', cp10000 u'\u20ac'
0xf0: mac_roman u'\ufffd', cp10000 u'\uf8ff'
(U+FFFD (REPLACEMENT CHARACTER) is what I stuffed into a kludgy DIY 
workaround; U+F8FF is a private-use code point, so it has no standard 
meaning)
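
A comparison like this can be done by decoding each of the 256 byte values with both codecs. This is a sketch, not the script I actually ran: the function name codec_differences is invented, and it uses latin-1 and cp1252 as example names because both exist in CPython. Substitute whichever pair your interpreter recognises.

```python
def codec_differences(name_a, name_b):
    """List (byte, char_a, char_b) for single bytes the two codecs decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes(bytearray([value]))  # a one-byte string on Python 2 and 3
        # 'replace' turns undefined bytes into U+FFFD instead of raising.
        char_a = raw.decode(name_a, "replace")
        char_b = raw.decode(name_b, "replace")
        if char_a != char_b:
            diffs.append((value, char_a, char_b))
    return diffs

# Example pair only; the codec names are assumptions, not the pair from this post.
for value, char_a, char_b in codec_differences("latin-1", "cp1252"):
    print("0x%02x: %r vs %r" % (value, char_a, char_b))
```

Because undefined bytes are decoded with 'replace', they show up as U+FFFD in the output rather than aborting the loop.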

I was going to show the names of the characters, using 
unicodedata.name(), but there's no unicodedata module in IronPython (and 
that's not mentioned in the differences file).
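
On CPython, where unicodedata is available, the names can be listed with something like the sketch below. The helper describe and the "<no name>" placeholder are my own choices; the default argument to unicodedata.name() avoids a ValueError for code points that have no name, such as private-use ones.

```python
import unicodedata

def describe(char):
    """Return 'U+XXXX NAME', with a placeholder for code points that have no name."""
    name = unicodedata.name(char, "<no name>")
    return "U+%04X %s" % (ord(char), name)

# The four characters from the comparison above.
for char in (u"\xa4", u"\u20ac", u"\ufffd", u"\uf8ff"):
    print(describe(char))
```
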

Cheers,
John