[New-bugs-announce] [issue18625] ks_c-5601-1987 is used by microsoft when it really means cp949

Fri Aug 2 01:54:04 CEST 2013

New submission from R. David Murray:

When Microsoft handles Korean text, it uses its own code page, cp949, which is a superset of ks_c-5601-1987.  But when talking to the rest of the world, it claims that the character set name is ks_c-5601-1987.  This means that text claimed to be in ks_c-5601-1987 in email messages (and probably on web pages) can't always be decoded using the codec that ks_c-5601-1987 maps to (euc_kr). [*]
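
A minimal sketch of the decode-side mismatch, offered as an illustration rather than as part of the original report. It assumes U+B620 is a Hangul syllable that cp949 encodes in its extension range; the helper name try_decode is hypothetical.

```python
import codecs

# Python resolves the announced charset name to a codec; print which one.
print(codecs.lookup("ks_c-5601-1987").name)

# Assumption: U+B620 is a Hangul syllable that cp949 encodes in its
# extension range, outside the two-byte codes euc_kr accepts.
sample = "\ub620"
raw = sample.encode("cp949")


def try_decode(data, charset):
    """Hypothetical helper: return the decoded text or a failure note."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError as exc:
        return "failed: " + exc.reason


# Decoding with the codec the bytes were really written in.
print(try_decode(raw, "cp949"))
# Decoding with the codec the message claims to use.
print(try_decode(raw, "ks_c-5601-1987"))
```

Only the decode direction is exercised here, since that is the failure described in this report.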

This problem shows up in the real world in email.  If characters outside euc_kr are used, email will blow up when trying to decode the ostensibly ks_c-5601-1987 text.  (I'm not sure whether it will also blow up when trying to encode it, because I'm not sure what characters the two codecs cover.)

I'm not sure what the best solution is, but one possibility would be to add a "fixup" table to email that would cause it to decode ostensibly ks_c-5601-1987 text using the cp949 codec.  Since cp949 is a superset, this should at least solve the input side.
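
The fixup-table idea could look roughly like the sketch below. This is only one possible shape, not an existing email package API: the names CHARSET_FIXUPS, _normalize, and decode_with_fixups are hypothetical, and the normalization is a simplified stand-in for whatever alias handling a real implementation would use.

```python
# Hypothetical fixup table: announced charset name -> codec to decode with.
# Keys are stored in a normalized form (lowercase, underscores).
CHARSET_FIXUPS = {
    "ks_c_5601_1987": "cp949",
}


def _normalize(name):
    """Simplified normalization (assumption): lowercase, '-' becomes '_'."""
    return name.lower().replace("-", "_")


def decode_with_fixups(data, charset):
    """Decode bytes, substituting a more permissive codec when one is listed."""
    codec = CHARSET_FIXUPS.get(_normalize(charset), charset)
    return data.decode(codec)
```

With such a table in place, a call like decode_with_fixups(payload, "ks_c-5601-1987") would decode with cp949, while charsets not listed in the table are passed through unchanged. Output (encoding) is deliberately left alone in this sketch.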

[*] Some relevant standards discussion:

    http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html

From this, it isn't clear why we map ks_c-5601-1987 to euc_kr, since they at least appear to be different codecs.  I haven't looked at the relevant RFCs to see what the differences are, though.

----------
components: Unicode, email
messages: 194139
nosy: barry, ezio.melotti, r.david.murray
priority: normal
severity: normal
status: open
title: ks_c-5601-1987 is used by microsoft when it really means cp949
type: enhancement
versions: Python 3.4

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue18625>
_______________________________________