[ python-Bugs-850997 ] mbcs encoding ignores errors

Mon Dec 1 16:25:17 EST 2003

Bugs item #850997, was opened at 2003-11-29 02:24
Message generated for change (Comment added) made by loewis
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=850997&group_id=5470

Category: Windows
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Nobody/Anonymous (nobody)
Summary: mbcs encoding ignores errors

Initial Comment:
The following snippet:

>>> u'@test-\u5171'.encode("mbcs", "strict")
'@test-?'

Should raise a UnicodeError.  The errors param is
completely ignored, and the function always works as
though errors='replace'.
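
For comparison, a minimal sketch of how the errors argument behaves with a codec that honours it. The ascii codec stands in for mbcs here (an assumption for illustration, since mbcs exists only on Windows), the helper name encode_report is hypothetical, and the code is written for a current Python 3 interpreter rather than the 2.x series under discussion:

```python
def encode_report(text, encoding="ascii"):
    """Encode text under each standard error mode and collect the outcome."""
    results = {}
    for errors in ("strict", "replace", "ignore"):
        try:
            results[errors] = text.encode(encoding, errors)
        except UnicodeError as exc:
            # Record the exception type instead of an encoded value.
            results[errors] = type(exc).__name__
    return results


report = encode_report(u'@test-\u5171')
for mode in ("strict", "replace", "ignore"):
    print(mode, report[mode])
```

Under this stand-in, "strict" raises, "replace" substitutes '?', and "ignore" drops the character, which is the distinction the mbcs codec currently does not make.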

Attaching a test case, and the start of a patch.  The
patch has a number of issues:
* I'm not sure what errors are considered 'mandatory'.
 I have handled 'strict', 'ignore' and 'replace' -
however, 'ignore' and 'replace' currently are exactly
the same (ie, replace)
* The Windows functions don't tell us exactly what
character failed in the conversion.  Thus, the
exception I raise implies the first character is the
one that failed.  For the same reason, I have made no
attempt to support error callbacks.

Comments/guidance appreciated.

----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2003-12-01 22:25

Message:
Logged In: YES 
user_id=21627

The conventional semantics of "ignore" would be "remove the
failing characters from the output". This would be difficult
to implement if the Microsoft API provides no detailed error
indication.
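
To illustrate why "ignore" depends on detailed error indication, here is a small sketch using Python's error-callback mechanism with the ascii codec as a stand-in for mbcs (an assumption, as is the handler name drop_and_record). The handler can only drop the failing characters because the codec tells it where they are:

```python
import codecs

failed_spans = []


def drop_and_record(exc):
    """Hypothetical handler: behave like 'ignore', but record each failure."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.start and exc.end delimit the characters that could not be encoded.
    failed_spans.append((exc.start, exc.end))
    # Emit nothing for the failing span and resume right after it.
    return (u"", exc.end)


codecs.register_error("drop_and_record", drop_and_record)

encoded = u"a\u5171b\u5171\u5171c".encode("ascii", "drop_and_record")
print(encoded, failed_spans)
```

Without the start and end offsets, the handler would have nothing to remove, which is the situation with an API that only substitutes a default character.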

You could try to get more detailed error indication by
re-encoding the resulting string with a NULL buffer,
counting the number of characters that have successfully
been encoded, at least in the .decode case.

In the .encode case, you could try using \0 as the default
char. To my knowledge, no ACP ever uses \0 in a multi-byte
string.
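
A rough sketch of the \0 idea, under stated assumptions: lossy_encode is a hypothetical stand-in for an API that silently substitutes a default character (it does not call WideCharToMultiByte), and ascii plays the role of the ANSI code page. Input that already contains \0 is rejected, since it could not be told apart from a substitution:

```python
def lossy_encode(text, default_char=b"?"):
    """Stand-in for an API that substitutes default_char and reports nothing."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("ascii")
        except UnicodeEncodeError:
            out += default_char
    return bytes(out)


def strict_encode(text):
    """Detect failed conversions by using NUL as the default character."""
    if u"\0" in text:
        # A NUL in the input would look exactly like a substitution.
        raise ValueError("sketch does not handle NUL in the input")
    encoded = lossy_encode(text, default_char=b"\0")
    pos = encoded.find(b"\0")
    if pos != -1:
        raise UnicodeError("unencodable character at byte offset %d" % pos)
    return encoded


print(strict_encode(u"@test-"))
```

The byte offset reported is an offset into the output, not a character index into the input; mapping one to the other for a real multi-byte code page would need extra work.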

What is the meaning of the WC_DEFAULTCHAR flag, in
WideCharToMultiByte, and why are you not using it?

I'm somewhat concerned with backwards compatibility, since
the mbcs codec has never returned errors. So this should be
applied to 2.4 only, and listed in whatsnew.tex.

----------------------------------------------------------------------

Comment By: Thomas Heller (theller)
Date: 2003-11-29 16:18

Message:
Logged In: YES 
user_id=11105

No idea why this was assigned to me - unicode is certainly
not one of my strengths.

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-11-29 02:31

Message:
Logged In: YES 
user_id=14198

Attaching a patch.  This patch also attempts to handle
decode, but I haven't worked out how to exercise this
code path - i.e., what mbcs-encoded string can I pass that
cannot be converted to unicode?

As I mentioned, the patch has a few issues.
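
On the question of which byte string cannot be converted to unicode: one candidate, assuming the ANSI code page is a double-byte one such as cp932, is a lead byte with no trail byte after it. The sketch below uses Python's own cp932 codec, which may not match what the Windows conversion function does with the same input; the helper name first_bad_offset is hypothetical:

```python
def first_bad_offset(data, encoding="cp932"):
    """Return the byte offset of the first undecodable sequence, or None."""
    try:
        data.decode(encoding, "strict")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# 0x81 is a cp932 lead byte; with nothing after it the sequence is incomplete.
print(first_bad_offset(b"abc\x81"))
```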

----------------------------------------------------------------------
