Python & Unicode decimal interpretation

Sat Dec 3 05:19:13 EST 2005

Scott David Daniels wrote:
> In reading over the source for CPython's PyUnicode_EncodeDecimal,
> I see a dance to handle characters which are neither dec-equiv nor
> in Latin-1.  Does anyone know about the intent of such a conversion?

To support this:

 >>> int(u"\N{DEVANAGARI DIGIT SEVEN}")
7
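
A slightly fuller sketch of the same behaviour, written for a current
Python 3 interpreter (where every str is Unicode, so the u prefix is
optional). The variable names are only illustrative; int() and
unicodedata.decimal() are the standard library calls.

```python
import unicodedata

# Any character with the Unicode "decimal digit" property is accepted by int().
devanagari_seven = "\N{DEVANAGARI DIGIT SEVEN}"
mixed = "4\N{ARABIC-INDIC DIGIT TWO}"  # ASCII digit followed by a non-ASCII digit

value_single = int(devanagari_seven)   # 7
value_mixed = int(mixed)               # 42

# unicodedata.decimal() exposes the per-character decimal value directly.
decimal_value = unicodedata.decimal(devanagari_seven)  # 7
```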

> As far as I can tell, error handling is one of:
>     strict, replace, ignore, xmlcharrefreplace, or something_else
> What I don't understand is whether, in the ignore or something_else
> cases, there is any chance that digits will show up anywhere that
> they would not if these characters were treated as a character like '?'?
> 
> Can someone either give me definitive "why not" or (preferably) give
> me a test case that shows where that interpretation does not hold.

In the "ignore" case, no output is produced at all for the unencodable
character; '?' would be treated the same way (it is also
unencodable).

In the something_else case, a user-defined error handler could
treat the error any way it liked, e.g. by encoding every letter
u'A' as the digit '0'. This might differ from the way that error
handler would treat '?'.
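
As a rough illustration of the two cases, here is a sketch using the
ordinary "ascii" codec as a stand-in, since the decimal encoder is a
C-level function and is not called directly here. The handler name
"digit_zero" and the sample text are made up for this example;
codecs.register_error() is the standard registration call.

```python
import codecs

def digit_zero(exc):
    # Hypothetical handler: emit one "0" for each unencodable character.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # Return the replacement text and the position at which to resume.
    return ("0" * (exc.end - exc.start), exc.end)

codecs.register_error("digit_zero", digit_zero)

text = "12\N{GREEK SMALL LETTER ALPHA}3"

# "ignore": the unencodable character simply produces no output.
ignored = text.encode("ascii", "ignore")      # b"123"

# Custom handler: a digit appears where the unencodable character was.
zeroed = text.encode("ascii", "digit_zero")   # b"1203"
```

With "ignore" the result parses as 123; with the custom handler it
parses as 1203, so a digit has shown up where none was in the input.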

Regards,
Martin