Python & Unicode decimal interpretation

"Martin v. Löwis" martin at v.loewis.de
Sat Dec 3 13:31:53 EST 2005


Scott David Daniels wrote:
>>  >>> int(u"\N{DEVANAGARI DIGIT SEVEN}")
>> 7
> 
> OK, That much I have handled.  I am fiddling with direct-to-number
> conversions and wondering about cases like
>    >>> int(u"\N{DEVANAGARI DIGIT SEVEN}" + XXX
>            + u"\N{DEVANAGARI DIGIT SEVEN}")

int() passes NULL as the error mode, which is equivalent to "strict". So if
the string contains an unencodable character, you get a UnicodeError.
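
A small check of the strict behaviour, as a sketch only. It assumes a
current Python 3 interpreter, where the u prefix is no longer needed.
EURO SIGN stands in for XXX as an arbitrary non-digit outside Latin-1, and
parses_as_int is a hypothetical helper name. It catches ValueError, the
base class of UnicodeError, so it does not depend on which of the two a
given version raises.

```python
seven = "\N{DEVANAGARI DIGIT SEVEN}"


def parses_as_int(text):
    """Return True if int() accepts text, False if it raises ValueError."""
    try:
        int(text)
    except ValueError:  # UnicodeError is a subclass of ValueError
        return False
    return True


# Two Devanagari sevens in a row are accepted as the number 77.
print(int(seven + seven))

# An unencodable character between the digits makes int() fail.
print(parses_as_int(seven + "\N{EURO SIGN}" + seven))
```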

> I don't really understand how the "ignore" or "something_else"
> cases get caused by python source [where they come from].  Are they
> only there for C-program access?

Neither. This code is dead.

>> In the "ignore" case, no output is produced at all, for the unencodable
>> character; this is the same way that '?' would be treated (it is
>> also unencodable).
> 
> If I understand you correctly -- I can consider the digit stream to stop
> as soon as I hit a non-digit (except for handling bases 11-36).

No. In "ignore" mode, a codec doesn't stop at the unencodable character.
Instead, it skips it, continuing with the next character.

I mistakenly said that this would also happen to '?' (question mark);
that is incorrect. PyUnicode_EncodeDecimal copies all Latin-1 characters
to the output, Latin-1-encoded, so '?' would appear in the output
even in "ignore" mode.
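
To make the difference concrete, here is a rough pure-Python model of the
behaviour described above. It is not the C implementation:
encode_decimal_sketch is a hypothetical name, and the whitespace-to-space
step and the exception arguments are assumptions on my part. Only the
"strict" and "ignore" modes are modelled.

```python
import unicodedata


def encode_decimal_sketch(text, errors="strict"):
    """Model of the described behaviour; returns Latin-1 encoded bytes."""
    out = []
    for index, ch in enumerate(text):
        if ch.isspace():
            out.append(" ")  # assumption: any whitespace becomes a space
            continue
        digit = unicodedata.decimal(ch, None)
        if digit is not None:
            out.append(str(digit))  # any Unicode decimal digit -> ASCII digit
            continue
        if 0 < ord(ch) < 256:
            out.append(ch)  # Latin-1 characters are copied through
            continue
        # Everything else counts as unencodable.
        if errors == "strict":
            raise UnicodeEncodeError(
                "decimal", text, index, index + 1, "invalid decimal string"
            )
        if errors == "ignore":
            continue  # skip this character and carry on with the next one
        raise ValueError("error mode not modelled in this sketch: %r" % errors)
    return "".join(out).encode("latin-1")


seven = "\N{DEVANAGARI DIGIT SEVEN}"
print(encode_decimal_sketch(seven + "?" + seven, "ignore"))
print(encode_decimal_sketch(seven + "\N{EURO SIGN}" + seven, "ignore"))
```

In this model the question mark survives "ignore" mode because it is a
Latin-1 character, while the euro sign is dropped and the digits on either
side of it end up adjacent.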

Handling of bases is not done in the function at all. Instead, the
callers of PyUnicode_EncodeDecimal deal with number formats
(base, prefix, exponent syntax, etc.). They assume ASCII
bytes.
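
Seen from Python, that split looks roughly like this. The sketch assumes a
current Python 3 interpreter, where int() and float() still map Unicode
decimal digits to ASCII before parsing; the variable names are only for
illustration.

```python
one = "\N{DEVANAGARI DIGIT ONE}"
two = "\N{DEVANAGARI DIGIT TWO}"

# The digit conversion knows nothing about bases, prefixes or exponents;
# the ASCII letters and the "0x" prefix are interpreted by the caller.
hex_value = int(one + "f", 16)        # parsed like the ASCII text "1f"
prefixed = int("0x" + one + "f", 16)  # parsed like the ASCII text "0x1f"
float_value = float(one + "e" + two)  # parsed like the ASCII text "1e2"

print(hex_value, prefixed, float_value)
```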

Regards,
Martin


