[Tutor] three numbers for one

Steven D'Aprano steve at pearwood.info
Sun Jun 9 17:22:43 CEST 2013


On 10/06/13 00:26, Oscar Benjamin wrote:
> On 8 June 2013 06:49, eryksun <eryksun at gmail.com> wrote:
>> On Fri, Jun 7, 2013 at 11:11 PM, Jim Mooney <cybervigilante at gmail.com> wrote:
>>> I'm puzzling out the difference between isdigit, isdecimal, and
>>> isnumeric. But at this point, for simple practice programs, which is
>>> the best to use for plain old 0123456789 , without special characters?
>>
>> The isnumeric, isdigit, and isdecimal predicates use Unicode character
>> properties that are defined in UnicodeData.txt:
>>
>> http://www.unicode.org/Public/6.1.0/ucd
>>
>> The most restrictive of the 3 is isdecimal. If a string isdecimal(),
>> you can convert it with int() -- even if you're mixing scripts:
>>
>>      >>> import unicodedata
>>      >>> unicodedata.name('\u06f0')
>>      'EXTENDED ARABIC-INDIC DIGIT ZERO'
>>      >>> unicodedata.decimal('\u06f0')
>>      0
>>      >>> '1234\u06f0'.isdecimal()
>>      True
>>      >>> int('1234\u06f0')
>>      12340
>
> I didn't know about this. In the time since this thread started a
> parallel thread has emerged on python-ideas and it seems that Guido
> was unaware of these changes in Python 3:

Python is a pretty big language (although not as big as, say, Java). Nobody knows every last feature in the language and standard library, and by Guido's own admission, he's pretty much stuck in the ASCII-only world.


> http://mail.python.org/pipermail/python-ideas/2013-June/021216.html
>
> I don't think I like this behaviour: I don't mind the isdigit,
> isdecimal and isnumeric methods but I don't want int() to accept
> non-ascii characters. This is a reasonable addition to the unicodedata
> module but should not happen when simply calling int().

Why not? What problem do you see?

Decimal digits are perfectly well defined. There is no ambiguity in what counts as a decimal digit and what doesn't.
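
To make the distinction concrete, here is a minimal sketch, assuming Python 3, of how the three predicates sort a few characters. The helper name describe() is made up for illustration; the str methods and unicodedata.category() are standard.

```python
import unicodedata

def describe(ch):
    """Return (isdecimal, isdigit, isnumeric) for a single character."""
    return (ch.isdecimal(), ch.isdigit(), ch.isnumeric())

# ASCII '5' and ARABIC-INDIC DIGIT FIVE are decimal digits:
# all three predicates are True.
print(describe('5'))
print(describe('\u0665'))

# SUPERSCRIPT TWO counts as a digit and as numeric, but not as a decimal digit.
print(describe('\u00b2'))

# VULGAR FRACTION ONE HALF is numeric only.
print(describe('\u00bd'))

# Decimal digits carry the Unicode general category 'Nd'.
print(unicodedata.category('\u0665'))
```

Only the characters for which isdecimal() is True are the ones int() will accept as digits.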


> To answer Jim's original question, there doesn't seem to be a function
> to check for only plain old 0-9

Why would you want to? For no extra effort, you can handle numbers written in just about any language. You get this for free. Why would you want to work *harder* in order to be *less useful*?
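
As a rough illustration of "for free", assuming Python 3: the same year written with ASCII, Arabic-Indic and Devanagari digits converts with a plain int() call. The sample strings are my own, chosen only for the demonstration.

```python
samples = [
    '2013',                      # ASCII digits
    '\u0662\u0660\u0661\u0663',  # ARABIC-INDIC digits 2, 0, 1, 3
    '\u0968\u0966\u0967\u0969',  # DEVANAGARI digits 2, 0, 1, 3
]

for text in samples:
    # Each string is made of decimal digits, so int() accepts it unchanged.
    print(text.isdecimal(), int(text))
```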



> but you can make your own easily enough:
>
> >>> def is_ascii_digit(string):
> ...     return not (set(string) - set('0123456789'))

That's buggy, because it claims that '' is an ascii digit.
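
A quick check of the quoted set-difference version shows the problem. The function is renamed is_ascii_digit_setdiff here purely to keep it distinct; the body is the one quoted above.

```python
def is_ascii_digit_setdiff(string):
    # An empty set difference means "no characters outside 0-9",
    # which is also the case when there are no characters at all.
    return not (set(string) - set('0123456789'))

print(is_ascii_digit_setdiff('123'))   # True
print(is_ascii_digit_setdiff('12a'))   # False
print(is_ascii_digit_setdiff(''))      # True, although '' contains no digits
```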


This is likely to be quicker, at least for large strings if not for small ones:

def is_ascii_digit(string):
     # bool() makes the empty string return False rather than ''.
     return bool(string) and all(c in '0123456789' for c in string)



> An alternative method depending on where your strings are actually
> coming from would be to use byte-strings or the ascii codec. I may
> consider doing this in future; in my own applications if I pass a
> non-ascii digit to int() then I definitely have data corruption.

It's not up to built-ins like int() to protect you from data corruption. Would you consider it reasonable for me to say "in my own applications, if I pass a number bigger than 100, I definitely have data corruption, therefore int() should not support numbers bigger than 100"?
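
If an application really does treat non-ASCII digits as corruption, that check can live in the application rather than in int(). A minimal sketch, assuming Python 3; parse_ascii_int() is a hypothetical helper, not anything in the standard library, and it deliberately rejects signs and whitespace as well.

```python
def parse_ascii_int(text):
    """Hypothetical helper: convert text to int, allowing only ASCII 0-9."""
    if not text or any(c not in '0123456789' for c in text):
        raise ValueError('expected ASCII digits only, got %r' % (text,))
    return int(text)

print(parse_ascii_int('12340'))

try:
    # Ends with EXTENDED ARABIC-INDIC DIGIT ZERO, which int() alone would accept.
    parse_ascii_int('1234\u06f0')
except ValueError as err:
    print('rejected:', err)
```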


> Then
> again it's unlikely that the corruption would manifest itself in
> precisely this way since only a small proportion of non-ascii unicode
> characters would be accepted by int().

Indeed :-)



-- 
Steven

