[Tutor] three numbers for one

Oscar Benjamin oscar.j.benjamin at gmail.com
Mon Jun 10 14:55:18 CEST 2013


On 9 June 2013 16:22, Steven D'Aprano <steve at pearwood.info> wrote:
> On 10/06/13 00:26, Oscar Benjamin wrote:
>>
>> I don't think I like this behaviour: I don't mind the isdigit,
>> isdecimal and isnumeric methods but I don't want int() to accept
>> non-ascii characters. This is a reasonable addition to the unicodedata
>> module but should not happen when simply calling int().
>
> Why not? What problem do you see?

I don't know. I guess I just thought I understood what it was doing
but now realise that I didn't.

> Decimal digits are perfectly well defined. There is no ambiguity in what
> counts as a decimal digit and what doesn't.

Yes, but I thought it was using a different unambiguous and (for me)
easier to understand definition of decimal digits. I guess I'm just
coming to realise exactly what Python 3's unicode support really
means, and in many cases it means the interpreter is doing things
that I don't want or need.
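To make concrete what I mean, a quick sketch of the behaviour in
question (Arabic-Indic digits chosen as an arbitrary example):

```python
# Python 3's int() accepts any Unicode decimal digits, not just ascii 0-9.
print(int('123'))        # plain ascii digits -> 123
print(int('١٢٣'))        # Arabic-Indic digits U+0661..U+0663 -> 123

# str.isdigit is likewise Unicode-aware:
print('١٢٣'.isdigit())   # True
```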

For example I very often pipe streams of ascii numeric text from one
program to another. In some cases the cost of converting to/from
decimal is actually significant, and Python 3 adds to this both with
a more complex conversion and with the encoding/decoding part of its
io stack. I'm wondering whether I should really just be using binary
mode for this kind of thing in Python 3, since that at least removes
an unnecessary part of the stack.
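As a sketch of the binary-mode approach (the in-memory bytes object
here stands in for a binary-mode stream): int() also accepts bytes,
and bytes are inherently ascii-only, so non-ascii digits fail loudly.

```python
# Stand-in for data read from a binary-mode pipe (no text-IO decoding).
data = b'123\n456\n'
total = sum(int(line) for line in data.splitlines())
print(total)  # 579

# Bytes containing UTF-8 encoded Arabic-Indic digits are rejected:
try:
    int('١٢٣'.encode('utf-8'))
except ValueError:
    print('non-ascii digits rejected')
```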

In a previous thread where I moaned about the behaviour of the int()
function, Eryksun suggested that it would be better if int() wasn't
used for parsing strings at all. Since then I've thought about it and
I agree. There should be separate functions for each kind of
string-to-number conversion, with one just for ascii decimal.
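Such a function might look something like this (my own sketch, not a
proposal for a particular name or signature):

```python
import re

# [0-9] in a character class is ascii-only, unlike \d which is
# Unicode-aware in Python 3.
_ASCII_INT = re.compile(r'[+-]?[0-9]+\Z')

def int_ascii(string):
    """Convert a string of ascii decimal digits to int, rejecting
    everything else, including non-ascii Unicode digits."""
    if not _ASCII_INT.match(string):
        raise ValueError('not an ascii decimal integer: %r' % (string,))
    return int(string)

print(int_ascii('123'))   # 123
print(int_ascii('-45'))   # -45
```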

>> To answer Jim's original question, there doesn't seem to be a function
>> to check for only plain old 0-9
>
> Why would you want to? For no extra effort, you can handle numbers written
> in just about any language.

I can see how this would be useful if numbers are typed in
interactively or perhaps given at the command line. There are many
other cases where this is not needed or desired though.

> You get this for free. Why would you want to
> work *harder* in order to be *less useful*?
>
>> but you can make your own easily enough:
>>
>> >>> def is_ascii_digit(string):
>> ...     return not (set(string) - set('0123456789'))
>
> That's buggy, because it claims that '' is an ascii digit.
>
> This is likely to be quicker, if not for small strings at least for large
> strings:
>
> def is_ascii_digit(string):
>     return string and all(c in set('0123456789') for c in string)

I wasn't really worried about speed but in that case I might try:

is_ascii_digit = frozenset('0123456789').__contains__

def is_ascii_digits(string):
    return string and all(map(is_ascii_digit, string))

Although Eryksun's regex is probably faster.

>> An alternative method depending on where your strings are actually
>> coming from would be to use byte-strings or the ascii codec. I may
>> consider doing this in future; in my own applications if I pass a
>> non-ascii digit to int() then I definitely have data corruption.
>
> It's not up to built-ins like int() to protect you from data corruption.
> Would you consider it reasonable for me to say "in my own applications, if I
> pass a number bigger than 100, I definitely have data corruption, therefore
> int() should not support numbers bigger than 100"?

I expect the int() function to reject invalid input. I thought that
its definition of invalid matched up with my own.
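For completeness, the ascii-codec approach I mentioned above could be
as simple as this sketch: encoding first makes any non-ascii character
fail loudly before int() ever sees it.

```python
def int_strict(string):
    # raises UnicodeEncodeError for any non-ascii character;
    # int() happily accepts the resulting ascii bytes.
    return int(string.encode('ascii'))

print(int_strict('42'))  # 42
```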


Oscar

