[Python-Dev] Python and the Unicode Character Database

Thu Dec 2 22:57:45 CET 2010

On 12/2/2010 4:48 PM, "Martin v. Löwis" wrote:
> Am 02.12.2010 22:30, schrieb Steven D'Aprano:
>> Martin v. Löwis wrote:
>>>>> Then these users should speak up and indicate their need, or somebody
>>>>> should speak up and confirm that there are users who actually want
>>>>> '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing
>>>>> system in which '١٢٣٤.٥٦e4' means 12345600.0.
>>>> I'm not sure what you're after here.
>>>
>>> That the current float() constructor accepts tons of bogus character
>>> strings and accepts them as numbers, and that it should stop doing so.
>>
>> What bogus characters do the float() and int() constructors accept? As
>> far as I can see, they only accept numerals.
>
> Not bogus characters, but bogus character strings. E.g. strings that mix
> digits from different scripts, and mix them with the Python decimal
> separator.
>
>>> Notice that Python does *not* currently support printing numbers in
>>> other scripts - even though this may actually be more useful than
>>> parsing.
>>
>> Lack of one function, even if more useful, does not imply that an
>> existing function should be removed.
>
> No. But if the specific function(ality) is not useful and
> underspecified, it should be removed.
>
>> So your problems with the current behaviour are:
>>
>> (1) in some unspecified way, it's not done correctly;
>
> No. My main concern is that it is not properly specified. If it were
> specified, I could then tell you what precisely is wrong about it.
> Right now, I can only give examples for input that it should not accept,
> and examples of input that it should, but does not accept.
>
>> (2) it belongs somewhere other than float() and int().
>
> That's only because it also needs a parameter to specify what syntax to
> follow, somehow. That parameter could be explicit or implicit, and it
> could be to float or to some other function. But it must be available,
> and is not.
>
>> That second is awfully close to bike-shedding. Since you accept that
>> Python *should* have the current behaviour
>
> No, I don't. I think it behaves incorrectly, accepting garbage input and
> guessing some meaning out of it.
>
>> - how the current behaviour is incorrect;
>
> See above: it accepts strings that do not denote real numbers in any
> writing system, and, despite the claim that the feature is there to
> support other writing systems, actually does not truly support other
> writing systems.
>
>> - your suggestions for correcting it; and
>
> Make the current implementation exactly match the current documentation.
> I think the documentation is correct; the implementation is wrong.
>
>> - a concrete suggestion for where you would like to see the behaviour
>> moved to, and why that would be better than where it currently is.
>
> The current behavior should go nowhere; it is not useful. Something very
> similar to the current behavior (but done correctly) should go into the
> locale module.

I agree with everything Martin says here. I think the basic premise is: 
you won't find strings "in the wild" that use non-ASCII digits but do 
use the ASCII dot as a decimal point. And that's what float() is looking 
for. (And that doesn't even begin to address what it expects for an 
exponent 'e'.)
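
To make the point concrete, here is a small sketch of the inputs under
discussion. It assumes a current CPython 3 interpreter; the helper name
parses() is made up for this illustration and is not part of any
library. The strings are written with \u escapes so that the digit
order is unambiguous.

```python
def parses(text):
    """Return True if float() accepts text (hypothetical helper)."""
    try:
        float(text)
    except ValueError:
        return False
    return True

# Arabic-Indic digits with the ASCII '.' as the decimal point: '١٢٣٤.٥٦'
arabic_indic = float('\u0661\u0662\u0663\u0664.\u0665\u0666')

# The same string with an ASCII exponent appended: '١٢٣٤.٥٦e4'
with_exponent = float('\u0661\u0662\u0663\u0664.\u0665\u0666e4')

# ASCII, Arabic-Indic and Devanagari digits mixed in one string
mixed_scripts = float('1\u0662\u0969.5')

# Arabic-Indic digits with the Arabic decimal separator U+066B,
# which is arguably what such a string would use "in the wild"
native_separator_ok = parses('\u0661\u0662\u0663\u0664\u066b\u0665\u0666')

print(arabic_indic, with_exponent, mixed_scripts, native_separator_ok)
```

On the interpreters I would expect this to run on, the first three
calls succeed and the last input is rejected, which is the mismatch
Martin describes: input that should not be accepted is, and input that
should be accepted is not.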

Eric.