[Python-Dev] Re: Be Honest about LC_NUMERIC [REPOST]

Tim Peters tim.one at comcast.net
Mon Sep 1 22:20:55 EDT 2003


[Tim]
>> In short, I can't be enthusiastic about the patch because it doesn't
>> solve the only relevant locale problem I've actually run into.  I
>> understand that it may well solve many I haven't run into.

[Guido]
> At this point in your life, Tim, is there any patch you could be truly
> enthusiastic about? :-)

Yes, but I can't be enthusiastic about a hack, and especially not about a
hack that (as I said) doesn't solve the real-life problem spambayes has.

> I'm asking because I'd like to see the specific problem that started
> this thread solved,

At this point, can you state what that specific problem was <wink>?

> if necessary using a compromise that means the solution isn't perfect.
>  I'm even willing to take a step back in the status quo, given that the
> status quo isn't perfect anyway, and that compromises mean something has
> to give.
>
> *Maybe* the right solution is that we have to accept a
> hard-to-understand overcomplicated piece of code that we don't know
> how to maintain (but for which the author asserts that we won't have
> to do much maintenance in the foreseeable future).

I'm finding it hard to believe that anyone other than me and the author has
actually read the patch!  It's easy to understand.  It's over-complicated
for what Python needs, and would be dead easy to understand if the fluff got
chopped.  The *fear* of this code expressed in this thread is baffling to
me, but I suspect it's due to initial shell-shock from the sheer bulk of the
unnecessary code in the patch.

> But *maybe* there's a simpler solution.

>> OTOH, the specific problem I'm acutely worried about would be better
>> addressed by changing the way Python marshals float values.

> So solve it.

Sorry, I don't foresee making time to do that.

> The approach used by binary pickles seems entirely reasonable.

It's the best binary format we've got.  It has problems with 754's special
values (as recorded in PEP 42), and loses precision for VAX D format
doubles (or, more generally, for any double format with greater dynamic
range or precision than IEEE-754 double).  A decimal string is actually
better on all those counts (dynamic
range is no problem then; and *some* platforms can preserve IEEE special
values via to-string-and-back conversion (Windows cannot)).  Decimal strings
lose on correctness only because of locale variations; depending on
platform, they may also lose on speed, but I don't give much weight to speed
here.
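
Concretely, the two approaches amount to something like the following
(a sketch only, not the actual marshal or pickle code; the helper names
are invented for illustration):

    import struct

    def dump_float_binary(x):
        # Big-endian IEEE-754 double -- the format binary pickles use.
        # Anything wider than IEEE-754 double (e.g. VAX D) gets
        # narrowed on the way in.
        return struct.pack('>d', x)

    def load_float_binary(s):
        return struct.unpack('>d', s)[0]

    def dump_float_decimal(x):
        # 17 significant digits are enough to round-trip any IEEE-754
        # double, but what the text means depends on the C library and
        # on LC_NUMERIC.
        return '%.17g' % x

    def load_float_decimal(s):
        return float(s)    # goes through the platform strtod()

For ordinary doubles the binary round-trip reproduces x exactly; the
decimal route is exact too given 17 digits, except that IEEE special
values may not survive the to-string-and-back trip (Windows can't do
it), and a non-C LC_NUMERIC changes what the text means.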

> All we need to do is change the .pyc magic number.  (There's undoubtedly
> user code in the world that would break because it requires
> interoperability between Python versions.  So let the marshal module grow
> a way to specify the format.)

> ...
> Fair enough.  So *if* we decide to use the donated conversion code, we
> should start by using it unconditionally.  I predict that at some
> point in the future we'll find a platform whose quirks are not handled
> by the donated code, and where it's simpler to use a correct native
> equivalent than to try to fix the donated code; but I expect that
> point to be pretty far in the future, *or* the platform to be pretty
> far from the main stream.

Do read the patch.  It amounts to

    if decimal_point != '.':
        s/./decimal_point/

in one direction and

    if decimal_point != '.':
        s/decimal_point/./

in the other.  It gets its idea of decimal_point from the platform
localeconv(), so if that doesn't lie it's hard to get wrong.  In the
double->string direction, though, the substitution code appears inadequate
to me, since it doesn't try to strip out thousands-separator characters,
which some locales produce.  For example, on Windows,

>>> locale.setlocale(locale.LC_ALL, "german")
'German_Germany.1252'
>>> locale.format("%g", 123456.0, 1)
'123.456'
>>>

AFAICT, the patch will leave that output as "123.456".  The string->double
direction is much easier to be confident about, precisely because thousands
separators don't enter into it.
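
Spelled out in Python (my sketch of what the patch effectively does,
with invented helper names -- this is not the patch's code), the two
substitutions look like this:

    import locale

    def to_locale(s):
        # string -> double direction: before handing a Python-style
        # literal like '3.14' to the platform strtod()/atof(), swap in
        # whatever the locale calls a decimal point.
        dp = locale.localeconv()['decimal_point']
        if dp != '.':
            s = s.replace('.', dp)
        return s

    def from_locale(s):
        # double -> string direction: swap the locale's decimal point
        # back to '.'.  Nothing strips thousands_sep, so a grouped
        # string like the German '123.456' above would pass through
        # unchanged and then mean something entirely different.
        dp = locale.localeconv()['decimal_point']
        if dp != '.':
            s = s.replace(dp, '.')
        return s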

>>> ...
>>> Here's yet another idea (which probably has flaws as well): instead
>>> of substituting the locale's decimal separator, rewrite strings like
>>> "3.14" as "314e-2" and rewrite strings like "3.14e5" as "314e3",
>>> then pass to strtod(), which assigns the same meaning to such
>>> strings in all locales.

[long example]

> I fail to see the relevance of the example to my proposed hack, except
> as a proof that the world isn't perfect -- but we already know that.

The point is that only perfect-rounding string->float routines can guarantee
to produce identical doubles from mathematically equivalent decimal string
representations.  Finding counterexamples for non-perfect-rounding
libraries is extremely difficult and time-consuming without studying
the source code of a specific library intensely (almost certainly with more
intensity than its author gave to writing it!), and I don't have time for
that.  It's a potential vulnerability.  Answering whether it's an actual
vulnerability in practice is much more work than I can give to it now.
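
For the record, "finding a counterexample" here means finding two
mathematically equivalent spellings that come back as different bit
patterns.  A blind search would look something like this (sketch,
names mine); I don't expect it to pay off without knowing where the
library cuts corners:

    import random
    import struct

    def double_bits(x):
        # Compare doubles by their exact IEEE-754 bit pattern.
        return struct.unpack('>Q', struct.pack('>d', x))[0]

    def probe(ntrials=1000000):
        # '1.23456789' and '123456789e-8' denote the same real number;
        # a perfect-rounding strtod() must give them the same double,
        # a sloppy one only probably does.
        for i in range(ntrials):
            digits = '%09d' % random.randrange(10 ** 9)
            a = float(digits[0] + '.' + digits[1:])
            b = float(digits + 'e-8')
            if double_bits(a) != double_bits(b):
                return digits
        return None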

> Under my proposal, the number of digits converted would never change,
> so any sensitivity of the algorithm used to the number of digits
> converted would be irrelevant.  I note that the strtod.c code that's
> currently in the Python source tree uses a similar (though opposite)
> trick: it converts the number to the form 0.<fraction>E<expt> before
> handing it off to atof().  So my proposal still stands.  I'm happy to
> entertain a proof that it's flawed but not one where the flawed input
> has over 5000 digits *and* depends on a flaw in the platform routines.

As hacks go, it's probably OK.  I don't think it can fail on glibc-based
platforms because I think they do perfect-rounding conversions; the Windows
conversion routines aren't perfect-rounding, but we don't have their source
code so it's impossible for me to give examples offhand where different
results could be delivered, or even to swear that there are (or aren't) such
cases.  I give it a lot of credit for being truly threadsafe.
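
For concreteness, here's how I read the hack I'm calling "probably OK"
(a sketch in Python; the real thing would do this at the C level just
before calling strtod()):

    def shift_decimal_point_into_exponent(s):
        # '3.14'   -> '314e-2'
        # '3.14e5' -> '314e3'
        # The rewritten string contains no decimal point at all, so
        # strtod() parses it the same way under every LC_NUMERIC, and
        # the number of digits handed to it never changes.
        # (Assumes an ordinary decimal literal, not 'inf'/'nan'.)
        s = s.lower()
        if 'e' in s:
            mantissa, exp = s.split('e')
            exponent = int(exp)
        else:
            mantissa, exponent = s, 0
        if '.' in mantissa:
            whole, frac = mantissa.split('.')
            mantissa = whole + frac
            exponent -= len(frac)
        return '%se%d' % (mantissa, exponent)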

Note that it doesn't address the other half of the locale conversion problem
(double->string), which, as I noted above, is the harder half (due to
thousands_sep becoming an additional issue).



