[Python-Dev] Re: Be Honest about LC_NUMERIC [REPOST]

Mon Sep 1 15:30:23 EDT 2003

[Martin]
> Ok. Are you then, overall, in favour of taking the proposed approach?

It solves part of one problem; I'd rather solve all of it, but can't
volunteer time to do that.

> It is not thread-safe, but only so if somebody calls setlocale in a
> different thread, and that is known not to be thread-safe - so I could
> live with that limitation.

There's no way of using C's locale gimmicks that's threadsafe, short of all
callers agreeing to follow a beyond-standard-C exclusion protocol -- which
is the same as saying "no way" in reality.  So that's part of one problem no
patch of this ilk *can* solve.  It's not that the patch doesn't try hard
enough, it's that this approach is inherently inadequate to solve all of
this particular problem.

> It is just that the patch does not "feel" right, given that there must
> be "native" locale-inaware parsing of floating point constants
> somewhere on each platform (at least on those that support C++98).

I haven't found one on Windows (doesn't mean it doesn't exist, does mean
it's apparently well hidden if it does exist).

> ...
> One of my early concerns (and I still have this concern) is that the
> contributors here appear to take the position "We have this fine code
> developed elsewhere, it seems to work, so we copy it. We don't
> actually have to understand this code". I would feel more comfortable
> if the code was written from scratch for usage in Python, with just
> the ideas borrowed from glib. Proper attribution of contributors and
> licensing are just one aspect, we really need the submitter of the
> code to fully understand it, and be capable of reacting to problems
> quickly.

The patch is certainly more code than is needed to solve the part of the
problem it does solve.  For example, things like

typedef char    gchar;
typedef short   gshort;
typedef long    glong;
typedef int     gint;

introduce silly synonyms ("silly" == typing gshort instead of short does
nothing except introduce possibilities for confusion); there are many
definitions like

#define g_ascii_isupper(c) \
  ((g_ascii_table[(guchar) (c)] & G_ASCII_UPPER) != 0)

that are never referenced; the code caters to C99's hexadecimal float
literals but Python doesn't; and so on.  If someone who understood Python
internals read my earlier two-sentence description of how the patch works,
they could write something that works equally well for Python's purposes
with a fraction of the code introduced by the patch.

> ...
> The PEP should also point out deficiencies of the approach taken,
> e.g. the issue of spelling NaN, inf, etc. If it can be determined not
> to be an issue in real life (i.e. for all interesting platforms), this
> should be documented as well.

Well, the patch doesn't even pretend to address other issues with
portability of float literals.  They routinely come up on c.l.py, so of
course users bump into them; when someone is motivated enough to file a bug
report, I shuffle it off to PEP 42, under the "non-accidental 754 support"
heading (which covers many fp issues beyond just literals, of course).

[James Henstridge]
> ...
> Your average localised package usually switches to the user's
> preferred locale on startup, so that it can display strings and
> messages, and occasionally wants to read/write numbers in a locale
> independent format (usually when saving/loading files).  The most
> common way of doing this is the setlocale/strtod/setlocale combo,
> which has thread safety problems and possible reentrancy problems if
> done wrong.
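
For concreteness, here's a rough Python rendering of that
setlocale/strtod/setlocale combo.  It's a sketch only:
parse_float_in_c_locale is a name made up for illustration, and
locale.atof stands in for C's strtod.  The window between the two
setlocale calls is where another thread can change the process-wide
locale, or get bitten by our change to it.

```python
import locale


def parse_float_in_c_locale(text):
    """Parse text as a float with LC_NUMERIC temporarily forced to "C".

    Hypothetical helper, not thread-safe: the locale is process-wide,
    so other threads see (and can undo) the temporary switch.
    """
    saved = locale.setlocale(locale.LC_NUMERIC)    # query current setting
    locale.setlocale(locale.LC_NUMERIC, "C")       # switch to "C"
    try:
        return locale.atof(text)                   # locale-aware parse
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved) # restore on any exit
```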

I became acutely aware of the problems here due to the spambayes project,
part of which embeds Python in Outlook 2000/2002.  Outlook routinely runs
more than a dozen threads, and by observation changes locale "frequently".
None of that is documented, Python has no influence over when or why Outlook
decides to switch locale, and neither can Python exclude Outlook's other
threads when the Outlook thread Python is running in becomes active.

Mark Hammond solved our problems there by forcing locale back to "C" every
chance he gets; that's an anti-social and probabilistic approach, but
appears to be the best spambayes can do today.  Having spambayes grow its
own float<->string code doesn't help, because the worst problem spambayes
had is that Python's marshal format uses ASCII strings to store float
literals in .pyc files, so that Python itself can (and does) load insane
float values out of .pyc files if LC_NUMERIC isn't "C" at the time a .pyc
file gets imported.

The only thing that could truly solve spambayes's problems here is for
Python to use a thoroughly thread-safe string->float routine, where
"thoroughly" includes not caring whether other threads switch locale in
mid-stream.

An irony is that Microsoft's *native* locale gimmicks are thread-safe (each
Win32 thread has its own idea of Win32 locale); why Outlook is even mucking
with C's thread-braindead notion of locale is a mystery.

In short, I can't be enthusiastic about the patch because it doesn't solve
the only relevant locale problem I've actually run into.  I understand that
it may well solve many I haven't run into.

OTOH, the specific problem I'm acutely worried about would be better
addressed by changing the way Python marshals float values.
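
One way that could go, sketched under my own assumptions rather than
as a description of what marshal does or will do: store the 8 raw
IEEE-754 bytes of the double instead of a decimal string.  No
string->float conversion happens on load then, so LC_NUMERIC can't
affect the result.  The helper names are hypothetical.

```python
import struct


def pack_float(value):
    """Return value as 8 bytes: little-endian IEEE-754 binary64."""
    return struct.pack("<d", value)


def unpack_float(data):
    """Inverse of pack_float; no decimal string is parsed."""
    return struct.unpack("<d", data)[0]
```

A float that goes through pack_float and then unpack_float comes back
bit-for-bit identical, whatever the locale happens to be at the time.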

[Guido]
> Maybe at least we can detect platforms for which we know there is a
> native conversion in the library, and not use the hack on those?

I rarely find that piles of conditionalized code are more comprehensible or
reliable; they usually result in mysterious x-platform differences, and
become messier over time as we stumble into more platform library bugs,
quirks, and limitations.

> ...
> Here's yet another idea (which probably has flaws as well): instead of
> substituting the locale's decimal separator, rewrite strings like
> "3.14" as "314e-2" and rewrite strings like "3.14e5" as "314e3", then
> pass to strtod(), which assigns the same meaning to such strings in
> all locales.

This is a harder transformation than s/./locale_decimal_point/.  It does
address the thread-safety issue.  Numerically it's flaky, as only a
perfectly-rounding string->float routine can guarantee to return bit-for-bit
identical results given equivalent (viewed as infinite precision) decimal
representations as inputs, and few platform string->float routines do
perfect rounding.
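
The rewriting itself is mechanical.  A minimal sketch, assuming plain
decimal literals only (no inf/nan, no hex floats, no error checking),
with a made-up function name:

```python
def drop_decimal_point(literal):
    """Rewrite a decimal literal so that it contains no '.'.

    Hypothetical helper.  Handles only the form
    [sign]digits[.digits][e[sign]digits].
    """
    text = literal.strip().lower()
    mantissa, _, exp_text = text.partition("e")
    exponent = int(exp_text) if exp_text else 0
    whole, _, frac = mantissa.partition(".")
    # Each fraction digit moved left of the point costs one power of 10.
    exponent -= len(frac)
    return "%s%se%d" % (whole, frac, exponent)


print(drop_decimal_point("3.14"))    # 314e-2
print(drop_decimal_point("3.14e5"))  # 314e3
```

Whether the platform's string->float routine then returns the same
bits for "314e-2" as it would have for "3.14" is the numeric question
raised above; the sketch does nothing about that.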

> This removes the question of what decimal separator is used by the
> locale completely, and thus removes the last bit of thread-unsafety
> from the code.  However, I don't know if underflow can cause the result
> to be different, e.g. perhaps 1.23eX could be computed but 123e(X-2)
> could not???  (Sounds pretty unlikely on the face of it since I'd expect
> any decent conversion algorithm to pretty much break its input down into
> a string of digits and an exponent, but I've never actually studied
> such algorithms in detail.)

Each library is likely to fail in its own unique ways.  Here's a cute one:

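"""
base = 1.2345678901234567

# The same digits as base, with the decimal point removed.
digits = "12345678901234567"

# Each pass appends one more trailing zero and lowers the exponent by
# one, so every string denotes exactly the same decimal value as base.
for exponent in range(-16, -15000, -1):
    string = digits + "0" * (-16 - exponent)
    string += "e%d" % exponent
    derived = float(string)
    assert base == derived, (string, derived)
"""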

On Windows, this first fails at exponent -5202, where float(string) delivers
a result a factor of 10 too large.  I was surprised it did that well!  Under
Cygwin Python 2.2.3, it consumed > 14 minutes of CPU time, but never failed.
I believe they're using a derivative of David Gay's excruciatingly complex
IEEE-754 perfect-rounding string<->float routines (which would explain both
why it didn't fail and why it consumed enormous CPU time; the code is
excruciatingly complex because it does perfect rounding quickly for "normal"
inputs, via a large variety of delicate speed tricks; when those tricks
don't apply, it has to simulate unbounded-precision arithmetic to guarantee
perfect rounding).