[spambayes-dev] Config files and locale

Tim Peters tim.one at comcast.net
Thu Jul 24 00:15:52 EDT 2003


[Mark Hammond]
> For those not following the main list, Outlook is having a few
> "locale" issues.  You will all have the pleasure of reading more
> about it soon, as I intend moving the discussion here - but that
> issue spawned a side-issue which I believe will affect all SpamBayes
> apps.

Note that I also spread this to the Python-Dev list, right before your msg
here arrived.

> I note that OptionsClass.py uses functions from the "locale" module
> to parse numbers.  Apart from the fact that these functions convert
> apparently insane values when not using a period as a decimal point,
> there is a more fundamental issue.

Python was designed to assume "C" locale is in effect, as the C standard
mandates must be the case when a program starts up.  Among other things,
it's the only way to get portable numerical literals.  Trying to cater to
random locales will not work in the end.

> If an application ever wants to ship a configuration file that
> includes a floating point number, we *must* assume an 'en' locale,
> (as we will be shipping 'x.x' in the file).

Not quite:  we must assume the "C" locale.  This is what Americans think of
as being the right thing for numbers, but thinking of it as 'en' truly
confuses the requirement:  Python works very hard to make sure the
LC_NUMERIC category remains in the "C" locale, even if you use
locale.setlocale() to *try* to set LC_NUMERIC to "en" (or to anything else).
Python can get screwed, though, by C code outside of Python that calls C's
setlocale() directly.

> This means we will force users to also use our conventions when
> editing the config file.

Well, the "C" locale conventions are mandated by the (international)
ANSI/ISO C standard.  It's a question of international sanity here, not of
demanding the whole world be American <wink>.  If the "C" locale defined
that comma was the radix point, then we'd have to follow that instead.

> This means that the way things stand, it is *not* possible for an app
> to ship with a new default of, say, 'spam_cutoff'.  It means that
> most examples of config files will be wrong for non 'us' locales, and
> that config files can't be shared amongst different users who may
> have different locales.
>
> Am I correct in seeing this as a problem?

Absolutely.
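
A minimal sketch of what "assume the C locale" could look like for a shipped
config value such as spam_cutoff. The helper names format_option and
parse_option are hypothetical and not part of SpamBayes; the sketch assumes a
current Python release, where float() and repr() do not consult the locale
and always use a period as the radix point.

```python
def format_option(value):
    """Serialize a float for the config file using '.' as the radix point."""
    return repr(float(value))


def parse_option(text):
    """Parse a config value written in "C" locale conventions.

    float() rejects a comma as the radix point, so a value such as
    "0,9" raises ValueError instead of being silently misread.
    """
    return float(text.strip())


# Round trip: the text written to the file parses back to the same value.
spam_cutoff = parse_option(" 0.9 ")
assert parse_option(format_option(spam_cutoff)) == spam_cutoff
```

A file written this way reads the same for every user, which is the property
the shared-config scenario above needs.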

> One error prone, slightly insane (but considering locale's existing
> behaviour, maybe not) option would be to use our own number parser.

There's a patch for that pending on a Python tracker (to make Python do
this -- Python's locale hacks simply don't work well, notwithstanding that
they work much better than having no locale hacks at all in the core).
Unfortunately, it sucks in a bunch of code from glibc, and it's unlikely we
can accept it in that form.

> Refuse to allow both a period and a comma in a number.  If either is
> given, assume it is the decimal.  Always write the config file in 'us'
> format.  There are a few variations on this we could take, depending
> on what we wanted to support, but I thought I would get it out there
> for discussion first.
>
> Any thoughts?

It may be a start, but I'm afraid it's impossible to solve all the deep
problems without changing the Python implementation.  As I detailed in a
(very) recent msg to the spambayes list, the locale in effect also
determines how float literals are loaded out of .pyc files.  Python
currently assumes and requires that "C" locale be in effect for the
LC_NUMERIC category; there are more dire warnings <0.5 wink> in the "For
extension writers and programs that embed Python" section of Python's locale
module docs.
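
For concreteness, the heuristic Mark proposes above (reject a number that
contains both a period and a comma, otherwise treat whichever one appears as
the decimal point, and always write with a period) might be sketched as
follows. The function names parse_number and write_number are hypothetical,
and this is only one of the variations he mentions, not code from
OptionsClass.py.

```python
def parse_number(text):
    """Parse a float, accepting either '.' or ',' as the radix point.

    A string containing both characters is rejected as ambiguous.
    """
    text = text.strip()
    if "." in text and "," in text:
        raise ValueError("ambiguous number %r: has both '.' and ','" % text)
    # Normalize a comma radix point to a period before handing off to float().
    return float(text.replace(",", "."))


def write_number(value):
    """Always write numbers with '.' as the radix point."""
    return repr(float(value))
```

Note the cost of the heuristic: a user who writes "1,000" meaning one
thousand gets 1.0 back, so grouping separators are simply not supported.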



