[I18n-sig] Passing unicode strings to file system calls

18 Jul 2002 00:11:10 +0200

"M.-A. Lemburg" <mal@lemburg.com> writes:

> > - it may not know what variables to consider. In particular, on Unix,
> >   it tries LANGUAGE, LC_ALL, LC_CTYPE, and LANG. In doing so, it makes
> >   a number of errors when trying to find the encoding:
> 
> That's the search order which GNU readline uses (at least
> at the time I wrote the code).

GNU readline does not check LANGUAGE, and it uses setlocale if
available (so you are talking about rarely-used fallback code).

> >   - it misses that LANGUAGE can contain colons to denote
> >     fallbacks, on GNU/Linux; with
> >     LANGUAGE=german:french LANG=de_DE.UTF-8, it returns
> >     ['de_DE', 'french']
> >     This is even worse: french is not the name of an encoding
> 
> Interesting. Is the format documented somewhere ? It should be
> easy to fix this.

Of LANGUAGE? I believe it's documented in the gettext documentation.
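
To illustrate the format, here is a minimal sketch that assumes the
gettext convention of a colon-separated priority list. The helper name
split_language is made up for this example and is not part of any
library.

```python
import os

def split_language(value):
    """Split a LANGUAGE-style value into its colon-separated entries."""
    # Empty entries (as in "de::fr") carry no information, so drop them.
    return [entry for entry in value.split(":") if entry]

# The entries only rank message catalogs; none of them names an encoding.
# "german:french" is the fallback value from the example quoted above.
print(split_language(os.environ.get("LANGUAGE", "german:french")))
```

Each entry is a language preference, so none of them should ever be
handed back as the encoding part of a locale.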

> > - it may not know the syntax of the environment variables. For
> >   example, the current implementation breaks for "de_DE@euro"; this is
> >   an SF bug report.
> 
> This should be fixable too. What does the '@euro' mean ? Does it
> have to do with currency ?

In a way. It is a "locale variant". A variant could be just about
anything. Common variants are @euro (used to denote the variant that
has the Euro for LC_MONETARY), @nynorsk (used to tell apart the two
Norwegian languages - now nb and no), and @xim, used for X Input
Methods (like @xim=kinput2). It could be used for many other things,
too.

You can fix the parsing of the variants, but you cannot infer the
encoding.
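
A rough sketch of the parsing part, assuming the common
language[_territory][.codeset][@modifier] layout. The function name
parse_locale_name is hypothetical. Parts that are absent come back as
None rather than being guessed.

```python
def parse_locale_name(name):
    """Split 'lang_TERRITORY.codeset@modifier' into its four parts.

    Missing parts are returned as None; nothing is inferred.
    """
    modifier = codeset = territory = None
    # Strip the parts from the right: modifier, then codeset, then territory.
    if "@" in name:
        name, modifier = name.split("@", 1)
    if "." in name:
        name, codeset = name.split(".", 1)
    if "_" in name:
        name, territory = name.split("_", 1)
    return name, territory, codeset, modifier

print(parse_locale_name("de_DE@euro"))    # ('de', 'DE', None, 'euro')
print(parse_locale_name("de_DE.UTF-8"))   # ('de', 'DE', 'UTF-8', None)
```

For "de_DE@euro" the codeset slot stays None: the name simply does not
contain that information.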

> Sure, but you normally only get the locale name and then
> have to make an educated guess for the encoding. 

That is my point: This algorithm must guess, and it *will* guess
wrong.

> If the encoding is known (e.g. by looking at the LANG environment
> variable), then that information should override the database
> information.

In this specific case (of the @euro domains), the LANG variable does
not explicitly mention the encoding. So that doesn't help.

> Hmm, the names returned by getdefaultlocale() and normalize()
> are standards. I wonder what Windows expects to see for
> setlocale().

What standards? Posix? That has never impressed Microsoft. Instead of
"fr_FR.cp1252", they accept "French_France.1252". That may even be
Posix-conforming, though, since Posix allows "<lang>_<country>.<codeset>".

Locale names are *not* standard. An algorithm that assumes that they
are is broken.

> I'd say, it's better than nothing :-)

Yes, that's why I propose to provide a replacement, and then deprecate
the existing function.
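
What such a replacement looks like is not spelled out here. One
possible shape, sketched under the assumption that the platform offers
setlocale and nl_langinfo(CODESET), is to ask the C library instead of
parsing environment variables. The function name
codeset_from_c_library is invented for this example.

```python
import locale

def codeset_from_c_library():
    """Return the codeset the C library reports for the user's locale, or None."""
    try:
        # Adopt the user's LC_CTYPE settings from the environment.
        # Note: this changes the locale of the whole process.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        return None  # the environment names a locale this system lacks
    if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
        return locale.nl_langinfo(locale.CODESET)
    return None  # no nl_langinfo(CODESET) on this platform

print(codeset_from_c_library())
```

This way nothing has to know the syntax of locale names at all; the
caveat is that it only works where the C library supports it.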

Regards,
Martin