[I18n-sig] Passing unicode strings to file system calls
Martin v. Loewis
martin@v.loewis.de
18 Jul 2002 00:11:10 +0200
"M.-A. Lemburg" <mal@lemburg.com> writes:
> > - it may not know what variables to consider. In particular, on Unix,
> > it tries LANGUAGE, LC_ALL, LC_CTYPE, and LANG. In doing so, it makes
> > a number of errors when trying to find the encoding:
>
> That's the search order which GNU readline uses (at least
> at the time I wrote the code).
GNU readline does not check LANGUAGE, and it uses setlocale if
available (so you are talking about rarely-used fallback code).
> > - it misses that LANGUAGE can contain colons to denote
> > fallbacks, on GNU/Linux; with
> > LANGUAGE=german:french LANG=de_DE.UTF-8, it returns
> > ['de_DE', 'french']
> > This is even worse: french is not the name of an encoding
>
> Interesting. Is the format documented somewhere ? It should be
> easy to fix this.
Of LANGUAGE? I believe it's documented in the gettext documentation.
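To illustrate the colon syntax, here is a minimal sketch of splitting a
LANGUAGE value into its fallback list. The helper name split_language is
made up for this example and is not part of the locale module. Note that
the entries only name languages for message lookup, so nothing in the
result can serve as an encoding.

```python
def split_language(value):
    """Split a GNU gettext LANGUAGE value into its list of fallbacks.

    Hypothetical helper for illustration only.
    """
    # Entries are separated by colons, highest priority first;
    # empty entries (e.g. from "de::fr" or an unset variable) are skipped.
    return [entry for entry in value.split(":") if entry]


fallbacks = split_language("german:french")
print(fallbacks)
```

With LANGUAGE=german:french this gives the two entries "german" and
"french"; a caller looking for an encoding has to look elsewhere.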
> > - it may not know the syntax of the environment variables. For
> > example, the current implementation breaks for "de_DE@euro"; this is
> > an SF bug report.
>
> This should be fixable too. What does the '@euro' mean ? Does it
> have to do with currency ?
In a way. It is a "locale variant". A variant could be just about
anything. Common variants are @euro (used to denote the variant that
has the Euro for LC_MONETARY), @nynorsk (used to tell apart the two
Norwegian languages - now nb and nn), and @xim, used for X Input
Methods (like @xim=kinput2). It could be used for many other things,
too.
You can fix the parsing of the variants, but you cannot infer the
encoding.
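As a rough sketch of what "fixing the parsing" could look like, assuming
the common language[_territory][.codeset][@modifier] layout (which, as
noted elsewhere in this thread, not every platform follows). The function
name parse_locale_name is invented for this example.

```python
def parse_locale_name(name):
    """Split 'language[_territory][.codeset][@modifier]' into its parts.

    Hypothetical helper; assumes the common Unix layout of locale names.
    Missing parts are reported as None.
    """
    # The modifier (variant) comes last, after '@'.
    base, _, modifier = name.partition("@")
    # The codeset, if present, follows the first '.'.
    base, _, codeset = base.partition(".")
    # What remains is language, optionally followed by '_territory'.
    language, _, territory = base.partition("_")
    return {
        "language": language,
        "territory": territory or None,
        "codeset": codeset or None,
        "modifier": modifier or None,
    }


print(parse_locale_name("de_DE@euro"))
print(parse_locale_name("de_DE.UTF-8"))
```

For "de_DE@euro" the codeset comes back as None: the variant is parsed
correctly, but the name simply does not say which encoding is in use.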
> Sure, but you normally only get the locale name and then
> have to make an educated guess for the encoding.
That is my point: This algorithm must guess, and it *will* guess
wrong.
> If the encoding is known (e.g. by looking at the LANG environment
> variable), then that information should override the database
> information.
In this specific case (of the @euro locales), the LANG variable does
not explicitly mention the encoding. So that doesn't help.
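One way to avoid guessing, where the platform supports it, is to let the
C library resolve the locale and then ask it for the codeset. The sketch
below assumes locale.nl_langinfo and locale.CODESET are available (they
are not on Windows) and returns None otherwise. The function name is made
up for this example; this is an illustration, not necessarily the
replacement proposed further down.

```python
import locale


def codeset_from_c_library():
    """Return the codeset the C library reports for the user's locale.

    Hypothetical helper. Returns None if nl_langinfo/CODESET are missing
    or if the environment names a locale the system does not support.
    """
    if not hasattr(locale, "nl_langinfo") or not hasattr(locale, "CODESET"):
        return None
    # Remember the current LC_CTYPE setting so it can be restored.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        # "" means: take the locale from the user's environment.
        locale.setlocale(locale.LC_CTYPE, "")
        return locale.nl_langinfo(locale.CODESET) or None
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


print(codeset_from_c_library())
```

The point of the design is that the environment variables are never
parsed by hand; whatever syntax the platform uses, the C library has
already interpreted it.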
> Hmm, the names returned by getdefaultlocale() and normalize()
> are standards. I wonder what Windows expects to see for
> setlocale().
What standards? Posix? That has never impressed Microsoft. Instead of
"fr_FR.cp1252", they accept "French_France.1252". That may even be
Posix-conforming, though, since Posix allows "<lang>_<country>.<codeset>".
Locale names are *not* standard. An algorithm that assumes that they
are is broken.
> I'd say, it's better than nothing :-)
Yes, that's why I propose to provide a replacement, and then deprecate
the existing function.
Regards,
Martin