[I18n-sig] Passing unicode strings to file system calls

Martin v. Loewis martin@v.loewis.de
18 Jul 2002 17:14:33 +0200


"M.-A. Lemburg" <mal@lemburg.com> writes:

> > You can fix the parsing of the variants, but you cannot infer the
> > encoding.
> 
> Why not? I know that several locales use more than one
> encoding for their script(s),

Which locale, on which system?

> but having at least a hint is better than no information at all.

Where do you get the hint from? And why is it better to guess a
random encoding than to guess "ascii" all the time?

> I've never said that it will always guess right. AFAIK,
> there is no platform independent solution to the problem.
> I am all for adding more support for platform specific
> solutions, though.

For that, I would need to understand the meaning of getdefaultlocale
first. What precisely is it supposed to return? I can understand the
"encoding" part (what encoding is the user likely to use), but what is
the meaning of the "language code" return value? And what can you do
with that result?
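
For reference, POSIX-style locale names follow the pattern
language[_territory][.codeset][@modifier]. The sketch below is only an
illustration of that pattern, not the implementation of
getdefaultlocale; the helper name split_locale_name and the sample
value de_DE@euro are assumptions made for the example. It shows that
the codeset part can simply be absent from the name.

```python
import re

# POSIX/XPG locale name: language[_territory][.codeset][@modifier]
_LOCALE_NAME = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z0-9]+))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)

def split_locale_name(name):
    """Split a locale name into its parts; missing parts are None.

    Hypothetical helper for illustration only.
    """
    m = _LOCALE_NAME.match(name)
    if m is None:
        return None
    return m.groupdict()

# The name carries a language and a territory, but no codeset.
print(split_locale_name("de_DE@euro"))
# Here the codeset is spelled out explicitly.
print(split_locale_name("de_DE.UTF-8"))
```

When the codeset part is None, any encoding derived from the name alone
is a guess based on language and territory.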

> > In this specific case (of the @euro domains), the LANG variable does
> > not explicitly mention the encoding. So that doesn't help.
> 
> It can be used as a hint, e.g. in Germany we use Latin-1 as
> encoding, so that's a good assumption.

That is a wrong assumption. In Germany, we use windows-1252,
iso-8859-1, iso-8859-15, and UTF-8. Many modern Unix installations use
Latin-9 instead of Latin-1, since Latin-1 cannot represent the
currency symbol of the locale.
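
A quick way to see the difference is to encode the euro sign in each
of the four encodings named above. This is a minimal sketch using the
codec names as Python spells them; the helper encode_or_none is a
hypothetical name introduced for the example.

```python
EURO = "\u20ac"  # EURO SIGN

def encode_or_none(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

for name in ("iso-8859-1", "iso-8859-15", "windows-1252", "utf-8"):
    print(name, encode_or_none(EURO, name))
```

Only iso-8859-1 fails here, and the other three each produce a
different byte sequence for the same character, so picking the wrong
one of them silently yields the wrong text.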

> >>I'd say, it's better than nothing :-)
> > Yes, that's why I propose to provide a replacement, and then
> > deprecate the existing function.
> 
> Why a replacement, and what kind of replacement? It should well
> be possible to add more support to the existing APIs and
> perhaps extend them with new ones.

Because the other APIs have different usage constraints. It *is*
possible to find out the user's encoding reliably on many Unix
systems, but you have to invoke setlocale for that to work. Calling
setlocale behind the scenes is bad, so the users have to change their
code.
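
As a sketch of what that change in user code could look like on a Unix
system, assuming the platform provides nl_langinfo with CODESET: the
application calls setlocale itself, once, at startup, and the query
function never touches the locale. The function names below are
hypothetical.

```python
import locale

def codeset_of_current_locale():
    """Report the codeset of the locale that is currently in effect.

    This only queries; it deliberately does not call setlocale.
    """
    return locale.nl_langinfo(locale.CODESET)

def main():
    # The application, not the library, opts in to the user's locale.
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale this system does not support;
        # the process stays in its current (typically "C") locale.
        pass
    return codeset_of_current_locale()

if __name__ == "__main__":
    print(main())
```

Without the setlocale call the process stays in the "C" locale, and the
query reports that locale's codeset rather than the user's.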

Also, this approach only returns the encoding. I don't know what the
"language code" is or how to obtain it - even in a system-specific
way. Fortunately, I don't consider this a problem, since I can't see
why anybody would want that value, either.

Regards,
Martin