translating foreign data

Richard Damon richard.damon at 1
Sat Jun 23 10:42:28 EDT 2018


From: Richard Damon <Richard at Damon-Family.org>

On 6/23/18 9:05 AM, Marko Rauhamaa wrote:
> Richard Damon <Richard at Damon-Family.org>:
>
>> On 6/23/18 8:03 AM, Marko Rauhamaa wrote:
>>> I always know my locale. The locale is tied to the human user.
>> No, it should be tied to the data you are processing.
>    In computing, a locale is a set of parameters that defines the user's
>    language, region and any special variant preferences that the user
>    wants to see in their user interface.
>
>    <URL: https://en.wikipedia.org/wiki/Locale_(computer_software)>
>
> The data should not depend on the locale.
So no one foreign ever gives you data? Note that the Wikipedia article is
focused on the SYSTEM locale, which, yes, should reflect what the user wants in
his interface.
>
>> If an English user is feeding a program Chinese documents, while
>> processing those documents the program should be using the appropriate
>> Chinese Locale.
> Not true.
How else is the program going to understand the Chinese data?
>
>> Again, no, a locale is tied to the data, not the user (unless you want
>> to require the user to translate all data to his locale conventions
>> (without using a program that can use locale information) before
>> providing it to a program. Yes, the default for the interpretation
>> should be the users default/current locale, but you really want them
>> to be able to say I got this file from someone whose locale was
>> different than mine.
> The locale is not directly related to data or data formats. Of course,
> locales leak into data and create the sorry mess we are talking about.
The fact that locale issues leak into data is the reason that the single
immutable global locale doesn't work. You really want to imbue into data
streams what locale their data represents (and use that in some of the later
processing of data from that stream).
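
To make that concrete, here is a minimal Python sketch of tagging each data
source with its own conventions instead of consulting one global locale. The
names NumberFormat, EN_US, DE_DE and parse_number are made up for illustration
(they are not part of the standard library), and the separator values are
assumptions standing in for real locale data.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class NumberFormat:
    """Hypothetical per-source number conventions (not a stdlib type)."""
    decimal_point: str
    thousands_sep: str


# Illustrative values only; a real program would load these from locale data.
EN_US = NumberFormat(decimal_point=".", thousands_sep=",")
DE_DE = NumberFormat(decimal_point=",", thousands_sep=".")


def parse_number(text: str, fmt: NumberFormat) -> Decimal:
    """Parse text using the conventions of the source it came from."""
    # Drop grouping separators first, then normalize the decimal point.
    cleaned = text.strip().replace(fmt.thousands_sep, "")
    cleaned = cleaned.replace(fmt.decimal_point, ".")
    return Decimal(cleaned)
```

With this, parse_number("1.234,56", DE_DE) and parse_number("1,234.56", EN_US)
give the same value, and neither call touches the process-wide locale.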
>
>> Data presented to the user should normally use his locale (unless he
>> has specified something different).
> Ok. Here's a value for you:
>
>     100€
>
> I see '1', '0', '0', '€'. What do you see in your locale (LC_MONETARY)?
If I processed that on my system I would either get $100, or an error of wrong
currency symbol depending on the error checking.
>
>>> BTW, I think the locale is a terrible invention.
>> The locale is a lot better than the alternative, where every
>> application that needs to deal with internationalization needs to
>> recreate (and debug) all of the mechanism. I agree it isn't perfect,
>> and for small simple programs it would be nice to be able to say "I
>> don't want all this stuff, make it go away".
> The locale doesn't solve a single problem in practice and often trips up
> programs. For example, a customer-visible bug was once caused by:
>
>    sort <identifiers.txt
>
> producing different results on different customers' machines.
>
> Mental note: *always* prefix GNU textutils commands with LANG=C.
Yes, one issue is that systems currently don't naturally tag data with the
locale to use (you can't even know for sure what character set a file is in, so
your example above might be 100 followed by some funny character(s)). It is
starting to be true that you can often assume UTF-8 (at least on Linux; on
Windows it is much less so), and a file that validates as UTF-8 is a pretty
good sign that it really is UTF-8.
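
As a rough sketch of that validation check in Python: a strict decode either
succeeds or raises UnicodeDecodeError. The function name looks_like_utf8 is my
own, and note that pure ASCII passes trivially, so a True result is only
evidence, not proof.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

A euro sign encoded as UTF-8 passes this check, while the same character
encoded as cp1252 (the single byte 0x80) fails it.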
>
>> Python took its locale (at least initially) from C, which was a single
>> global which does have more issues because of this.
> The single global is due to what the locale was introduced for. It came
> about around the time when Unix applications were being made "8-bit
> clean." Along with UCS-2 and XML, it's one of those things you wish
> you'd never have to deal with.
>
>
> Marko

Locale predates UCS-2. It was the early attempt to provide internationalization
to C code, so that even programmers who didn't think about it could add the line
setlocale(LC_ALL, "") and make their code work at least mostly right in more
places. A single global was quick and simple, and since threads didn't exist,
it was not an issue.

In many ways it was the first attempt that should have been thrown away, but
it got too intertwined. C++ made a significant improvement to it by having
streams remember their own locale.

--
Richard Damon



