LANG, locale, unicode, setup.py and Debian packaging

Sun Jan 13 02:30:07 EST 2008

Martin,
I really appreciate your reply. I have been working in a vacuum on this and 
without any experience. I hope you don't mind if I ask you a bunch of 
questions. If I can get over some conceptual 'humps' then I'm sure I can 
produce a better app.

> That's a bug in the app. It shouldn't assume that environment variables
> are UTF-8. Instead, it should assume that they are in the locale's
> encoding, and compute that encoding with locale.getpreferredencoding.
I see what you are saying and agree, and I am confused about files and 
filenames. My app has to handle font files which can come from anywhere. If 
the locale (locale.getpreferredencoding) returns something like "ANSI" and I 
am doing an os.listdir() then I lose the plot a little...

 It seems to me that filenames are like snapshots of the locales where they 
originated. If there's a font file from India and I want to open it on my 
system in South Africa (and I have LANG=C) then it seems that it's impossible 
to do. If I access the filename it throws a unicodeDecodeError. If I 
use 'replace' or 'ignore' then I am mangling the filename and I won't be able 
to open it.

 The same goes for adding 'foreign' filenames to paths with any kind of string 
operation. 

 My (admittedly uninformed) conception is that by forcing the app to always 
use utf8 I can access any filename in any encoding. The problem seems to be 
that I cannot know *what* encoding (and I get encode/decode mixed up still, 
very new to it all) that particular filename is in.

Am I right? Wrong? Deluded? :) Please fill me in.

> If you print non-ASCII strings to the terminal, and you can't be certain
> that the terminal supports the encoding in the string, and you can't
> reasonably deal with the exceptions, you should accept moji-bake, by
> specifying the "replace" error handler when converting strings to the
> terminal's encoding.
I went through this exercise recently and had no joy. It seems the string I 
chose to use simply would not render - even under 'ignore' and 'replace'. 
It's really frustrating because I don't speak a non-ascii language and so 
can't know if I am testing real-world strings or crazy Tolkein strings.

Another aspect of this is wxPython. My app requires the unicode build so that 
strings have some hope of displaying on the widgets. If I access a font file 
and fetch the family name - that can be encoded in any way, again unknown, 
and I want to fetch it as 'unicode' and pass it to the widgets and not worry 
about what's really going on. Given that, I thought I'd extend the 'utf8' 
only concept to the app in general. I am sure I am wrong, but I feel cornered 
at the moment.

> > 3. I made the decision to check the locale and stop the app if the return
> > from getlocale is (None,None).
> I would avoid locale.getlocale. It's a pointless function (IMO).
Could you say why?

Here's my use of it:
locale.setlocale( locale.LC_ALL, "" )
loc = locale.getlocale()[0]
if loc == None:
    loc = locale.getlocale()
if loc == (None, None):
    print localeHelp   # not utf-8 (I think)
    raise SystemExit
# Now gettext
domain = "all"
gettext.install( domain, localedir, unicode = True )
lang = gettext.translation(domain, localedir, languages = [loc] )
lang.install(unicode = True )

So, I am using getlocale to get a tuple/list (easy, no?) to pass to the 
gettext.install function.

> Your program definitely, absolutely must work in the C locale. Of
> course, you cannot have any non-ASCII characters in that locale, so
> deal with it.
This would mean cutting-out a percentage of the external font files that can 
be used by the app. Is there no modern standard regarding the LANG variable 
and locales these days? My locale -a reports a bunch of xx_XX.utf8 locales. 
Does it even make sense to use a non-utf8 locale anymore?

> If you have solved that, chances are high that it will work in other
> locales as well (but be sure to try Turkish, as that gives a
> surprising meaning to "I".lower()).
Oh boy, this gives me cold chills. I don't have the resources to start 
worrying about every single language's edge-cases. This is kind of why I was 
leaning towards a "use a utf8 locale please" approach.

\d

-- 
Fonty Python and other dev news at:
http://otherwiseingle.blogspot.com/