LANG, locale, unicode, setup.py and Debian packaging

Sun Jan 13 06:26:17 EST 2008

> I have found that os.listdir() does not always return unicode objects when 
> passed a unicode path. Sometimes "byte strings" are returned in the list, 
> mixed-in with unicodes.

Yes. It does so when it fails to decode the byte string according to the
file system encoding (which, in turn, is based on the locale).
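
To see which encoding Python derives from the current locale, something
like the following can be run once under each LANG setting of interest.
It is only a sketch; the helper name describe_filename_encoding is made
up for illustration.

```python
import locale
import sys

def describe_filename_encoding():
    """Return the encodings Python derives from the current locale."""
    return {
        # Encoding used to convert between Unicode and byte file names.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding the locale prefers for text data in general.
        "preferred": locale.getpreferredencoding(False),
    }

print(describe_filename_encoding())
```

Running it with LANG=C and again with a UTF-8 locale should show how much
the result depends on the environment of the process.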

> I will try the technique given 
> on: http://www.pyzine.com/Issue008/Section_Articles/article_Encodings.html#guessing-the-encoding
> Perhaps that will help.

I would advise against such a strategy. Instead, you should first
understand what the encodings of the file names actually *are*, on
a real system, and draw conclusions from that.

> I gather you mean that I should get a unicode path, encode it to a byte string 
> and then pass that to os.listdir
> Then, I suppose, I will have to decode each resulting byte string (via the 
> detect routines mentioned in the link above) back into unicode - passing 
> those I simply cannot interpret.

That's what I meant, yes. Again, you have a number of options - passing
over those that you cannot interpret is but one option. Another option
is to accept mojibake (names displayed with the wrong characters).
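
A minimal sketch of those two options, assuming the directory is listed
with a byte string path so that every name comes back as a byte string.
The names decode_name and list_decoded and the on_error values "skip"
and "mojibake" are hypothetical, chosen for illustration only.

```python
import os
import sys

def decode_name(raw, encoding, on_error="skip"):
    """Decode one byte string file name; return None if it is skipped."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        if on_error == "mojibake":
            # Latin-1 maps every byte to a character, so this never fails,
            # but the characters shown may not be the ones intended.
            return raw.decode("latin-1")
        return None

def list_decoded(path, encoding=None, on_error="skip"):
    """List path as bytes, then decode each name explicitly."""
    encoding = encoding or sys.getfilesystemencoding() or "ascii"
    if not isinstance(path, bytes):
        path = path.encode(encoding)
    names = (decode_name(raw, encoding, on_error) for raw in os.listdir(path))
    return [name for name in names if name is not None]
```

Doing the decoding yourself means the result is uniform: every entry is
either a Unicode string or absent, never a mixture of the two types.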

>> Then, if the locale's encoding cannot decode the file names, you have
>> several options
>> a) don't try to interpret the file names as character strings, i.e.
>>    don't decode them. Not sure why you need the file names - if it's
>>    only to open the files, and never to present the file name to the
>>    user, not decoding them might be feasible
> So, you reckon I should stick to byte-strings for the low-level file open 
> stuff? It's a little complicated by my using Python Imaging to access the 
> font files. It hands it all over to Freetype and really leaves my sphere of 
> savvy.
> I'll do some testing with PIL and byte-string filenames. I wish my memory was 
> better, I'm pretty sure I've been down that road and all my results kept 
> pushing me to stick to unicode objects as far as possible.

I would be surprised if PIL/freetype would not support byte string file
names if you read those directly from the disk. OTOH, if the user has
selected/typed a string at a GUI, and you encode that - I can easily
see how that might have failed.

>> That's correct, and there is no solution (not in Python, not in any
>> other programming language). You have to make trade-offs. For that,
>> you need to analyze precisely what your requirements are.
> I would say the requirements are:
> 1. To open font files from any source (locale.)
> 2. To display their filename on the gui and the console.
> 3. To fetch some text meta-info (family etc.) via PIL/Freetype and display 
> same.
> 4. To write the path and filename to text files.
> 5. To make soft links (path + filename) to another path.
> 
> So, there's a lot of unicode + unicode and os.path.join and so forth going on.

I notice that this doesn't include "to allow the user to enter file
names", so it seems there is no input of file names, only output.

Then I suggest the technique of keeping byte string/Unicode string
pairs. Use the Unicode string for display, and the byte string for
accessing the disk.
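
A rough sketch of such pairs, assuming the "replace" error handler is
acceptable for the display half. The function names make_pair and
name_pairs are invented for this example.

```python
import os
import sys

def make_pair(raw, encoding):
    """Return (raw, text): raw for disk access, text for display only."""
    # "replace" never raises; undecodable bytes become U+FFFD in the text.
    return raw, raw.decode(encoding, "replace")

def name_pairs(directory, encoding=None):
    """Return a (bytes, text) pair for every entry in directory.

    directory must be a byte string so that os.listdir() returns the
    names as byte strings, exactly as they are stored on disk.
    """
    encoding = encoding or sys.getfilesystemencoding() or "ascii"
    return [make_pair(raw, encoding) for raw in os.listdir(directory)]
```

The raw half is what you would hand to open(), os.path.join() with
another byte string, or os.symlink(); the text half goes to the GUI, the
console and the text files. The display string must never be encoded
again to reach the file, since the replacement is lossy.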

>>> I went through this exercise recently and had no joy. It seems the string
>>> I chose to use simply would not render - even under 'ignore' and
>>> 'replace'.
>> I don't understand what "would not render" means.
> I meant it would not print the name, but constantly throws ascii related 
> errors.

That cannot be. Both the ignore and the replace error handlers will
silence all encoding and decoding errors in the codec they are passed to.

> I don't know if the character will survive this email, but the text I was 
> trying to display (under LANG=C) in a python script (not the immediate-mode 
> interpreter) was: "MÖgul". The second character is a capital O with an umlaut 
> (double-dots I think) above it. For some reason I could not get that to 
> display as "M?gul" or "Mgul".

I see no problem with that:

>>> u"M\xd6gul".encode("ascii","ignore")
'Mgul'
>>> u"M\xd6gul".encode("ascii","replace")
'M?gul'
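
The same handlers work in the decoding direction, which matters when the
name arrives as a byte string instead of a Unicode object. A small
sketch, with safe_text being a made-up helper name:

```python
def safe_text(name, encoding="ascii", errors="replace"):
    """Return name as text that is guaranteed to be encodable in encoding."""
    if isinstance(name, bytes):
        # Byte string input: decode first, with the same error handler.
        name = name.decode(encoding, errors)
    # Round-trip through the target encoding so that anything it cannot
    # represent is replaced or dropped before the text is printed.
    return name.encode(encoding, errors).decode(encoding)

print(safe_text(u"M\xd6gul"))                   # Unicode input
print(safe_text(b"M\xd6gul"))                   # byte string input
print(safe_text(u"M\xd6gul", errors="ignore"))  # drop instead of replace
```

The explicit bytes check is there because the two types need different
first steps: a byte string has to be decoded before it can be encoded.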

Regards,
Martin