Python 3.3, gettext and Unicode problems

Sun Dec 30 21:26:22 EST 2012

On 12/30/2012 8:48 PM, Terry Reedy wrote:
> On 12/30/2012 7:39 PM, Marcel Rodrigues wrote:
>> I'm using Python 3.3 (CPython) and am having trouble getting the
>> standard gettext module to handle Unicode messages.

Addition to previous response.

>> import gettext
>>
>> t = gettext.translation("greeting", "locale", ["pt"])

Reading further, I see that this returns a GNUTranslations instance.

>> _ = t.lgettext

So this calls its method:
'''
GNUTranslations.gettext(message)
Look up the message id in the catalog and return the corresponding 
message string, as a Unicode string. If there is no entry in the catalog 
for the message id, and a fallback has been set, the look up is 
forwarded to the fallback’s gettext() method. Otherwise, the message id 
is returned.

GNUTranslations.lgettext(message)
Equivalent to gettext(), but the translation is returned as a bytestring 
encoded in the selected output charset, or in the preferred system 
encoding if no encoding was explicitly set with set_output_charset().
'''
So if you want the unicode translation to be utf-8 encoded, either use 
.gettext and encode it yourself, or use "t.set_output_charset('utf-8')" 
to have it done automatically.
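
A minimal sketch of the first option (call .gettext, then encode by hand
only if bytes are really needed). It assumes the same "greeting" domain
and "locale" directory as the quoted code; fallback=True is an addition
of mine so the snippet still runs when no compiled catalog is found. It
avoids lgettext and set_output_charset, which were removed in Python
3.11, so it should work there as well.

```python
import gettext

# fallback=True is an assumption added for this sketch: if no compiled
# catalog exists under ./locale, translation() returns a NullTranslations
# object (which hands back the message id) instead of raising an error.
t = gettext.translation("greeting", "locale", ["pt"], fallback=True)

message = t.gettext("hello")       # always a str (Unicode text)
encoded = message.encode("utf-8")  # explicit bytes, only where bytes are required

print(message)  # print() does the encoding for the terminal itself
```

With a real Portuguese catalog in place, message would hold the
translated text rather than "hello"; the encoding step is the same.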

>> print("_charset = {0}\n".format(t._charset))
>> print(_("hello"))

But since you are printing to the screen, I suggest using .gettext and
letting print do the encoding to the screen encoding. If that still
raises an encoding error, then the problem is the console emulator. On
Windows, for instance, IDLE windows handle the entire BMP (Basic
Multilingual Plane) while the stupid Windows Command Prompt window does
not (certainly not by default, and not yet, as far as I know).
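
If the console is the limiting factor, one possible workaround is to
degrade the text instead of crashing. The helper names below
(printable, safe_print) are made up for this sketch and are not part of
gettext or the standard library.

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Fall back to ASCII if stdout reports no encoding (e.g. when redirected).
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)

def safe_print(text):
    """Print text, substituting '?' for characters the console cannot show."""
    try:
        print(text)
    except UnicodeEncodeError:
        # The console encoding cannot represent some character.
        print(printable(text))
```

For example, printable("Ol\xe1", "ascii") gives "Ol?", while a UTF-8
console gets the string unchanged.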

The encoding of the translations file on disk determines how the bytes 
of the translation table should be *decoded* when read, to create unicode 
strings. It does not determine how those strings should be *encoded* 
when sent to a particular destination. That may depend on the 
destination. Multilingual international sites used to encode pages in 
different limited national encodings, according to the language and 
destination. Now many encode and send *everything* as utf-8. I think 
this is the proper policy now. .lgettext seems oriented to the older, 
pre-utf-8 way of doing things, with national locale encodings.
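
To make the decode/encode distinction concrete, here is a small sketch.
The byte string is a made-up sample standing in for what a catalog
stored as ISO-8859-1 might contain; no real .mo file is read.

```python
# Hypothetical sample: bytes as an ISO-8859-1 catalog might store them.
raw = b"Ol\xe1, mundo"

# Decoding is governed by the charset of the file the bytes came from.
text = raw.decode("iso-8859-1")

# Encoding is chosen per destination and is independent of the file charset.
as_utf8 = text.encode("utf-8")          # e.g. for a web page sent as utf-8
as_latin1 = text.encode("iso-8859-1")   # e.g. for a legacy destination
```

The same str yields different bytes for each destination, which is why
the catalog's charset says nothing about what the output should be.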

-- 
Terry Jan Reedy