Python 3.3, gettext and Unicode problems

Sun Dec 30 20:48:08 EST 2012

On 12/30/2012 7:39 PM, Marcel Rodrigues wrote:
> I'm using Python 3.3 (CPython) and am having trouble getting the
> standard gettext module to handle Unicode messages.

I have never even looked at the doc before, but I will take a look.

> My problem can be isolated as follows:
>
> I have 3 files in a folder: greeting.py, greeting.po and msgfmt.py.
>
> -- greeting.py --
> import gettext
>
> t = gettext.translation("greeting", "locale", ["pt"])
> _ = t.lgettext

gettext.lgettext(message)
Equivalent to gettext(), but the translation is returned in the 
preferred system encoding, if no other encoding was explicitly set with 
bind_textdomain_codeset().

Given that 'preferred system encoding' apparently means
'locale.getpreferredencoding()', and that this does not seem to be what
you want, why are you using the 'l' version?

>
> print("_charset = {0}\n".format(t._charset))
> print(_("hello"))

A strong suggestion: whenever you want to print a string and the
computation of the string (or bytes) involves encoding/decoding,
separate the computation and the printing (on two separate lines).

s = _("hello")
print(s)

The reason is that printing also requires encoding for the output device 
and that process can also generate a UnicodeError that may be hard to 
distinguish from an error in the computation of s itself.
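
A minimal sketch of why the separation helps, assuming an output device
that only accepts ASCII. The io.TextIOWrapper is just a stand-in for such
a terminal, and the literal stands in for the _("hello") call. Neither is
from your script.

```python
import io

# Step 1: compute the string. Nothing is encoded here, so this line cannot
# raise a UnicodeEncodeError. (A literal stands in for _("hello").)
s = "ol\xe1"

# Step 2: print it. The stream below is a stand-in for an output device
# that only accepts ASCII, such as a terminal in an ASCII locale.
ascii_device = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    print(s, file=ascii_device)
    ascii_device.flush()
    print_failed = False
except UnicodeEncodeError as exc:
    print_failed = True
    print("printing failed, not the computation:", exc.reason)
```

With the two steps on separate lines, the line number in a traceback says
which of them raised.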

> -- EOF --
>
> -- greeting.po --
> msgid ""
> msgstr ""
> "Project-Id-Version: 1.0\n"
> "MIME-Version: 1.0\n"
> "Content-Type: text/plain; charset=UTF-8\n"
> "Content-Transfer-Encoding: 8bit\n"
>
> msgid "hello"
> msgstr "olá"
> -- EOF --
>
> msgfmt.py was downloaded from
> http://hg.python.org/cpython/file/9e6ead98762e/Tools/i18n/msgfmt.py,
> since this tool apparently isn't included in the python3 package
> available on Arch Linux official repositories.
>
> It's probably also worth noting that the file greeting.po is encoded
> itself as UTF-8.
>
> From that folder, I run the following commands:
>
> $ mkdir -p locale/pt/LC_MESSAGES
> $ python msgfmt.py -o !$/greeting.mo greeting.po
> $ python greeting.py
>
> The output is:
> _charset = UTF-8
>
> Traceback (most recent call last):
>    File "greeting.py", line 7, in <module>
>      print(_("hello"))
>    File "/usr/lib/python3.3/gettext.py", line 314, in lgettext
>      return tmsg.encode(locale.getpreferredencoding())
> UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in
> position 2: ordinal not in range(128)

In previous posts here, we have seen this exact error generated during
printing rather than during the string computation, and posters have
wasted time looking for the error in the string or bytes computation
itself.

> My interpretation of this output is that even though gettext correctly
> detects the MO file charset as UTF-8, it tries to encode the translated
> message with the system's "preferred encoding", which happens to be ASCII.

Just as you seem to have requested ;-)
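
The last frame of the traceback shows the step that fails: lgettext calls
tmsg.encode(locale.getpreferredencoding()). Here is a small sketch of that
one step in isolation, with the encoding passed in explicitly.
encode_like_lgettext is a made-up helper name, not part of gettext.

```python
import locale

def encode_like_lgettext(tmsg, encoding):
    """Hypothetical helper: the encode step from the traceback, in isolation."""
    return tmsg.encode(encoding)

tmsg = "ol\xe1"  # the translated message as a str

# With UTF-8 the encode succeeds and returns bytes.
utf8_bytes = encode_like_lgettext(tmsg, "utf-8")

# With ASCII it fails in the same way as in the report.
try:
    encode_like_lgettext(tmsg, "ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# The encoding lgettext would pick on the machine running this sketch.
print(locale.getpreferredencoding())
```

Note that even when the encode succeeds, the result is bytes rather than
str, so passing it to print() would show the bytes repr and not the text.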

> Anyone know why this happens? Is this a bug on my code? Maybe I have
> misunderstood gettext...

You used lgettext (l = locale). As I said, I am new to this.

-- 
Terry Jan Reedy