[I18n-sig] ugettext charset

Sat Jun 26 22:06:53 CEST 2010

Thanks, Martin!

I understand where my issue is now: I'm mixing Unicode strings with 8-bit
strings. The latter are equivalent to ASCII if they contain no bytes
above 127, but if they do, they cannot be converted to Unicode using the
default ASCII encoding, so the encoding has to be given explicitly.
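
To make that concrete, here is a minimal sketch of the decoding rule. The
variable names are made up for illustration, and it is written with
explicit decode() calls so that it behaves the same on Python 2.6+ and
Python 3:

```python
# Hypothetical sample data: one pure-ASCII byte string and one holding
# the UTF-8 encoding of U+0101 (LATIN SMALL LETTER A WITH MACRON).
ascii_bytes = b'abc'
utf8_bytes = b'\xc4\x81'

# Pure-ASCII bytes decode fine with the ASCII codec.
ascii_text = ascii_bytes.decode('ascii')

# Bytes above 127 cannot be decoded as ASCII.
try:
    utf8_bytes.decode('ascii')
    decoded_with_ascii = True
except UnicodeDecodeError:
    decoded_with_ascii = False

# Naming the real encoding makes the conversion work.
text = utf8_bytes.decode('utf-8')
```

On Python 2 the ASCII decode is what happens implicitly whenever an 8-bit
string meets a Unicode string, which is why only the non-ASCII case fails.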

I was basically doing something like this (where 'a' comes from
gettext):
a = unicode('ā', encoding='utf-8')                  # a Unicode string
b = unicode('ā', encoding='utf-8').encode('utf-8')  # an 8-bit string again
a+b                                                 # implicit ASCII decode of b fails
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)

<ipython console> in <module>()

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

But 'b' is no longer a Unicode string after encode(), so encode() should
be called only right before writing to a file.
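
A rough sketch of what I think the right pattern looks like: keep
everything as Unicode internally and encode only at the file boundary.
The names and the temporary file are just for illustration, and it
catches TypeError as well, because Python 3 refuses the mix outright
instead of attempting an ASCII decode:

```python
import io
import os
import tempfile

a = u'\u0101'          # Unicode text, like what ugettext() returns
b = a.encode('utf-8')  # 8-bit string: the bytes 0xc4 0x81

# Mixing the two fails: UnicodeDecodeError on Python 2 (implicit ASCII
# decode), TypeError on Python 3 (no implicit conversion at all).
try:
    a + b
    mixed_ok = True
except (UnicodeDecodeError, TypeError):
    mixed_ok = False

# Decode the bytes explicitly so both operands are Unicode.
combined = a + b.decode('utf-8')

# Encode only when writing; io.open does it for us here.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(combined)
with io.open(path, 'rb') as handle:
    written = handle.read()
os.remove(path)
```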

Cheers,
Reinis

On Sat, 2010-06-26 20:37 +0200, "Martin v. Löwis" wrote:
> > I'm using Python 2.6.5 and gettext. Currently ugettext() and ungettext()
> > don't respect the 'codeset' setting
> 
> Of course not. It returns Unicode strings instead.
> 
> > and return only ASCII encoded strings.
> 
> I can't reproduce that. It certainly returns non-ASCII strings.
> 
> > Is it by design or is it a bug?
> 
> I think you misinterpret what you are seeing (although it's not really
> clear what it is that you are seeing). AFAICT, the current behavior is
> by design.
> 
> > This breaks some things, because, ASCII encoded unicode strings
> 
> This doesn't make sense. Unicode strings *cannot* be ASCII-encoded.
> They are always Unicode-encoded - that's why they are called unicode
> strings.
> 
> > are not
> > considered equivalent to unicode strings in different encodings even if
> > they contain exactly the same characters.
> 
> Unicode strings don't have different encodings. They are encoded in
> Unicode.
> 
> > And unicode() function by
> > default returns ASCII encoded strings. In this case it should get an
> > argument for encoding.
> 
> The call to unicode only applies to the msgid, not the translation.
> This should be safe, since the msgid will only contain ASCII characters.
> 
> Regards,
> Martin