From rei4dan at gmail.com  Sat Jun 26 20:02:25 2010
From: rei4dan at gmail.com (My Th)
Date: Sat, 26 Jun 2010 21:02:25 +0300
Subject: [I18n-sig] ugettext charset
Message-ID: <1277575345.17839.273.camel@RD-PC>

Hi!

I'm using Python 2.6.5 and gettext. Currently ugettext() and ungettext()
don't respect 'codeset' setting and return only ASCII encoded strings.
Is it by design or is it a bug? It seems that there is no issue about
this in the issue tracker. And as it is now, it contradicts the
documentation, which says: "If provided, codeset will change the charset
used to encode translated strings".

This breaks some things, because, ASCII encoded unicode strings are not
considered equivalent to unicode strings in different encodings even if
they contain exactly the same characters. And unicode() function by
default returns ASCII encoded strings. In this case it should get an
argument for encoding.

Cheers,
Reinis

From martin at v.loewis.de  Sat Jun 26 20:37:36 2010
From: martin at v.loewis.de ("Martin v. Löwis")
Date: Sat, 26 Jun 2010 20:37:36 +0200
Subject: [I18n-sig] ugettext charset
In-Reply-To: <1277575345.17839.273.camel@RD-PC>
References: <1277575345.17839.273.camel@RD-PC>
Message-ID: <4C2648F0.3090805@v.loewis.de>

> I'm using Python 2.6.5 and gettext. Currently ugettext() and ungettext()
> don't respect 'codeset' setting

Of course not. It returns Unicode strings instead.

> and return only ASCII encoded strings.

I can't reproduce that. It certainly returns non-ASCII strings.

> Is it by design or is it a bug?

I think you misinterpret what you are seeing (although it's not really
clear what it is that you are seeing). AFAICT, the current behavior is
by design.

> This breaks some things, because, ASCII encoded unicode strings

This doesn't make sense. Unicode strings *cannot* be ASCII-encoded.
They are always Unicode-encoded - that's why they are called unicode
strings.
> are not
> considered equivalent to unicode strings in different encodings even if
> they contain exactly the same characters.

Unicode strings don't have different encodings. They are encoded in
Unicode.

> And unicode() function by
> default returns ASCII encoded strings. In this case it should get an
> argument for encoding.

The call to unicode only applies to the msgid, not the translation.
This should be safe, since the msgid will only contain ASCII characters.

Regards,
Martin

From rei4dan at gmail.com  Sat Jun 26 22:06:53 2010
From: rei4dan at gmail.com (My Th)
Date: Sat, 26 Jun 2010 23:06:53 +0300
Subject: [I18n-sig] ugettext charset
In-Reply-To: <4C2648F0.3090805@v.loewis.de>
References: <1277575345.17839.273.camel@RD-PC> <4C2648F0.3090805@v.loewis.de>
Message-ID: <1277582813.17839.293.camel@RD-PC>

Thanks, Martin!

I understood where my issue is: I'm mixing Unicode strings with 8-bit
strings. The latter are equivalent to ASCII if they don't contain any
higher code points, but if they do, they cannot be converted to Unicode
using the ASCII encoding (the default); the encoding has to be given
explicitly.

I was basically doing something like this (where 'a' comes from
gettext):

a = unicode('?', encoding='utf-8')
b = unicode('?', encoding='utf-8').encode('utf-8')
a+b
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)

in ()

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

But 'b' is not a Unicode string anymore after encode(); that should be
called only just before writing to a file.

Cheers,
Reinis

On Sat, 2010-06-26 at 20:37 +0200, "Martin v. Löwis" wrote:
> > I'm using Python 2.6.5 and gettext. Currently ugettext() and ungettext()
> > don't respect 'codeset' setting
>
> Of course not. It returns Unicode strings instead.
>
> > and return only ASCII encoded strings.
>
> I can't reproduce that. It certainly returns non-ASCII strings.
>
> > Is it by design or is it a bug?
>
> I think you misinterpret what you are seeing (although it's not really
> clear what it is that you are seeing). AFAICT, the current behavior is
> by design.
>
> > This breaks some things, because, ASCII encoded unicode strings
>
> This doesn't make sense. Unicode strings *cannot* be ASCII-encoded.
> They are always Unicode-encoded - that's why they are called unicode
> strings.
>
> > are not
> > considered equivalent to unicode strings in different encodings even if
> > they contain exactly the same characters.
>
> Unicode strings don't have different encodings. They are encoded in
> Unicode.
>
> > And unicode() function by
> > default returns ASCII encoded strings. In this case it should get an
> > argument for encoding.
>
> The call to unicode only applies to the msgid, not the translation.
> This should be safe, since the msgid will only contain ASCII characters.
>
> Regards,
> Martin
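
The resolution reached in this thread (keep text as Unicode, decode byte strings explicitly, and encode only at the output boundary) can be sketched roughly as follows. This is an illustration, not code from any of the posters: the character U+0101 is an assumed stand-in for the non-ASCII character that did not survive in the archived example, the variable names are hypothetical, and ugettext() itself is not called. The sketch is written so that it runs on both Python 2.6+ and Python 3, which fail differently when the two string types are mixed.

```python
# Sketch only: U+0101 is an assumed stand-in for the character lost from the
# archived message, and no gettext call is made here.

text = u"\u0101"             # Unicode string, like a ugettext() return value
data = text.encode("utf-8")  # 8-bit byte string: the bytes 0xc4 0x81

error_name = None
try:
    combined = text + data   # mixing Unicode text with encoded bytes
except (UnicodeDecodeError, TypeError) as exc:
    # Python 2 attempts an implicit ASCII decode of `data` and raises
    # UnicodeDecodeError; Python 3 refuses to mix the types (TypeError).
    error_name = type(exc).__name__

# Decode the bytes explicitly, then concatenate Unicode with Unicode.
combined = text + data.decode("utf-8")

# Encode once, at the output boundary (for example just before writing a file).
encoded_for_output = combined.encode("utf-8")
```

Under Python 2 the failing concatenation is the same kind of implicit ASCII decode reported in the traceback above; decoding explicitly with the right codec avoids it.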