How to print first (national) char from unicode string encoded in utf-8?
Mark Tolonen
M8R-yfto6h at mailinator.com
Tue Sep 2 00:05:28 EDT 2008
"Marco Bizzarri" <marco.bizzarri at gmail.com> wrote in message
news:mailman.331.1220276398.3487.python-list at python.org...
> On Mon, Sep 1, 2008 at 3:25 PM, <sniipe at gmail.com> wrote:
>
>>
>> When I do ${urllib.unquote(c.user.firstName)} without encoding to
>> latin-1 I get different chars than I expect: not Łukasz but Å ukasz
>
> That's crazy. "string".encode('latin1') gives you a latin1 encoded
> string; latin1 is a single byte encoding, therefore taking the first
> byte should be no problem.
>
> Have you tried:
>
> urllib.unquote(c.user.firstName)[0].encode('latin1') or
>
> urllib.unquote(c.user.firstName)[0].encode('utf8')
>
> I'm assuming here that urllib.unquote(c.user.firstName) returns an
> encodable string (which I'm not at all sure of), but if it does, this
> should take the first 'character'.
The OP stated that the original string was "encoded in UTF-8 and
urllib.quote()", so after urllib.unquote the result is a byte string
holding UTF-8 data. It must be decoded into a Unicode string before
taking the first character; otherwise [0] returns only the first byte
of a multi-byte character:

    urllib.unquote(c.user.firstName).decode('utf-8')[0]
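
As a rough sketch of why the decode has to come first, here is the same
idea on Python 3 (an assumption on my part; this thread uses Python 2,
where these functions live in urllib rather than urllib.parse). The
variable names are made up for illustration. Slicing the raw bytes yields
only the first byte of the two-byte sequence, while decoding first yields
the whole character:

```python
from urllib.parse import quote, unquote_to_bytes

# '\u0141ukasz' is 'Łukasz'; the escape keeps this source file pure ASCII.
quoted = quote('\u0141ukasz'.encode('utf-8'))   # percent-encode the UTF-8 bytes
raw = unquote_to_bytes(quoted)                  # back to the raw UTF-8 bytes

first_byte = raw[:1]                  # only half of the two-byte character
first_char = raw.decode('utf-8')[0]   # the complete first character

print(repr(quoted), repr(raw), repr(first_byte), repr(first_char))
```
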
The next problem is that the first character of the OP's example string,
'Ł', does not exist in the latin-1 encoding, so it cannot be encoded that
way. Encoding to utf-8 instead shows that the full two-byte UTF-8
character is retrieved:
>>> import urllib
>>> name = urllib.quote(u'Łukasz'.encode('utf-8'))
>>> name
'%C5%81ukasz'
>>> urllib.unquote(name).decode('utf-8')[0].encode('utf-8')
'\xc5\x81'
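
For what it's worth, the 'Å ukasz' the OP saw is consistent with the UTF-8
bytes being read as latin-1. A minimal sketch, assuming Python 3 (not what
this thread uses) and with illustrative variable names, showing both that
'Ł' cannot be encoded as latin-1 and what its UTF-8 bytes look like when
misread as latin-1:

```python
first_char = '\u0141'   # 'Ł', written as an escape to keep the source ASCII

utf8_bytes = first_char.encode('utf-8')   # the two-byte UTF-8 sequence

# latin-1 only covers code points U+0000..U+00FF, so U+0141 is rejected.
try:
    first_char.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Misreading the UTF-8 bytes as latin-1 gives two characters:
# U+00C5 ('Å') followed by the control character U+0081.
misread = utf8_bytes.decode('latin-1')

print(repr(utf8_bytes), latin1_ok, [hex(ord(ch)) for ch in misread])
```

The second misread character is a control code that many displays show as
blank, which would fit the gap in 'Å ukasz'.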
-Mark