How to print first(national) char from unicode string encoded in utf-8?

Mark Tolonen M8R-yfto6h at mailinator.com
Tue Sep 2 00:05:28 EDT 2008


"Marco Bizzarri" <marco.bizzarri at gmail.com> wrote in message 
news:mailman.331.1220276398.3487.python-list at python.org...
> On Mon, Sep 1, 2008 at 3:25 PM,  <sniipe at gmail.com> wrote:
>
>>
>> When I do ${urllib.unquote(c.user.firstName)} without encoding to
>> latin-1 I got different chars than I will get: no Łukasz but Å ukasz
>
> That's crazy. "string".encode('latin1') gives you a latin1 encoded
> string; latin1 is a single byte encoding, therefore taking the first
> byte should be no problem.
>
> Have you tried:
>
> urllib.unquote(c.user.firstName)[0].encode('latin1') or
>
> urllib.unquote(c.user.firstName)[0].encode('utf8')
>
> I'm assuming here that the urllib.unquote(c.user.firstName) returns an
> encodable string (which I'm absolutely not sure), but if it does, this
> should take the first 'character'.

The OP stated that the original string was "encoded in UTF-8 and
urllib.quote()", so after urllib.unquote() the result is a UTF-8 encoded
byte string. It must be decoded into a Unicode string before taking the
first character; indexing the undecoded byte string would return only the
first byte of a multi-byte character:

    urllib.unquote(c.user.firstName).decode('utf-8')[0]
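
To make the difference concrete, here is a minimal sketch that compares
indexing before and after decoding. It assumes the unquoted value is the
byte string for 'Łukasz'; the variable names are illustrative only, and the
byte literal stands in for the result of urllib.unquote(). Decoding the
same bytes as latin-1 is probably what produced the garbled "Å ukasz" the
OP saw.

```python
# -*- coding: utf-8 -*-
# Stand-in for the unquoted value: the UTF-8 bytes of u'Łukasz'.
raw = b'\xc5\x81ukasz'

# Slicing the byte string yields only the first byte of the
# two-byte sequence, which is not a complete character.
first_byte = raw[:1]

# Decoding first turns the bytes into characters, so index 0
# is the whole letter U+0141 (Ł).
first_char = raw.decode('utf-8')[0]

# Misreading the same bytes as latin-1 maps 0xC5 to 'Å' and
# 0x81 to a control character, giving garbled output.
garbled = raw.decode('latin-1')
```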

The next problem is that the first character of the OP's example string,
'Ł' (U+0141), does not exist in the latin-1 encoding, so encoding it to
latin-1 raises UnicodeEncodeError. Encoding to utf-8 instead shows that the
full two-byte UTF-8 sequence for the character is returned:

    >>> import urllib
    >>> name = urllib.quote(u'Łukasz'.encode('utf-8'))
    >>> name
    '%C5%81ukasz'
    >>> urllib.unquote(name).decode('utf-8')[0].encode('utf-8')
    '\xc5\x81'

-Mark



