[portland] Umlauts from another dimension

Scott Garman sgarman at zenlinux.com
Mon Oct 13 00:38:39 CEST 2014


On 10/12/2014 03:33 PM, Sam Thompson wrote:
> The first important concept to understand is that UTF-8 and Unicode are not
> the same thing.
>
> Because you specified coding: utf-8, every plain string literal in the
> Python script is a bytestring encoded as UTF-8.  That is not the same as
> a Python unicode object; it is a bytestring, because you are using Python
> 2.x.
>
> The reason that 'ä' produces two ordinals is that UTF-8 is a
> variable-length encoding.  195 and 164 (0xC3 0xA4) are the two bytes that
> encode 'ä', whose code point is U+00E4 (228).  If you want the Python
> unicode object for the string, use mystring.decode('utf-8') instead of
> 'ascii', because it's not ASCII.
>
> The second important concept is that strings defined within the python
> script may not be the same type as strings read from input, a file, a web
> request, etc.  Where is your input coming from?
>
> If you can be sure your input is utf-8 (and this is a giant leap if you're
> working with web input), convert it to unicode (via .decode()), iterate
> over the unicode sequence and test each character with .islower().
>
> If you can't be sure what encoding your bytestrings are in, check out the
> chardet library on pypi.
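
For anyone reading this in the archive, the difference between the UTF-8
bytes and the code point can be sketched roughly as below. This is only
an illustration, not code from the thread: the variable names are made
up, and the bytes are written as an escaped literal so the snippet does
not depend on the file's coding line. It should behave the same on
Python 2.6+ and Python 3.

```python
# The two UTF-8 bytes that encode a-umlaut (U+00E4), as a bytestring.
raw = b'\xc3\xa4'

# bytearray yields integer byte values on both Python 2 and Python 3.
byte_values = list(bytearray(raw))

# Decoding turns the two bytes into a single unicode character.
text = raw.decode('utf-8')
code_point = ord(text)

print(byte_values)           # the byte values 195 and 164
print(len(raw), len(text))   # two bytes, one character
print(code_point)            # 228, i.e. U+00E4
```

So 195 and 164 describe how the character is stored, while 228 is the
character itself.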

Thanks Sam, this explanation helped to fill in my gaps on bytestrings 
and unicode in python, which until now I've been quite clueless about.
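
The decode-then-test approach from Sam's reply might look something like
the sketch below. It assumes the input really is UTF-8; lowercase_flags
and the sample bytes are hypothetical names chosen for illustration, not
anything from my original script.

```python
def lowercase_flags(raw_bytes, encoding='utf-8'):
    """Hypothetical helper: decode bytes, then test each character."""
    text = raw_bytes.decode(encoding)
    return [(ch, ch.islower()) for ch in text]

# UTF-8 bytes for: a-umlaut, A-umlaut, 'b', 'C'
sample = b'\xc3\xa4\xc3\x84bC'

flags = [is_lower for _, is_lower in lowercase_flags(sample)]
print(flags)

# Decoding the same bytes as ASCII fails, since 0xC3 is outside ASCII.
try:
    sample.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)
```

Testing characters after decoding means islower() sees whole characters
rather than the individual bytes of a multi-byte sequence.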

Scott


