[portland] Umlats from another dimension

Mon Oct 13 00:38:39 CEST 2014

On 10/12/2014 03:33 PM, Sam Thompson wrote:
> The first important concept to understand is that UTF-8 and Unicode are not
> the same thing.
>
> Because you specified coding: utf-8, every plain string literal (one
> without a u prefix) in the python script is a bytestring encoded using
> utf-8.  This is not the same as a python unicode object; because you are
> using python 2.x, it is a bytestring (str).
>
> The reason that 'ä' produces two ordinals is that utf-8 is a
> variable-length encoding.  195 and 164 (0xC3 0xA4) are the two bytes that
> encode 'ä', whose code point is U+00E4 (228).  If you want the python
> unicode object for the string, use mystring.decode('utf-8') instead of
> 'ascii', because it's not ascii.
>
> The second important concept is that strings defined within the python
> script may not be the same type as strings read from input, a file, a web
> request, etc.  Where is your input coming from?
>
> If you can be sure your input is utf-8 (and this is a giant leap if you're
> working with web input), convert it to unicode (via .decode()), iterate
> over the unicode sequence and test each character with .islower().
>
> If you can't be sure what encoding your bytestrings are in, check out the
> chardet library on PyPI, which guesses the encoding from the bytes.

Thanks Sam, this explanation filled in the gaps in my understanding of
bytestrings and unicode in python, which until now I've been quite
clueless about.

Scott