[portland] Umlauts from another dimension

Mon Oct 13 00:33:51 CEST 2014

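The first important concept to understand is that UTF-8 and Unicode are not
the same thing.  Unicode assigns every character a number (its code point);
UTF-8 is one of several ways to encode those numbers as bytes.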

The coding: utf-8 line only tells Python how the source file itself is
encoded.  Because you are using Python 2.x, a plain quoted literal such as
'ä' is still a bytestring: it holds the UTF-8 bytes exactly as they appear
in the file.  That is not the same as a Python unicode object, which is
what your isinstance() check tests for.

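The reason that 'ä' produces two ordinals is that UTF-8 is a variable-width
encoding: a character takes between one and four bytes.  195 and 164 (0xC3
0xA4) are not a code point; they are the two bytes that UTF-8 uses to
encode 'ä', whose code point is U+00E4 (228).  If you want the Python
unicode object for the string, use mystring.decode('utf-8') instead of
'ascii', because the bytes are not ASCII.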
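
A minimal sketch of that difference, in case it helps.  The variable names
are my own, and the bytes are written as escapes so the result does not
depend on how the file is saved.  It should behave the same on Python 2.7
and 3.x:

```python
# The two bytes that UTF-8 uses to encode 'ä' (U+00E4).
raw = b'\xc3\xa4'

# bytearray yields integers on both Python 2 and Python 3.
byte_values = list(bytearray(raw))
print(byte_values)            # the byte values: 195 and 164

# Decoding turns the bytes into text (unicode on 2.x, str on 3.x).
text = raw.decode('utf-8')
print(len(text))              # one character
print(ord(text))              # its code point: 228
```

The bytestring has length two, but the decoded text has length one, which
is why looping over the undecoded string printed two numbers.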

The second important concept is that strings defined within the python
script may not be the same type as strings read from input, a file, a web
request, etc.  Where is your input coming from?
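
To illustrate with a file, here is a rough sketch that assumes a small
UTF-8 file; the path and file name are made up for the example.  Reading
in binary mode gives you bytes, while io.open() with an encoding gives you
decoded text:

```python
import io
import os
import tempfile

# Hypothetical sample file containing the UTF-8 bytes for 'ä'.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with io.open(path, 'wb') as f:
    f.write(b'\xc3\xa4')

# Binary mode: you get the raw bytes back, undecoded.
with io.open(path, 'rb') as f:
    as_bytes = f.read()

# Text mode with an explicit encoding: you get decoded text.
with io.open(path, 'r', encoding='utf-8') as f:
    as_text = f.read()

print(len(as_bytes))   # two bytes
print(len(as_text))    # one character
```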

If you can be sure your input is utf-8 (and this is a giant leap if you're
working with web input), convert it to unicode (via .decode()), iterate
over the unicode sequence and test each character with .islower().
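
Something along these lines, as a sketch only: lowercase_chars is a helper
name I made up, and it assumes the input really is UTF-8 (decode() raises
UnicodeDecodeError if it is not).

```python
def lowercase_chars(raw, encoding='utf-8'):
    """Return the lowercase characters found in a bytestring."""
    text = raw.decode(encoding)   # bytes -> text, one item per character
    return [ch for ch in text if ch.islower()]

# "Grüße Äpfel" written out as UTF-8 bytes.
sample = b'Gr\xc3\xbc\xc3\x9fe \xc3\x84pfel'
found = lowercase_chars(sample)

print(len(found))   # number of lowercase characters in the sample
```

Because the test runs on decoded characters rather than bytes, 'ü' and 'ß'
count as lowercase and the capital 'Ä' does not.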

If you can't be sure what encoding your bytestrings are in, check out the
chardet library on PyPI, which guesses the encoding from the bytes
themselves.

On Sun, Oct 12, 2014 at 3:01 PM, Scott Garman <sgarman at zenlinux.com> wrote:

> Hi all,
>
> I'm getting pretty confused by a problem I'm trying to solve in python,
> which is to detect lower-case characters in a string. This would normally
> be a simple regex, but I have to also accept input strings with umlats in
> them, such as 'ä'. I'm using python 2.7.6.
>
> At first I thought this was a unicode problem, but now I'm not so sure.
> About anything.
>
> #!/usr/bin/env python
> # -*- coding: utf-8 -*-
>
> str = 'ä'
>
> if isinstance(str, unicode):
>         print "This is unicode"
>
> Running this tells me that string is *not* unicode. I know that there's a
> thing called extended ASCII, and if I look up a table for that, I see
> characters with accents and umlats:
>
> http://www.asciitable.com/
>
> This table suggests that 'ä' should correspond to an ordinal value of 132.
> But if I run:
>
> #!/usr/bin/env python
> # -*- coding: utf-8 -*-
>
> string = 'ä'
>
> for c in string:
>     print ord(c)
>
> I get:
>
> 195
> 164
>
> which tells me that I'm dealing with a two-byte character, which brings me
> back to this being unicode.
>
> Now looking at which characters in the extended ASCII table correspond to
> those values, I don't see any relation to 'ä'.
>
> Finally, my understanding of python 2.x is that it does not support
> unicode in regexes. Otherwise I'd just use \p{Ll} and have a good deal more
> hair left on my head.
>
> I've also tried forcing the string to ASCII using:
>
> str.decode("ascii", "ignore")
>
> and this is one of those characters that just gets dropped in the
> conversion.
>
> Any insights on what I'm missing would be greatly appreciated.
>
> Thanks,
>
> Scott