[portland] Umlauts from another dimension

Scott Garman sgarman at zenlinux.com
Mon Oct 13 00:01:29 CEST 2014


Hi all,

I'm getting pretty confused by a problem I'm trying to solve in Python,
which is to detect lower-case characters in a string. This would
normally be a simple regex, but I also have to accept input strings with
umlauts in them, such as 'ä'. I'm using Python 2.7.6.

At first I thought this was a Unicode problem, but now I'm not so sure.
About anything.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Named "string" rather than "str" to avoid shadowing the built-in type.
string = 'ä'

if isinstance(string, unicode):
    print "This is unicode"

Running this tells me that the string is *not* unicode. I know that there's
a thing called extended ASCII, and if I look up a table for that, I see
characters with accents and umlauts:

http://www.asciitable.com/

This table suggests that 'ä' should correspond to an ordinal value of 
132. But if I run:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

string = 'ä'

for c in string:
    print ord(c)

I get:

195
164

which tells me that I'm dealing with a two-byte character, which brings 
me back to this being unicode.
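
A minimal sketch of where those two numbers could come from, assuming the
source file is saved as UTF-8 (as the coding line declares) and assuming the
"extended ASCII" table linked above is code page 437, which is a guess based
on 'ä' sitting at 132 there. The variable names are only for illustration,
and the snippet is written to run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# 'ä' as a single Unicode code point (U+00E4).
char = u'\u00e4'

# The code point itself is one number.
code_point = ord(char)

# Encoded as UTF-8, that one code point becomes two bytes.
utf8_bytes = list(bytearray(char.encode('utf-8')))

# Encoded as code page 437 (assumed to be the linked table), it is one byte.
cp437_bytes = list(bytearray(char.encode('cp437')))

print(code_point)    # 228
print(utf8_bytes)    # [195, 164]
print(cp437_bytes)   # [132]
```

If that assumption holds, looping over a Python 2 byte string walks the
UTF-8 bytes one at a time rather than the characters, which would explain
why neither 195 nor 164 lines up with 'ä' in the table.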

Now looking at which characters in the extended ASCII table correspond 
to those values, I don't see any relation to 'ä'.

Finally, my understanding of Python 2.x is that it does not support
Unicode in regexes. Otherwise I'd just use \p{Ll} and have a good deal
more hair left on my head.
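
One possible way to approximate \p{Ll} without regex support is the
standard unicodedata module, which reports the general category of each
character. This is only a sketch: has_lowercase is a made-up helper name,
and it assumes the input has already been decoded to a unicode object
rather than left as raw bytes:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata


def has_lowercase(text):
    """Return True if any character in a Unicode string has category Ll.

    Hypothetical helper; expects a unicode object (Python 2) or str
    (Python 3), not an undecoded byte string.
    """
    return any(unicodedata.category(ch) == 'Ll' for ch in text)


print(has_lowercase(u'\u00e4'))      # 'ä' is a lowercase letter
print(has_lowercase(u'ABC \u00c4'))  # 'Ä' and ASCII capitals only
```

The same check on an undecoded byte string would look at individual bytes
instead of characters, so decoding first matters here.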

I've also tried forcing the string to ASCII using:

string.decode("ascii", "ignore")

and this is one of those characters that just gets dropped in the 
conversion.
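
A small sketch of why the character disappears, assuming the bytes in
question are the UTF-8 encoding of 'ä'. Both bytes are above 127, so the
ASCII codec treats each as invalid and "ignore" silently discards them,
whereas decoding with the codec the bytes were actually written in keeps
the character. The names raw, dropped and decoded are illustrative only:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The two UTF-8 bytes for 'ä', written out explicitly.
raw = b'\xc3\xa4'

# Neither byte is valid ASCII, so "ignore" throws both away.
dropped = raw.decode('ascii', 'ignore')

# Decoding as UTF-8 yields a single Unicode character instead.
decoded = raw.decode('utf-8')

print(len(dropped))       # 0
print(len(decoded))       # 1
print(decoded.islower())  # True
```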

Any insights on what I'm missing would be greatly appreciated.

Thanks,

Scott


