[portland] Umlauts from another dimension
Scott Garman
sgarman at zenlinux.com
Mon Oct 13 00:01:29 CEST 2014
Hi all,
I'm getting pretty confused by a problem I'm trying to solve in Python:
detecting lower-case characters in a string. This would normally be a
simple regex, but I also have to accept input strings with umlauts in
them, such as 'ä'. I'm using Python 2.7.6.
At first I thought this was a unicode problem, but now I'm not so sure.
About anything.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
str = 'ä'
if isinstance(str, unicode):
    print "This is unicode"
Running this tells me that the string is *not* unicode. I know that
there's a thing called extended ASCII, and if I look up a table for
that, I see characters with accents and umlauts:
http://www.asciitable.com/
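A minimal sketch of why that check fails, assuming the file is saved as UTF-8: under Python 2.7 a plain '...' literal is a byte string, and only a u'...' literal (or a decoded byte string) is a unicode object. The names raw, text and text_type are illustrative; text_type is only there so the snippet also runs on Python 3.

```python
# -*- coding: utf-8 -*-
text_type = type(u'')      # unicode on Python 2, str on Python 3

raw = b'\xc3\xa4'          # the bytes a UTF-8 source file holds for 'ä'
text = u'\xe4'             # the single code point U+00E4, same as u'ä'

print(isinstance(raw, text_type))    # False: a byte string
print(isinstance(text, text_type))   # True: a unicode string
print(len(raw))                      # 2 bytes
print(len(text))                     # 1 character
```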
This table suggests that 'ä' should correspond to an ordinal value of
132. But if I run:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
string = 'ä'
for c in string:
    print ord(c)
I get:
195
164
which tells me that I'm dealing with a two-byte character, which brings
me back to this being unicode.
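The two values can be reproduced by encoding the character explicitly. A small sketch, assuming the editor saved the script as UTF-8 (which the coding line suggests): 195 and 164 are 0xC3 and 0xA4, the two-byte UTF-8 encoding of U+00E4. The variable names are illustrative.

```python
encoded = u'\xe4'.encode('utf-8')            # u'\xe4' is 'ä' (U+00E4)
byte_values = list(bytearray(encoded))       # bytearray yields ints on Python 2 and 3

print(byte_values)                           # [195, 164]
print(encoded.decode('utf-8') == u'\xe4')    # True: the round trip is lossless
```

So the loop in the script is iterating over bytes, not characters.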
Now looking at which characters in the extended ASCII table correspond
to those values, I don't see any relation to 'ä'.
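The 132 in that table appears to come from IBM code page 437, one of several incompatible 8-bit "extended ASCII" sets; treating that as the table's encoding is an assumption. A sketch comparing the value of 'ä' in two single-byte encodings, using Python's standard codec names:

```python
char = u'\xe4'   # 'ä' (U+00E4)

values = {}
for codec in ('cp437', 'latin-1'):
    # Encode to bytes, then view the bytes as a list of integers.
    values[codec] = list(bytearray(char.encode(codec)))
    print(codec)
    print(values[codec])
# cp437 gives [132], latin-1 gives [228]
```

Neither matches 195 or 164, because those are bytes of a different encoding rather than entries in either table.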
Finally, my understanding of Python 2.x is that its re module does not
support Unicode property classes in regexes. Otherwise I'd just use
\p{Ll} and have a good deal more hair left on my head.
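One way to get the effect of \p{Ll} without regex support for it is to ask the standard unicodedata module for each character's general category. A hedged sketch, where has_lowercase is a hypothetical helper name and the input is assumed to be already decoded to unicode (a byte string would need .decode() first):

```python
import unicodedata

def has_lowercase(text):
    """Return True if any character is in Unicode category Ll (lowercase letter)."""
    return any(unicodedata.category(ch) == 'Ll' for ch in text)

print(has_lowercase(u'ABC'))       # False
print(has_lowercase(u'AB\xe4'))    # True: u'\xe4' is 'ä'
print(has_lowercase(u'\xc4'))      # False: u'\xc4' is 'Ä'
```

The third-party regex module on PyPI does accept \p{Ll}, if adding a dependency is an option.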
I've also tried forcing the string to ASCII using:
str.decode("ascii", "ignore")
and this is one of those characters that just gets dropped in the
conversion.
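The drop is expected: bytes 195 and 164 are both outside ASCII's 0-127 range, so "ignore" discards them. Decoding with the encoding the bytes were actually written in keeps the character. A sketch, assuming UTF-8 input; the variable names are illustrative:

```python
raw = b'\xc3\xa4bc'                      # UTF-8 bytes for 'äbc'

dropped = raw.decode('ascii', 'ignore')  # non-ASCII bytes are silently discarded
kept = raw.decode('utf-8')               # bytes are interpreted as UTF-8 text

print(len(dropped))                      # 2: only 'bc' survives
print(len(kept))                         # 3: 'ä', 'b', 'c'
print(kept[0] == u'\xe4')                # True
print(kept[0].islower())                 # True
```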
Any insights on what I'm missing would be greatly appreciated.
Thanks,
Scott