[Tutor] German Umlaut

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Mon, 1 Apr 2002 12:29:18 -0800 (PST)


> Reading your email I changed my regex to:
>
> reg = re.compile(r"\b[A-Z]\w+-? und [A-Z]\w+\b",re.UNICODE)
>
> But this still doesn't match nouns like "Übung", i.e. the capital letter
> is an umlaut. How can I deal with that??


Hmmm... Is it possible to relax the regular expression a little?  Instead
of forcing a "capitalized" word, would this be feasible:

###
reg = re.compile(r"\b\w+-? und \w+\b",re.UNICODE)
###
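
To see what the relaxed pattern picks up, here's a small sketch.  The
sample sentence is made up, and I'm assuming Python 3, where str patterns
are already Unicode-aware (so re.UNICODE is redundant there, but harmless):

```python
import re

# Relaxed pattern: any word, an optional trailing hyphen, " und ", any word.
reg = re.compile(r"\b\w+-? und \w+\b", re.UNICODE)

# Made-up sample text containing an umlaut noun and a hyphenated pair.
text = "Ein- und Ausgang, Übung und Spiel, laufen und springen"

matches = reg.findall(text)
print(matches)
```

Note the trade-off: this version finds "Übung und Spiel", but it also
catches lowercase pairs like "laufen und springen".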



Otherwise, we can stuff in 'string.uppercase' in the regular expression.
Here's one way to do it with some string formatting:

###
import string
reg = re.compile(r"\b[%s]\w+-? und [%s]\w+\b"
                 % (string.uppercase, string.uppercase), re.UNICODE)
###


This is probably what you want, since:

###
>>> string.uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ\xc0\xc1\xc2\xc3\xc4\xc5\xc6
\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8
\xd9\xda\xdb\xdc\xdd\xde'
###

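appears to contain all those characters, including '\xdc', which is "Ü" in
Latin-1.  Keep in mind that string.uppercase is locale-dependent, so its
value may be different on your system.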
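
As a rough sketch for newer Pythons: string.uppercase is gone in Python 3,
so one possibility (my assumption, not something from the original
question) is to build a similar character class from the Latin-1 range
with str.isupper().  The names 'upper' and 'text' below are just for
illustration:

```python
import re

# Collect the uppercase letters among the first 256 code points (Latin-1).
# This includes umlauts such as "Ü" but not symbols like the "×" sign.
upper = "".join(chr(c) for c in range(256) if chr(c).isupper())

# Same shape as the pattern above, with the hand-built class substituted in.
reg = re.compile(r"\b[%s]\w+-? und [%s]\w+\b" % (upper, upper))

# Made-up sample text: two capitalized pairs and one lowercase pair.
text = "Übung und Spiel, laufen und springen, Ein- und Ausgang"

matches = reg.findall(text)
print(matches)
```

Here only the capitalized pairs should come back, so "laufen und
springen" is skipped.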


Good luck!  I hope this helps.