a -very- case sensitive search

Sun Nov 26 06:46:06 EST 2006

Paul McGuire wrote:
> "Ola K" <olakh at walla.co.il> wrote in message
> news:1164490795.866046.133230 at 45g2000cws.googlegroups.com...
> > Hi,
> > I am pretty new to Python and I want to make a script that will search
> > for the following options:
> > 1) words made of uppercase characters -only- (like "YES")
> > 2) words made of lowercase characters -only- (like "yes")
> > 3) and words with only the first letter capitalized (like "Yes")
> > * and I need to do all these considering the fact that not all letters
> > are indeed English letters.
> >
> > I went through different documentation sections but couldn't find the right
> > condition, function or method for it.
> > Suggestions will be very much appreciated...
> > --Ola
> >
> Ola,
>
> You may be new to Python, but are you new to regular expressions too?  I am
> no wiz at them, but here is a script that takes a stab at what you are
> trying to do. (For more regular expression info, see
> http://www.amk.ca/python/howto/regex/.)
>
> The script has these steps:
> - create strings containing all unicode chars that are considered "lower"
> and "upper", using the unicode.is* methods
> - use these strings to construct 3 regular expressions (or "re"s), one for
> words of all lowercase letters, one for words of all uppercase letters, and
> one for words that start with an uppercase letter followed by at least one
> lowercase letter.
> - use each re to search the string u"YES yes Yes", and print the found
> matches
>
> I've used unicode strings throughout, so this should be applicable to your
> text consisting of letters beyond the basic Latin set (since Outlook Express
> is trying to install Israeli fonts when reading your post, I assume these
> are the characters you are trying to handle).

I'd guessed the OP was in Israel from his e-mail address. If that's
what Outlook Express is doing, then that's conclusive proof :-)

An aside to the OP: Pardon my ignorance, but does Hebrew have upper and
lower case?

> You may have to do some setup
> of your locale for proper handling of unicode.isupper, etc.,

Whatever gave you that impression?
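
For what it's worth, here is a small check of that point. It is only a sketch, and it assumes Python 3, where str plays the role of the unicode type used in this thread. The case tests consult the Unicode character database, so no locale setup appears anywhere in it. The sample characters are my own picks, not anything from the OP's text.

```python
# Python 3 sketch: str.isupper()/str.islower() consult the Unicode
# character database, so no locale call is made anywhere here.
samples = {
    "latin_upper": "\u00c9",   # LATIN CAPITAL LETTER E WITH ACUTE
    "greek_lower": "\u03b4",   # GREEK SMALL LETTER DELTA
    "hebrew_alef": "\u05d0",   # HEBREW LETTER ALEF (Hebrew letters are uncased)
}

# Map each sample name to the pair (isupper, islower).
results = {name: (ch.isupper(), ch.islower()) for name, ch in samples.items()}

for name, flags in results.items():
    print(name, flags)
```

The alef reports False for both tests, so a letter from a script without case would match none of the three patterns discussed here.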

> but I hope this
> gives you a jump start on your problem.
>
> -- Paul
>
>
> import sys
> import re
>
> uppers = u"".join( unichr(i) for i in range(sys.maxunicode)
>                     if unichr(i).isupper() )
> lowers = u"".join( unichr(i) for i in range(sys.maxunicode)
>                     if unichr(i).islower() )

Just in case the OP is running a 32-bit (UCS-4) unicode build, where
sys.maxunicode is 0x10FFFF and range() would build a list of over a
million integers first, you might want to make that xrange, not range :-)
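
A rough sketch of the same table-building step, assuming Python 3 rather than the Python 2 of the quoted script: there range() is already lazy and chr() replaces unichr(). The helper name cased_chars is made up for illustration. I have also used sys.maxunicode + 1 so that the last code point is not skipped.

```python
import sys

def cased_chars(predicate):
    """Join every code point whose one-character string satisfies predicate.

    cased_chars is a hypothetical helper name, not part of the quoted script.
    """
    # range() is lazy in Python 3, so no million-element list is built;
    # the + 1 makes the scan include sys.maxunicode itself.
    return "".join(chr(i) for i in range(sys.maxunicode + 1)
                   if predicate(chr(i)))

uppers = cased_chars(str.isupper)
lowers = cased_chars(str.islower)

print(len(uppers), len(lowers))
```

The exact counts depend on the Unicode version your interpreter ships with, so I won't quote any.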

>
> allUpperRe = ur"\b[%s]+\b" % uppers
> allLowerRe = ur"\b[%s]+\b" % lowers
> capWordRe = ur"\b[%s][%s]+\b" % (uppers,lowers)
>
> regexes = [
>     (allUpperRe, "all upper"),
>     (allLowerRe, "all lower"),
>     (capWordRe, "title case"),
>     ]
> for reString,label in regexes:
>     reg = re.compile(reString)
>     result = reg.findall(u" YES  yes Yes ")
>     print label,":",result
>
> Prints:
> all upper : [u'YES']
> all lower : [u'yes']
> title case : [u'Yes']
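
If regular expressions turn out to be more than the OP needs, the three tests can also be written with the string methods alone. This is a different technique from Paul's script, not a rewrite of it. It is a minimal sketch assuming Python 3 and whitespace-separated words, and the function name classify is made up for illustration.

```python
def classify(word):
    """Return 'all upper', 'all lower', 'title case', or None.

    Hypothetical helper: checks every character, so digits or
    punctuation inside a word make it match none of the classes.
    """
    if not word:
        return None
    if all(ch.isupper() for ch in word):
        return "all upper"
    if all(ch.islower() for ch in word):
        return "all lower"
    # First letter uppercase, every remaining letter lowercase.
    if word[0].isupper() and all(ch.islower() for ch in word[1:]):
        return "title case"
    return None

# Same test string as the quoted script, plus one mixed-case word.
labels = [(w, classify(w)) for w in " YES  yes Yes yEs ".split()]
print(labels)
```

Unlike the capWordRe pattern, a single uppercase letter on its own is reported here as "all upper", which is also how the allUpperRe pattern would treat it.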

Cheers,
John