a -very- case sensitive search

Paul McGuire ptmcg at austin.rr._bogus_.com
Sun Nov 26 06:11:35 EST 2006


"Ola K" <olakh at walla.co.il> wrote in message 
news:1164490795.866046.133230 at 45g2000cws.googlegroups.com...
> Hi,
> I am pretty new to Python and I want to make a script that will search
> for the following options:
> 1) words made of uppercase characters -only- (like "YES")
> 2) words made of lowercase characters -only- (like "yes")
> 3) and words with only the first letter capitalized (like "Yes")
> * and I need to do all these considering the fact that not all letters
> are indeed English letters.
>
> I went through different documentation sections but couldn't find the right
> condition, function or method for it.
> Suggestions will be very much appreciated...
> --Ola
>
Ola,

You may be new to Python, but are you new to regular expressions too?  I am 
no wiz at them, but here is a script that takes a stab at what you are 
trying to do. (For more regular expression info, see 
http://www.amk.ca/python/howto/regex/.)

The script has these steps:
- create strings containing all unicode chars that are considered "lower" 
and "upper", using the unicode.is* methods
- use these strings to construct 3 regular expressions (or "re"s), one for 
words of all lowercase letters, one for words of all uppercase letters, and 
one for words that start with an uppercase letter followed by at least one 
lowercase letter.
- use each re to search the string u"YES yes Yes", and print the found 
matches

I've used unicode strings throughout, so this should be applicable to your 
text consisting of letters beyond the basic Latin set (since Outlook Express 
is trying to install Hebrew fonts when reading your post, I assume these 
are the characters you are trying to handle).  The unicode.isupper and 
unicode.islower methods rely on Python's built-in Unicode character data, so 
no locale setup should be needed for them (the locale only matters for the 
plain 8-bit str versions).  I hope this gives you a jump start on your 
problem.
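
To see how the case tests behave on letters outside the basic Latin set, here 
is a minimal sketch in Python 3 syntax (where every str is Unicode).  The 
helper name case_kind is made up for illustration, and it checks each 
character individually so that digits or punctuation cannot slip through.

```python
def case_kind(word):
    """Classify a word as 'upper', 'lower', 'title' or 'other'.

    Hypothetical helper, not part of any library.
    """
    if not word:
        return "other"
    # Every single character must be an uppercase letter.
    if all(c.isupper() for c in word):
        return "upper"
    # Every single character must be a lowercase letter.
    if all(c.islower() for c in word):
        return "lower"
    # One uppercase letter followed by at least one lowercase letter.
    if (len(word) > 1 and word[0].isupper()
            and all(c.islower() for c in word[1:])):
        return "title"
    return "other"


if __name__ == "__main__":
    # "\u0414\u0430" is a capitalized Cyrillic word,
    # "\u05e9\u05dc\u05d5\u05dd" is a Hebrew word.
    for w in ["YES", "yes", "Yes", "\u0414\u0430", "\u05e9\u05dc\u05d5\u05dd"]:
        print(w, case_kind(w))
```

Hebrew letters have no upper or lower case, so a Hebrew word falls into 
"other" here, and neither isupper nor islower is true for it.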

-- Paul


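import sys
import re

# Build strings of every character that unicode.isupper / unicode.islower
# accept.  The +1 makes the range include sys.maxunicode itself.
uppers = u"".join( unichr(i) for i in xrange(sys.maxunicode + 1)
                    if unichr(i).isupper() )
lowers = u"".join( unichr(i) for i in xrange(sys.maxunicode + 1)
                    if unichr(i).islower() )

# One character-class pattern per kind of word.
allUpperRe = ur"\b[%s]+\b" % uppers
allLowerRe = ur"\b[%s]+\b" % lowers
capWordRe = ur"\b[%s][%s]+\b" % (uppers,lowers)

regexes = [
    (allUpperRe, "all upper"),
    (allLowerRe, "all lower"),
    (capWordRe, "title case"),
    ]
for reString,label in regexes:
    # re.UNICODE makes \b treat non-ASCII letters as word characters;
    # without it, word boundaries are only found around ASCII words.
    reg = re.compile(reString, re.UNICODE)
    result = reg.findall(u" YES  yes Yes ")
    print label,":",result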
Prints:
all upper : [u'YES']
all lower : [u'yes']
title case : [u'Yes']
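
For anyone on Python 3, where str is already Unicode and unichr is spelled 
chr, a rough port of the script above might look like the following.  The 
function name find_case_words and the Cyrillic sample text are made up for 
illustration.  Python 3 str patterns are Unicode-aware by default, so no 
extra flag is passed to re.compile.

```python
import re
import sys

# Collect every code point that str.isupper / str.islower accept.
_all_chars = [chr(i) for i in range(sys.maxunicode + 1)]
uppers = "".join(c for c in _all_chars if c.isupper())
lowers = "".join(c for c in _all_chars if c.islower())

# Same three patterns as the original script, compiled once.
PATTERNS = {
    "all upper": re.compile(r"\b[%s]+\b" % uppers),
    "all lower": re.compile(r"\b[%s]+\b" % lowers),
    "title case": re.compile(r"\b[%s][%s]+\b" % (uppers, lowers)),
}


def find_case_words(text):
    """Return a dict mapping each label to the list of matching words."""
    return {label: rx.findall(text) for label, rx in PATTERNS.items()}


if __name__ == "__main__":
    # Latin words plus the Cyrillic equivalents of "DA da Da".
    sample = " YES  yes Yes \u0414\u0410 \u0434\u0430 \u0414\u0430 "
    for label, words in find_case_words(sample).items():
        print(label, ":", words)
```

Building the two character strings walks the whole Unicode range once, so it 
is worth doing at import time rather than on every search.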




