Case tagging and Python

Thu Jul 31 16:11:08 EDT 2008

Hi, I came up with the following procedure:

import re

ALLCAPS = "|ALLCAPS"
NOCAPS = "|NOCAPS"
MIDCAPS = "|MIDCAPS"
CAPS = "|CAPS"
DIGIT = "|DIGIT"

def test_case(w):

    w_out = ''

    if w.isalpha(): # words with a comma attached do not get in here
        if w.isupper():
            w_out = w.lower() + ALLCAPS
            return w_out
        elif w.islower():
            w_out = w + NOCAPS
            return w_out
        else:
            m = re.match("^[A-Z]", w)
            if m:
                w_out = w.lower() + CAPS # not sure about this..
                return w_out
            else:
                w_out = w.lower() + MIDCAPS
                return w_out
    elif w.isdigit():
        w_out = w + DIGIT
        return w_out
    # anything else (e.g. "word,") falls through and returns None

It is called from here:
#=========================
    lines = 0
    for s in file:
        lines += 1
        if lines % 1000 == 0:
            print('%d lines' % lines)
        #s = s.replace(",","")
        sent = s.split() # split the line on whitespace
        for w in sent:
            wout = test_case(w)
#==========================

But I don't know whether what I'm doing is sensible. Moreover:

- test_case has a problem: whenever it finds a punctuation character
attached to a word, it doesn't tag that word. I was thinking of
stripping the punctuation from the line before calling split() on it (see
the commented-out line), but I don't know whether I have to call replace()
once for every punctuation character.
- Is there a way to write the tagged text back to a file, including the
punctuation?
- Is my test_case a good start? Would you use regular expressions?
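
For what it's worth, here is a rough sketch of the regular-expression route I have in mind, not a definitive solution. It assumes that a token is either a run of word characters or one single punctuation character. The names TOKEN_RE, tag_token and tag_line are made up for this illustration, and the tagging rules are simply copied from test_case. Punctuation is passed through untagged, so the tagged line can be written out with the punctuation still in it.

```python
import re

# Tag suffixes copied from the procedure in this post.
ALLCAPS = "|ALLCAPS"
NOCAPS = "|NOCAPS"
MIDCAPS = "|MIDCAPS"
CAPS = "|CAPS"
DIGIT = "|DIGIT"

# Hypothetical tokenizer (an assumption, not part of the original code):
# either a run of word characters, or one non-space, non-word character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tag_token(tok):
    """Tag one token; punctuation and mixed tokens are returned unchanged."""
    if tok.isalpha():
        if tok.isupper():
            return tok.lower() + ALLCAPS
        if tok.islower():
            return tok + NOCAPS
        # Mixed case: same first-letter test as test_case.
        if re.match("[A-Z]", tok):
            return tok.lower() + CAPS
        return tok.lower() + MIDCAPS
    if tok.isdigit():
        return tok + DIGIT
    # Not purely alphabetic or numeric (e.g. "," or "abc123"): leave as is.
    return tok


def tag_line(line):
    """Split a line into words and punctuation, tag each, rejoin with spaces."""
    return " ".join(tag_token(t) for t in TOKEN_RE.findall(line))


if __name__ == "__main__":
    print(tag_line("Hello, WORLD! it costs 42 iPod"))
```

Note that this puts a space before every punctuation mark and splits a word like don't into don, ' and t, so the original spacing is not reproduced exactly. Each tagged line could then be written to an output file with the usual write() call plus a newline.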

Thanks very much!
F.