Case tagging and Python

Thu Jul 31 07:00:52 EDT 2008

Hi,

I'm relatively new to programming in general, and totally new to Python,
and I've been told that this language is particularly good for what I
need to do. Let me explain.
I have a large corpus of English text, in the form of several files.

First of all I would like to scan each file. Then, for each word I find,
I'd like to examine its case status and write the lower-cased word to
another text file, with a tag appended stating the case it had in the
original file.

An example: suppose we have three possible "case conditions":
-all lowercase
-all uppercase
-initial uppercase only

Three corresponding tags for each of these might be, respectively:
-nocap
-allcaps
-cap
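
To make the rule concrete, here is a minimal sketch of how I imagine the
check could work, using Python's built-in string methods. The name
check_case is just my own placeholder, and two things are assumptions on my
part: a mixed-case word such as "McDonald" gets "cap" because it starts with
a capital, and a one-letter capital such as "I" gets "allcaps".

```python
def check_case(word):
    """Return the case tag for a single word (hypothetical helper)."""
    # str.isupper() is True when every cased character is upper case
    # and the word has at least one cased character.
    if word.isupper():
        return "allcaps"
    # Assumption: any other word that starts with a capital counts as "cap".
    if word[:1].isupper():
        return "cap"
    # Everything else (lower case, digits, empty string) is "nocap".
    return "nocap"


for word in ["The", "BP", "of"]:
    print(word, check_case(word))
```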

Therefore, given the string

"The Chairman of BP was asleep"

I would like to produce

"the/cap chairman/cap of/nocap bp/allcaps was/nocap asleep/nocap"

and write this into a file.

I have the following algorithm in mind:

-open input file
-open output file
-get line of text
	-split line into words
	-for each word
		-tag = checkCase(word)
		-newword = lowercase(word) + append(tag)
	-rejoin words into line
	-write line into output file
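
As a rough sketch of where this algorithm might end up, assuming words are
simply whatever split() returns (so punctuation stays attached to the word).
The names check_case, tag_line and tag_stream are hypothetical, and the
in-memory io.StringIO objects only stand in for real files so that the
sketch runs on its own.

```python
import io


def check_case(word):
    """Return 'allcaps', 'cap' or 'nocap' (assumed tagging rule)."""
    if word.isupper():
        return "allcaps"
    if word[:1].isupper():
        return "cap"
    return "nocap"


def tag_line(line):
    """Lower-case each word of a line and append '/' plus its case tag."""
    words = line.split()  # split on any whitespace
    tagged = ["%s/%s" % (w.lower(), check_case(w)) for w in words]
    return " ".join(tagged)  # rejoin the words into one line


def tag_stream(infile, outfile):
    """Read infile line by line and write the tagged lines to outfile."""
    for line in infile:
        outfile.write(tag_line(line) + "\n")


# In-memory stand-ins for the input and output files.
source = io.StringIO("The Chairman of BP was asleep\n")
target = io.StringIO()
tag_stream(source, target)
print(target.getvalue())
```

With real files, the same function could presumably be called on two objects
returned by open(), one opened for reading and one opened with mode "w".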

Now, I managed to write the following initial code

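    lines = 0  # running count of lines read so far
    for s in file:  # file is the already-opened input file
        lines += 1
        if lines % 1000 == 0:
            print('%d lines' % lines)  # report progress every 1000 lines
        sent = s.split()  # split the line into words on whitespace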
#...

But then I don't quite know what would be the fastest/best way to do
this. Could I use the join function to re-form the string? And, regarding
the checkCase() function, what would you suggest? Should I test each
character of each word, or are there faster methods?
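
To show what I mean by the two options, here is a small sketch comparing a
character-by-character test with the built-in str.isupper() method. The
helper name all_upper_by_char is made up for illustration. As far as I can
tell the two agree on plain words but not on a word containing punctuation
such as "B.P.", because isupper() ignores characters that have no case while
the per-character test does not.

```python
def all_upper_by_char(word):
    """Character-by-character test: True only if every character is upper case."""
    return len(word) > 0 and all(ch.isupper() for ch in word)


# Print each word with the result of both tests side by side.
for word in ["BP", "Chairman", "B.P."]:
    print(word, all_upper_by_char(word), word.isupper())
```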

Thanks very much,

F.