[Tutor] Is the difference in outputs with different size input lists due to limits on memory with PYTHON?

Art Kendall Art at DrKendall.org
Thu May 6 14:06:18 CEST 2010


I am running Windows 7 Home Premium (64-bit) with quad CPUs and 8 GB of
memory. I am using Python 2.6.2.

I have all the Federalist Papers concatenated into one .txt file. I
want to prepare a file with a row for each paper and a column for each
term. The cells would contain the count of a term in that paper. In the
original application in the 1950s, 30 single-word terms were used. I can
now use NoteTab to get a list of all 8708 distinct words, saved in
allWords.txt. I can then use the resulting counts in statistical
exploration of the set of texts.
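
The table described above (one row per paper, one column per term) can be sketched on toy data as follows. This is only an illustration under stated assumptions: `count_terms`, `sample_papers` and `sample_terms` are hypothetical names with made-up text, and the terms are assumed to be plain words containing no regular-expression special characters.

```python
import re

def count_terms(papers, terms):
    """Return one row per paper; each row holds one count per term."""
    rows = []
    for text in papers:
        row = []
        for term in terms:
            # Whole-word, case-insensitive match (assumes plain-word terms)
            pattern = r"\b" + term + r"\b"
            row.append(len(re.findall(pattern, text, re.IGNORECASE)))
        rows.append(row)
    return rows

# Hypothetical miniature data, not taken from the Federalist Papers
sample_papers = ["The people and the union.", "Union of the states."]
sample_terms = ["the", "union", "states"]
table = count_terms(sample_papers, sample_terms)
```

Here `table` has two rows (papers) and three columns (terms), the same shape the counts file is meant to have.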

I have the Python program(?) syntax(?) script(?) below that I am using
to learn Python. The comments starting with "later" are things I will
try to do to make this more useful. I am getting one step at a time to work.
It works when the number of terms in the term list is small, e.g., 10. I
get a file with the correct number of rows (87) and count columns (10)
in termcounts.txt. The termcounts.txt file is not correct when I have a
larger number of terms, e.g., 100: I get a file with only 40 rows and
the correct number of columns. With 8700 terms I also get only 40 rows. I
need to be able to have about 8700 terms. (If this were FORTRAN I would
say that the subscript indices were getting scrambled.) (As I develop
this I would like to be open-ended in both the number of input papers and
the number of words/terms.)



# word counts: Federalist papers

import re, textwrap
# read the combined file and split into individual papers
# later create a new version that deals with all files in a folder
# rather than having papers concatenated
alltext = file("C:/Users/Art/Desktop/fed/feder16v3.txt").readlines()
papers= re.split(r'FEDERALIST No\.'," ".join(alltext))
print len(papers)

countsfile = file("C:/Users/Art/desktop/fed/TermCounts.txt", "w")
syntaxfile = file("C:/Users/Art/desktop/fed/TermCounts.sps", "w")
# later create a python program that extracts all words instead of using
# NoteTab
termfile   = open("C:/Users/Art/Desktop/fed/allWords.txt")
termlist = termfile.readlines()
termlist = [item.rstrip("\n") for item in termlist]
print len(termlist)
# check for SPSS reserved words
varnames = textwrap.wrap(" ".join([v.lower() in ['and', 'or', 'not', 'eq', 'ge',
    'gt', 'le', 'lt', 'ne', 'all', 'by', 'to', 'with'] and (v+"_r") or v
    for v in termlist]))
syntaxfile.write("data list file= 'c:/users/Art/desktop/fed/termcounts.txt' free/docnumber\n")
syntaxfile.writelines([v + "\n" for v in varnames])
syntaxfile.write(".\n")
# before using the syntax manually replace spaces internal to a string
# to underscore // replace(ltrim(rtrim(varname)), " ", "_")   replace any
# special characters with @ in variable names


for p in range(len(papers)):
    counts = []
    for t in termlist:
        counts.append(len(re.findall(r"\b" + t + r"\b", papers[p],
                                     re.IGNORECASE)))
    if sum(counts) > 0:
        papernum = re.search("[0-9]+", papers[p]).group(0)
        countsfile.write(str(papernum) + " " +
                         " ".join([str(s) for s in counts]) + "\n")
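
Two things that might be worth checking in the loop above are sketched here; neither is established as the cause of the missing rows. First, the script never closes countsfile, so some rows may still be sitting in a buffer when the file is inspected. Second, a term containing a regular-expression special character would be read as part of the pattern unless it is escaped. The sketch below uses a with-statement and `re.escape`; `write_counts`, `demo_papers`, `demo_terms` and `demo_path` are hypothetical names, and the input text is made up.

```python
import os
import re
import tempfile

def write_counts(papers, terms, path):
    """Write one line per paper: paper number, then one count per term."""
    # Compile each pattern once; re.escape treats the term as literal text
    patterns = [re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE)
                for t in terms]
    # The with-statement closes the file on exit, flushing buffered rows
    with open(path, "w") as out:
        for text in papers:
            counts = [len(p.findall(text)) for p in patterns]
            number = re.search(r"[0-9]+", text)
            # Skip chunks with no matches or no paper number
            if sum(counts) > 0 and number is not None:
                out.write(number.group(0) + " "
                          + " ".join(str(c) for c in counts) + "\n")

# Hypothetical miniature input standing in for the split papers
demo_papers = [" 1 To the people (of New York).", " 2 The same subject."]
demo_terms = ["the", "(of", "subject"]
demo_path = os.path.join(tempfile.mkdtemp(), "TermCounts.txt")
write_counts(demo_papers, demo_terms, demo_path)

# Read the file back only after it has been closed
with open(demo_path) as f:
    demo_rows = f.read().splitlines()
```

The term "(of" is included only to show that an unbalanced parenthesis does not raise an error once escaped. Without `re.escape` the same term would fail to compile as a pattern.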


Art

