[Tutor] Is the difference in outputs with different size input lists due to limits on memory with PYTHON?
Art Kendall
Art at DrKendall.org
Thu May 6 14:06:18 CEST 2010
I am running Windows 7 64-bit Home Premium with a quad-core CPU and 8 GB
of memory. I am using Python 2.6.2.
I have all the Federalist Papers concatenated into one .txt file. I
want to prepare a file with a row for each paper and a column for each
term; each cell would contain the count of that term in that paper. In
the original application in the 1950s, 30 single-word terms were used. I
can now use NoteTab to get a list of all 8708 separate words into
allWords.txt. I can then use that data in statistical exploration of the
set of texts.
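To make the target layout concrete, here is a minimal self-contained
sketch of the row-per-paper, column-per-term counts I am after (the two
miniature "papers" and the three terms are made-up stand-ins, not my
real files):

```python
import re

# hypothetical miniature corpus: paper number -> text
papers = {1: "taxes and liberty and taxes",
          2: "liberty of the states"}
terms = ["taxes", "liberty", "states"]

# one row per paper: the paper number, then one count column per term
rows = []
for num in sorted(papers):
    counts = [len(re.findall(r"\b" + t + r"\b", papers[num]))
              for t in terms]
    rows.append(str(num) + " " + " ".join(str(c) for c in counts))

for row in rows:
    print(row)
```

With the toy data above this prints "1 2 1 0" and "2 0 1 1", which is
the shape of file I want statistics software to read back in.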
I have the Python program(?) syntax(?) script(?) below that I am using
to learn Python. The comments starting with "later" are things I will
try to do to make this more useful; I am getting one step at a time to work.
It works when the number of terms in the term list is small, e.g., 10: I
get a file, termcounts.txt, with the correct number of rows (87) and
count columns (10). The termcounts.txt file is not correct when I have a
larger number of terms, e.g., 100: I get a file with only 40 rows and
the correct number of columns. With 8700 terms I also get only 40 rows,
and I need to be able to handle about 8700 terms. (If this were FORTRAN
I would say that the subscript indices were getting scrambled.) (As I
develop this I would like to be open-ended about the number of input
papers and open-ended about the number of words/terms.)
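One thing I plan to rule out (a guess on my part, not a diagnosis): I
never close countsfile, so with larger outputs the tail of the buffered
rows may not yet have been flushed to disk when I inspect the file. A
minimal sketch of closing explicitly via a with block, which Python 2.6
supports (the path and rows here are made up):

```python
import os
import tempfile

# a throwaway path stands in for TermCounts.txt
path = os.path.join(tempfile.mkdtemp(), "TermCounts_demo.txt")

# 'with' closes the file when the block ends, so every buffered row
# is flushed to disk before anything reads the file back
with open(path, "w") as countsfile:
    for row in ["1 2 1 0", "2 0 1 1"]:
        countsfile.write(row + "\n")

# read it back: both rows are present
lines = open(path).read().splitlines()
print(len(lines))
```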
# word counts: Federalist papers
import re, textwrap

# read the combined file and split into individual papers
# later: create a new version that deals with all files in a folder
# rather than having papers concatenated
alltext = file("C:/Users/Art/Desktop/fed/feder16v3.txt").readlines()
papers = re.split(r'FEDERALIST No\.', " ".join(alltext))
print len(papers)

countsfile = file("C:/Users/Art/desktop/fed/TermCounts.txt", "w")
syntaxfile = file("C:/Users/Art/desktop/fed/TermCounts.sps", "w")

# later: create a python program that extracts all words instead of
# using NoteTab
termfile = open("C:/Users/Art/Desktop/fed/allWords.txt")
termlist = termfile.readlines()
termlist = [item.rstrip("\n") for item in termlist]
print len(termlist)

# check for SPSS reserved words
varnames = textwrap.wrap(" ".join(
    [v.lower() in ['and', 'or', 'not', 'eq', 'ge', 'gt', 'le',
                   'lt', 'ne', 'all', 'by', 'to', 'with']
     and (v + "_r") or v for v in termlist]))

syntaxfile.write(
    "data list file= 'c:/users/Art/desktop/fed/termcounts.txt' free/docnumber\n")
syntaxfile.writelines([v + "\n" for v in varnames])
syntaxfile.write(".\n")

# before using the syntax, manually replace spaces internal to a string
# with underscores, i.e. replace(ltrim(rtrim(varname)), " ", "_"), and
# replace any special characters in variable names with @
for p in range(len(papers)):
    counts = []
    for t in termlist:
        counts.append(len(re.findall(r"\b" + t + r"\b", papers[p],
                                     re.IGNORECASE)))
    if sum(counts) > 0:
        papernum = re.search("[0-9]+", papers[p]).group(0)
        countsfile.write(str(papernum) + " " +
                         " ".join([str(s) for s in counts]) + "\n")
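To convince myself the split-and-number-extraction step behaves the way
I expect, here is a tiny sketch in the same shape as my concatenated
file (the stand-in text is made up, not the real feder16v3.txt):

```python
import re

# made-up stand-in for the concatenated file of papers
alltext = ("FEDERALIST No. 1 To the People of the State of New York ... "
           "FEDERALIST No. 2 It has often given me pleasure ...")

# splitting on the header leaves an empty first piece (whatever text
# precedes the first header), then one piece per paper
papers = re.split(r'FEDERALIST No\.', alltext)
print(len(papers))  # 3 pieces: the leading empty string plus two papers

# the first run of digits in each non-empty piece is the paper number
numbers = [re.search("[0-9]+", p).group(0) for p in papers if p.strip()]
print(numbers)
```

So with 87 papers I would expect len(papers) to be 88 because of that
leading empty piece, which my sum(counts) > 0 test is meant to skip.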
Art