[Tutor] Opening Multiple Files

Fri Aug 17 10:08:09 CEST 2007

Paulo Quaglio wrote:
> Hi everyone,
> Thanks for all suggestions. Let me just preface this by saying that 
> I’m new to both python and programming. I started learning 3 months 
> ago with online tutorials and reading the questions you guys post. So, 
> thank you all very, very much…and I apologize if I’m doing something 
> really stupid..:-) OK. I’ve solved the problem of opening several 
> files to process “as a batch” with glob.glob(). Only now did I realize 
> that the program and files need to be in the same folder…. Now I have 
> another problem.
> 1- I want to open several files and count the total number of words. 
> If I do this with only 1 file, it works great. With several files ( 
> now with glob), it outputs the total count for each file individually 
> and not the whole corpus (see comment in the program below).
> 2- I also want the program to output a word frequency list (we do this 
> a lot in corpus linguistics). When I do this with only one file, the 
> program works great (with a dictionary). With several files, I end up 
> with several frequency lists, one for each file. This sounds like a 
> loop type of problem, doesn’t it? I looked at the indentations too and 
> I can’t find what the problem is. Your comments, suggestions, etc are 
> greatly appreciated. Thanks again for all your help. Paulo
> Here goes what I have.
I'm going to make some general observations as well as try to answer 
your original question.
> # The program is intended to output a word frequency list (including 
> all words in all files) and the total word count
> def sortfile(): # I created a function
I think analyze_corpus or something in this vein would be a better name, 
because this doesn't sort files as far as I can tell.
> filename = glob.glob('*.txt') # this works great! Thanks!
This is not really a file name but a list of them, so 'filelist' would 
be more appropriate, or just 'files'.
> for allfiles in filename:
'allfiles' is not a very appropriate name. Each time through the loop, 
the variable 'allfiles' contains a different item from the 'filename' 
list. It doesn't contain all the files simultaneously. Since each item 
is a file name string, 'fname' or something of this sort would be 
better. I'd avoid 'file', because that is the name of a built-in type.
> infile = open(allfiles, 'r')
> lines = list(infile)
It would be clearer to me what you're doing if you used the file 
object method 'readlines':
lines = infile.readlines()
Both list(infile) and infile.readlines() give you the same list of 
lines, so this is only a readability suggestion.
readlines is what I usually see here, though.
> infile.close()
> words = [] # initializes list of words
> wordcounter = 0
> for line in lines:
Here your loop variable name is more appropriate, because it's singular 
(which it should be, because it's a single object it's bound to; 
exceptions would be when iterating over lists of lists, for example).
> line = line.lower() # after this, I have some clunky code to get rid 
> of punctuation
> words = words + line.split()
You should probably use the 'extend' method of the list rather than 
building a new list with + every time,
words.extend(line.split())
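Note that 'append' and 'extend' behave differently with the list that 
split() returns. A quick sketch (the sample line and the names 'nested' 
and 'flat' are made up for illustration):

```python
line = "the cat sat"

nested = []
nested.append(line.split())   # adds the whole list as ONE element
# nested is now [['the', 'cat', 'sat']]

flat = []
flat.extend(line.split())     # adds each word as its own element
# flat is now ['the', 'cat', 'sat']
```

For a flat list of words that you can count later, 'extend' is the one 
that matches what words = words + line.split() was doing.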
As far as removing punctuation, you can just build your whole words list 
in one line, like so:
words = [''.join([x for x in word if x.isalpha()]) for word in line.split()]
Replace isalpha with isalnum if you want to keep digits too.
Note that this leaves empty strings in the list as a side effect, for 
any token that is nothing but punctuation. It's too late in the 
night for me to come up with a fix for this,
but you probably shouldn't use that list comprehension anyway, 
especially if you have no idea what it's doing.
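For what it's worth, one possible way around the blank words is to 
spell the comprehension out as a loop and skip anything that ends up 
empty. This is only a sketch; the function name 'clean_words' is 
hypothetical, and it assumes you want to drop everything that is not a 
letter (so "it's" becomes "its"):

```python
def clean_words(line):
    """Return the lowercased words in line with non-letters removed."""
    cleaned = []
    for word in line.lower().split():
        # keep only the alphabetic characters of this token
        letters = ''.join([ch for ch in word if ch.isalpha()])
        if letters:  # skip tokens that were nothing but punctuation
            cleaned.append(letters)
    return cleaned
```

With this, clean_words("Hello, world -- it's me!") gives 
['hello', 'world', 'its', 'me'], with no empty string for the "--".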
> wordfreq = [words.count(wrd)for wrd in words] # counts the freq of 
> each word in a list
Put a space between 'words.count(wrd)' and 'for' so it's clearer what's 
happening.
words.count is the built-in list method, so this does work, but it 
scans the whole list once for every word in it, which gets slow on a 
big corpus.
I don't know if a list comprehension is the best way to do this, because 
words contains repeated entries, so the same word gets counted over and 
over.
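A common alternative is to walk the list once and keep a running count 
per word in a dictionary. A minimal sketch (the function name 
'count_frequencies' is just for illustration):

```python
def count_frequencies(words):
    """Count how often each word occurs, in a single pass over the list."""
    freq = {}
    for word in words:
        # get() returns 0 the first time a word is seen
        freq[word] = freq.get(word, 0) + 1
    return freq
```

Each distinct word ends up as one key, so repeated entries in 'words' 
are handled without any extra work.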
> dictionary = dict(zip(words, wordfreq))
> frequency_list = [(dictionary[key], key)for key in dictionary]
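The dictionary is what collapses the repeated entries here, since 
zip(words, wordfreq) has one pair for every occurrence of a word. You 
could skip the intermediate wordfreq list by building the dictionary of 
counts directly.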
> frequency_list.sort()
> frequency_list.reverse()
> for item in frequency_list:
> wordcounter = wordcounter + 1
> print item
> print "Total # of words:", wordcounter # this will give the word count 
> of the last file the program reads.
> print "Total # of words:", wordcounter # if I put it here, I get the 
> total count after each file
Don't just put it in random places; think about what your loop is doing 
and how to fix the problem.
Your outermost loop is looping over each file. Within this loop you 
count the number of words in the file.
Logically, for a total tally of the number of words in all files, you'd 
have a variable defined before the start of the loop,
and then add the tally of each file's words to the total tally variable. 
Does this make sense? Can you figure out how to do this?
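In case it helps to see the shape of it, here is a rough sketch of that 
pattern. The function name 'total_word_count' is hypothetical, and it 
counts every whitespace-separated token. (Note that in your version, 
wordcounter goes up once per item in frequency_list, so it is really 
counting distinct words, not total words.)

```python
def total_word_count(filenames):
    """Add up the number of whitespace-separated words in all the files."""
    total = 0  # defined once, before the loop over files
    for fname in filenames:
        infile = open(fname, 'r')
        text = infile.read()
        infile.close()
        # add this file's tally to the running total
        total = total + len(text.split())
    return total  # only available after every file has been read
```

You would call it as total_word_count(glob.glob('*.txt')) and print the 
result once, after the call returns.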

Not sure what your frequency problem is. Try to abstract what your code 
is doing to as high a level as you can, and it should be easier to 
understand.
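As one possible higher-level breakdown (a sketch only; all three 
function names are made up, and punctuation handling is left out to 
keep it short): read the words of one file, add them to a single 
dictionary that lives outside the file loop, and build the sorted list 
once at the end.

```python
def read_words(fname):
    """Return the lowercased, whitespace-separated words of one file."""
    infile = open(fname, 'r')
    text = infile.read()
    infile.close()
    return text.lower().split()

def corpus_frequencies(filenames):
    """Count words across ALL files in one shared dictionary."""
    freq = {}  # created once, so counts carry over from file to file
    for fname in filenames:
        for word in read_words(fname):
            freq[word] = freq.get(word, 0) + 1
    return freq

def frequency_list(freq):
    """Return (count, word) pairs, most frequent first."""
    pairs = [(count, word) for word, count in freq.items()]
    pairs.sort()
    pairs.reverse()
    return pairs
```

Because the dictionary is created before the loop over files and the 
list is built after it, you get one frequency list for the whole corpus 
instead of one per file.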
> sortfile()


P.S. Next time, use a more standard font or include your code as an 
attachment, preferably the attachment.
It was hard to read, and I suspect this will reduce your number of 
replies a decent amount.
I wouldn't have even read this if someone else had replied already, 
simply because the font is hard to read.
-Luke