NEED HELP-process words in a text file

Chris Rebert clp2 at rebertia.com
Sat Jun 18 20:16:31 EDT 2011


On Sat, Jun 18, 2011 at 4:21 PM, Cathy James <nambo4jb at gmail.com> wrote:
> Dear Python Experts,
>
> First, I'd like to convey my appreciation to you all for your support
> and contributions.  I am a Python newborn and need help with my
> function. I commented on my program as to what it should do, but
> nothing is printing. I know I am off, but not sure where. Please
> help:(
>
> import string
> def fileProcess(filename):
>    """Call the program with an argument,
>    it should treat the argument as a filename,
>    splitting it up into words, and computes the length of each word.
>    print a table showing the word count for each of the word lengths
> that has been encountered.
>    Example:
>    Length Count
>    1 16
>    2 267
>    3 267
>    4 169
>    >>>"&"
>    Length    Count
>    0    0
>    >>>
>    >>>"right."
>    Length    Count
>    5    10
>    """
>    freq = [] #empty dict to accumulate words and word length

Er, that's an empty *list*, not an empty dict. Dicts use curly braces,
i.e. {}. Lists use square brackets, i.e. [].
So:

freq = {}

>    filename=open('declaration.txt, r')

1. You should be using the passed-in filename; you're currently
ignoring the function's argument and just hardcoding the filename as
declaration.txt.
2. You're missing 2 quotes inside the open() call. It should be:
open('declaration.txt', 'r')
3. `filename` is misnamed; you're using it for a file object as
opposed to a string representing the name of the file.

Taking all that into account:

f = open(filename, 'r')
for line in f:
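
As an aside, wrapping the open() call in a `with` block closes the file
automatically once the loop is done. A minimal sketch, assuming a
hypothetical helper named count_lines and a throwaway temp file that
exists only for illustration:

```python
import os
import tempfile

def count_lines(filename):
    """Return the number of lines in the file named `filename`."""
    total = 0
    # The with-statement closes the file even if the loop raises.
    with open(filename, 'r') as f:
        for line in f:
            total += 1
    return total

# Illustration only: write a small temp file, then read it back.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as tmp:
    tmp.write("We hold these truths\nto be self-evident\n")
print(count_lines(path))  # prints 2
os.remove(path)
```

Note that the function uses its `filename` argument rather than a
hardcoded name, which is the point of item 1 above.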

>    for line in filename:
>        punc = string.punctuation + string.whitespace#use Python's
> built-in punctuation and whiitespace
>        for i, word in enumerate (line.replace (punc, "").lower().split()):

str.replace() does not match the characters of the needle string as a
set. Rather, it matches it as a contiguous sequence of characters. By
way of example:
>>> "abc abc abc".replace("ac", "Q") # no effect
'abc abc abc'
>>> "abc abc abc".replace("bc", "Q")
'aQ aQ aQ'

Order matters; the needle is a substring, not a set of characters.

(Jargon: needle = what you're searching for; as opposed to: haystack =
what you're searching through).

Also, since whitespace is part of `punc`, even if str.replace() were to
have the semantics you thought it did, you'd end up with
nospacesbetweenthewords whatsoever, making the str.split() call quite
useless.

So, to rewrite this, let's first define a helper function to remove
all the punctuation from a string:

def withoutPunct(word):
    # Look up "generator expressions" if you don't understand this code.
    return ''.join(char for char in word if char not in string.punctuation)
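
To check the helper in isolation, here is a small self-contained
sketch. The sample inputs are just illustrations (two of them come from
the docstring examples); note that a token consisting only of
punctuation comes back as an empty string:

```python
import string

def withoutPunct(word):
    # Keep every character that is not in string.punctuation.
    return ''.join(char for char in word if char not in string.punctuation)

print(repr(withoutPunct("right.")))  # 'right'
print(repr(withoutPunct("&")))       # ''
print(repr(withoutPunct("don't")))   # 'dont'
```

Because whitespace is left alone, str.split() can still be used to
break the line into words before the helper is applied to each one.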

Now, rewriting everything inside the enumerate() call:
withoutPunct(word) for word in line.split()

In fact, you never use `i`, so there's no need to use enumerate in
the first place in the inner for-loop:

words = (withoutPunct(word) for word in line.split())
for word in words:

>            if word in freq:
>                freq[word] +=1 #increment current count if word already in dict
>
>            else:
>                freq[word] = 0 #if punctuation encountered,
> frequency=0 word length = 0

Problem: What about the very first time you see a word? It won't be in
freq, so you'll set its count to 0, when in fact you've now seen it
once.
Moreover, we don't care about what the words actually are; we only
care about their lengths. So freq should use word lengths, not the
actual words themselves, as keys.
Corrected version (there are several ways to do this):

length = len(word)
freq[length] = freq.get(length, 0) + 1 # See dict.get() docs for details
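
To see the dict.get() idiom at work, here is a short sketch on a
made-up word list (the words are only an illustration, not taken from
your file):

```python
words = ["we", "hold", "these", "truths", "to", "be"]

freq = {}
for word in words:
    length = len(word)
    # get() returns 0 the first time a length is seen, so the
    # first occurrence is counted as 1 rather than 0.
    freq[length] = freq.get(length, 0) + 1

print(freq[2])  # 3  ("we", "to", "be")
print(freq[4])  # 1  ("hold")
```

collections.Counter is another way to do the same bookkeeping, if you
prefer not to manage the default yourself.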

>        for word in freq.items():

`.items()` returns a collection of key-value pairs, not a collection of
keys. If you just want the keys, omit the `.items()`.
Also, this seems to be indented wrong. You're running the output loop
once per line rather than once per file.
Finally, the dictionary yields its keys/items in no particular order;
based on the sample output, you'll need to sort the word lengths if
you want to output the table's rows in ascending order.
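
Putting the pieces together, one possible shape for the whole thing is
sketched below. This is only a sketch under some assumptions: the
helper name countLengths is made up for illustration, tokens that are
nothing but punctuation are skipped (your docstring is ambiguous on
that point), and the table is tab-separated.

```python
import string

def withoutPunct(word):
    return ''.join(char for char in word if char not in string.punctuation)

def countLengths(lines):
    """Map each word length to the number of words of that length."""
    freq = {}
    for line in lines:
        for word in (withoutPunct(w) for w in line.split()):
            if not word:  # assumption: skip tokens that were only punctuation
                continue
            length = len(word)
            freq[length] = freq.get(length, 0) + 1
    return freq

def fileProcess(filename):
    """Print a table of word counts per word length for the named file."""
    with open(filename, 'r') as f:
        freq = countLengths(f)
    # The output loop runs once per file, after every line is counted.
    print("Length\tCount")
    for length in sorted(freq):  # ascending word length
        print("%d\t%d" % (length, freq[length]))
    return freq

# Any iterable of lines works for the counting step:
print(countLengths(["Hello, world!", "a & b"]))
```

Keeping the counting separate from the file handling makes it easy to
try the logic on a few strings before pointing it at declaration.txt.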

Cheers,
Chris
--
http://rebertia.com


