[Tutor] NLTK

Ishan Puri ballerz4ishi at sbcglobal.net
Sat Aug 29 04:16:53 CEST 2009


Hi,
    Thanks for the confirmation. IM50re.txt is a plain text corpus. Let us say that we want to count the words in this corpus. In the NLTK book, there is an example.

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']

These are the texts that come with NLTK.

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427

So this is the number of words in one particular text, 'austen-emma.txt'. How would I do this
with my IM50re.txt? It seems the call "nltk.corpus.gutenberg.words" is specific to the Gutenberg corpus installed with NLTK.
Many examples like this are given for the different analyses that can be done with NLTK. However, they all seem to be specific
to one of the texts above or to another corpus already installed with NLTK. I am not sure how to apply these examples to my own corpus.

        Thank you. You are my only source of help right now; I have been trying to figure this out all day.




________________________________
From: Kent Johnson <kent37 at tds.net>
To: Ishan Puri <ballerz4ishi at sbcglobal.net>
Cc: *tutor python <tutor at python.org>
Sent: Friday, August 28, 2009 7:03:15 PM
Subject: Re: [Tutor] NLTK

On Fri, Aug 28, 2009 at 7:29 PM, Ishan Puri<ballerz4ishi at sbcglobal.net> wrote:
> Hi,
> >>> from nltk.corpus import PlaintextCorpusReader
> >>> corpus_root = r'C:\Users\Ishan\Documents'
> >>> wordlists = PlaintextCorpusReader(corpus_root, 'IM50re.txt')
> >>> wordlists.fileids()
> ['IM50re.txt']
>
> This is the result I get.

That seems to be working then. You should be able to get a list of words with
wordlists.words('IM50re.txt')
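
A minimal sketch of how that could mirror the len(emma) example from the NLTK book, using your own file instead of a Gutenberg text. The helper name count_words and the file sample.txt are made up for illustration, and the sample text is written to a temporary directory only so that the snippet runs on its own; for your case you would pass your own corpus_root and 'IM50re.txt'.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader


def count_words(corpus_root, fileid):
    """Return the number of tokens in one plain-text file under corpus_root.

    count_words is a hypothetical helper, not part of NLTK.
    """
    # Passing the file ids as a list avoids treating the name as a regex.
    reader = PlaintextCorpusReader(corpus_root, [fileid])
    # words() returns a lazy sequence of tokens, so len() works the same
    # way as len(emma) does for the Gutenberg texts.
    return len(reader.words(fileid))


if __name__ == "__main__":
    # Stand-in corpus so the example is self-contained (assumption:
    # replace this with your real directory and file name).
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "sample.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("The cat sat on the mat. The cat slept.")
        print(count_words(root, "sample.txt"))
```

Note that the reader's default tokenizer splits punctuation into separate tokens, so the count includes punctuation marks as well as words. The same appears to be true of the 192427 figure for austen-emma.txt.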

> I was wondering how I can use the packages on
> IM50re.txt? I followed successfully the steps detailed under Using Your Own
> Corpus. What do I do next, say, if I wanted to use the lemmatizer on this
> .txt document?

I have no idea. Is IM50re.txt a plain text corpus? What is a package?
What is a lemmatizer?

I don't know anything about NLTK, I'm just good at reading manuals.
You have to give me more help than that. What have you tried? Can you
find an example that is similar to what you want to do? Don't assume I
know what you are talking about :-)

Kent