NLTK

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Aug 6 20:41:21 EDT 2018


On Fri, 03 Aug 2018 07:49:40 +0000, mausg wrote:

> I like to analyse text. my method consisted of something like
> words=text.split(), which would split the text into space-separated
> units. 

In natural language, words are more complicated than just space-separated 
units. Some languages don't use spaces as a word delimiter. Some don't 
use word delimiters at all. Even in English, we have *compound words*, 
which exist in three forms:

- open: "ice cream"
- closed: "notebook"
- hyphenated: "long-term"

Recognising open compound words is difficult. "Real estate" is an open 
compound word, but "real cheese" and "my estate" are both two words.
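
If you have a known list of open compounds, NLTK's MWETokenizer can 
merge them back into single tokens after an initial split. A rough 
sketch (the ('real', 'estate') entry is just an example; you would 
supply your own list):

>>> from nltk.tokenize import MWETokenizer
>>> tokenizer = MWETokenizer([('real', 'estate')], separator=' ')
>>> tokenizer.tokenize("the real estate market and real cheese".split())
['the', 'real estate', 'market', 'and', 'real', 'cheese']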

Another problem in English is deciding whether to treat contractions 
as a single word or to split them:

"don't" --> "do" "n't"

"they'll" --> "they" "'ll"

Punctuation marks should either be stripped out of sentences before 
splitting into words, or treated as distinct tokens. We don't want 
"tokens" and "tokens." to be treated as distinct words, just because one 
happened to fall at the end of a sentence and one didn't.
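
For example, roughly from memory (the exact output may vary a little 
between NLTK versions):

>>> from nltk import word_tokenize
>>> text = "These are tokens. More tokens follow."
>>> text.split()
['These', 'are', 'tokens.', 'More', 'tokens', 'follow.']
>>> word_tokenize(text)
['These', 'are', 'tokens', '.', 'More', 'tokens', 'follow', '.']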


> then I tried to use the Python NLTK library, which had a lot of
> features I wanted, but using `word-tokenize' gives a different
> answer.
> 
> What gives?

I'm pretty sure the function isn't called "word-tokenize". That would 
mean "word subtract tokenize" in Python code. Do you mean word_tokenize?

Have you compared the output of the two and looked at how they differ? If 
there is too much output to compare by eye, you could convert to sets and 
check the set difference.
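
Untested, but something along these lines, where "sample.txt" just 
stands in for whatever text you are analysing:

from nltk import word_tokenize

text = open("sample.txt").read()
split_words = set(text.split())
nltk_words = set(word_tokenize(text))

# Tokens produced by one method but not the other.
print(sorted(split_words - nltk_words))
print(sorted(nltk_words - split_words))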

Or try reading the documentation for word_tokenize:

http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.treebank.TreebankWordTokenizer



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



