NLTK

mausg at mail.com
Wed Aug 8 17:03:58 EDT 2018


On 2018-08-07, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
> Steven D'Aprano <steve+comp.lang.python at pearwood.info> writes:
>>In natural language, words are more complicated than just space-separated 
>>units. Some languages don't use spaces as a word delimiter.
>
>   Even above, the word »units« is neither directly preceded
>   nor directly followed by a space.
>
>   In the end, one can make an arbitrary choice about where one
>   wants to place the border between syntax and morphology.
>
>   For the case of English, I can define a word to be a
>   sequence of letters (including the apostrophe) that is
>   surrounded by non-letter characters.
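
(In Python terms, that definition amounts to something like this rough
sketch with the re module; the pattern assumes ASCII letters plus the
apostrophe:)

    import re

    # A "word" is a maximal run of letters and apostrophes, i.e. any
    # stretch of text bounded by non-letter characters or string ends.
    WORD = re.compile(r"[A-Za-z']+")

    def words(text):
        return WORD.findall(text)

    print(words("Some languages don't use spaces."))
    # -> ['Some', 'languages', "don't", 'use', 'spaces']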
>
>>Recognising open compound words is difficult. "Real estate" is an open 
>>compound word, but "real cheese" and "my estate" are both two words.
>
>   This is just part of the more general problem of parsing
>   and interpreting a sentence. It is not more difficult than
>   the interpretation of other pairs of words in a sentence.
>
>>Another problem for English speakers is deciding whether to treat 
>>contractions as a single word, or to split them.
>>"don't" --> "do" "n't"
>>"they'll" --> "they" "'ll"
>
>   They are a single word by my definition. But this is just
>   the surface of the input. The input could be translated into
>   a "deep-structure" intermediate language that then splits
>   some source words into several units or joins some source
>   words into a single unit.
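
(For what it's worth, NLTK's Treebank-style tokenizer splits contractions
in roughly the way Steven describes; a quick sketch, assuming NLTK is
installed:)

    from nltk.tokenize import TreebankWordTokenizer

    # The Penn Treebank convention splits a contraction into two tokens.
    tok = TreebankWordTokenizer()
    print(tok.tokenize("They'll say we don't need it."))
    # -> something like ['They', "'ll", 'say', 'we', 'do', "n't", 'need', 'it', '.']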
>
>>Punctuation marks should either be stripped out of sentences before 
>>splitting into words, or treated as distinct tokens. We don't want 
>>"tokens" and "tokens." to be treated as distinct words, just because one 
>>happened to fall at the end of a sentence and one didn't.
>
>   Yes, but this is quite trivial compared to the problem
>   of parsing and interpreting a natural-language sentence. 
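
(Indeed, keeping punctuation as separate tokens is easy enough with re
alone; a sketch, the pattern is my own:)

    import re

    # Letters and apostrophes form one token; any other non-space
    # character becomes a token of its own.
    TOKEN = re.compile(r"[A-Za-z']+|[^A-Za-z'\s]")

    print(TOKEN.findall("We don't want tokens."))
    # -> ['We', "don't", 'want', 'tokens', '.']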
>

Thanks, all, for the replies. It seems that I do not really need NLTK;
split() will do for me. Thanks again.
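
For my simple case, plain str.split() is enough (it splits on any run of
whitespace):

    line = "the quick brown fox"
    print(line.split())
    # -> ['the', 'quick', 'brown', 'fox']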


-- 
Maus at ireland.com
Will Rant For Food


