[Tutor] how to convert between type string and token

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Mon Nov 14 19:20:13 EST 2005



On Mon, 14 Nov 2005, enas khalil wrote:

>     hello all

[program cut]

Hi Enas,

You may want to try talking with the NLTK folks about this, as what you're
dealing with is a specialized subject.  Also, have you gone through the
tokenization tutorial in:

    http://nltk.sourceforge.net/tutorial/tokenization/nochunks.html#AEN276

and have you tried to compare your program to the ones in the tutorial's
examples?



Let's look at the error message.

>   File "F:\MSC first Chapters\unigramgtag1.py", line 14, in -toplevel-
>     for tok in train_tokens: mytagger.train(tok)
>   File "C:\Python24\Lib\site-packages\nltk\tagger\__init__.py", line 324, in train
>     assert chktype(1, tagged_token, Token)
>   File "C:\Python24\Lib\site-packages\nltk\chktype.py", line 316, in chktype
>     raise TypeError(errstr)
> TypeError:
>     Argument 1 to train() must have type: Token
>       (got a str)

This error message implies that each element in your train_tokens list is
a string and not a token.


The 'train_tokens' variable gets its values in the block of code:

###########################################
train_tokens = []
xx=Token(TEXT=open('fataha2.txt').read())
WhitespaceTokenizer().tokenize(xx)
for l in xx:
    train_tokens.append(l)
###########################################


Ok.  I see something suspicious here.  The for loop:

######
for l in xx:
    train_tokens.append(l)
######

assumes that we get tokens from the 'xx' token.  Is this true?  Are you
sure you don't have to specifically say:

######
for l in xx['SUBTOKENS']:
    ...
######

The example in the tutorial explicitly does something like this to
iterate across the subtokens of a token.  What you're doing instead is
iterating across all the property names of the token, which is almost
certainly not what you want.
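If it helps to see why: a Token in that version of NLTK behaves much like
a dictionary of named properties, and iterating a dictionary gives you its
keys, not its values.  Here is a minimal sketch in plain Python, using an
ordinary dict to stand in for a Token (the text and property values are
made up for illustration):

```python
# A plain dict standing in for an old-style NLTK Token, which held
# named properties such as TEXT and SUBTOKENS.
xx = {'TEXT': 'in the beginning', 'SUBTOKENS': ['in', 'the', 'beginning']}

# Iterating the token itself yields the *property names* ...
print([l for l in xx])               # ['TEXT', 'SUBTOKENS']

# ... while iterating the SUBTOKENS property yields the subtokens,
# which is what a tagger's train() method would expect to receive.
print([l for l in xx['SUBTOKENS']])  # ['in', 'the', 'beginning']
```

So the loop `for l in xx` fills train_tokens with strings like 'TEXT',
which matches the "got a str" in your TypeError.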

