[Tutor] i want to build my own arabic training corpus data and use the NLTK to deal with

enas khalil enas_khalil at yahoo.com
Fri Aug 12 10:46:15 CEST 2005

Hi Danny,
Thanks for the help.
Tokenizing my text now works. The next step I want is to use a tagger class to tag my text with my own tags. How can I start on this?
Also, is there a specific mailing list I could go to for further NLTK help?
Many thanks

Danny Yoo <dyoo at hkn.eecs.berkeley.edu> wrote:

On Wed, 3 Aug 2005, enas khalil wrote:

> I want to build my own Arabic training corpus and use the NLTK to
> parse it and run tests on unknown data

Hi Enas,

By NLTK, I'll assume that you mean the Natural Language Toolkit at:


Have you gone through the introduction and tutorials from the NLTK web
site?


> How can I build this file and make it available for processing with
> the different NLTK classes?

Your question is a bit specialized, so we may not be the best people to
ask about this.

The part that you may want to think about is how to break a corpus into a
sequence of tokens, since tokens are primarily what the NLTK classes work
with.

This may or may not be immediately easy, depending on how much you can
take advantage of existing NLTK classes. As the NLTK documentation notes:

"""If we turn to languages other than English, segmenting words can be
even more of a challenge. For example, in Chinese orthography, characters
correspond to monosyllabic morphemes. Many morphemes are words in their
own right, but many words contain more than one morpheme; most of them
consist of two morphemes. However, there is no visual representation of
word boundaries in Chinese text."""

I don't know how Arabic works, so I'm not sure if the caveat above is
something that we need to worry about.

There are a few built-in NLTK tokenizers that break a corpus into tokens,
including a WhitespaceTokenizer and a RegexpTokenizer class, both
introduced here:


For example:

>>> import nltk.token
>>> mytext = nltk.token.Token(TEXT="hello world this is a test")
>>> mytext
<hello world this is a test>

At the moment, this is a single token. We can use a naive approach in
breaking this into words by using whitespace as our delimiter:

>>> import nltk.tokenizer
>>> nltk.tokenizer.WhitespaceTokenizer(SUBTOKENS='WORDS').tokenize(mytext)
>>> mytext
<[<hello>, <world>, <this>, <is>, <a>, <test>]>

And now our text is broken into a sequence of discrete tokens, where we
can now play with the 'subtokens' of our text:

>>> mytext['WORDS']
[<hello>, <world>, <this>, <is>, <a>, <test>]
>>> len(mytext['WORDS'])
6
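
For comparison, the same naive whitespace split can be sketched with nothing but Python's built-in str.split. This is only a stand-in to show the idea, not what the NLTK tokenizer does internally, and it produces plain strings rather than NLTK tokens:

```python
# Plain-Python stand-in for whitespace tokenization (not the NLTK API).
text = "hello world this is a test"

# str.split() with no arguments splits on any run of whitespace.
words = text.split()

print(words)
print(len(words))
```

Plain strings like these are enough for experimenting with word boundaries before committing to a particular tokenizer class.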

If Arabic follows conventions that fit closely with the assumptions of
those tokenizers, you should be in good shape. Otherwise, you'll probably
have to do some work to build your own customized tokenizers.
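
As a rough idea of what a customized tokenizer might look like, here is a minimal sketch using only the standard re module. The names DIACRITICS, WORD and tokenize are hypothetical and not part of NLTK, and the sketch assumes that words are runs of letters from the basic Arabic Unicode block and that short-vowel marks and the tatweel (elongation) character can simply be dropped. Whether those assumptions suit your corpus is something you would need to check:

```python
import re

# Hypothetical helpers for illustration; none of this is NLTK API.
# Assumption: diacritics U+064B-U+0652 and tatweel U+0640 can be discarded.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

# Assumption: a word is a run of Arabic letters (U+0621-U+064A)
# or a run of ASCII letters and digits.
WORD = re.compile(r"[\u0621-\u064A]+|[A-Za-z0-9]+")

def tokenize(text):
    """Strip diacritics and tatweel, then return the word-like runs."""
    return WORD.findall(DIACRITICS.sub("", text))

# Punctuation and whitespace are discarded; only the words are kept.
print(tokenize("hello, world: this is a test."))
```

A function like this could later be wrapped in whatever tokenizer interface your NLTK version expects.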
