Byte Offsets of Tokens, Ngrams and Sentences?

Muhammad Adeel nawabadeel at gmail.com
Fri Aug 6 05:07:32 EDT 2010


Hi,

Does anyone know how to tokenize a string in Python so that it returns
the tokens together with their byte offsets? Likewise, is there a
sentence splitter that returns sentences with their byte offsets?
Finally, I would like n-grams returned with their byte offsets as well.

Input:
This is a string.

Output:
This      0
is        5
a         8
string.   10
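
For reference, here is a minimal sketch of one way the output above could be produced with only the standard library. It assumes Python 3, whitespace-delimited tokens, UTF-8 as the encoding in which bytes are counted, and a deliberately naive rule that a sentence ends at '.', '!' or '?'. The helper names (byte_offset, tokens_with_offsets, ngrams_with_offsets, sentences_with_offsets) are hypothetical and not taken from any library.

```python
import re


def byte_offset(text, char_index, encoding="utf-8"):
    """Convert a character index into a byte offset for the given encoding."""
    # Encoding the prefix is simple but O(n) per call; fine for a sketch.
    return len(text[:char_index].encode(encoding))


def tokens_with_offsets(text, encoding="utf-8"):
    """Return (token, byte_offset) pairs for whitespace-delimited tokens."""
    return [
        (m.group(), byte_offset(text, m.start(), encoding))
        for m in re.finditer(r"\S+", text)
    ]


def ngrams_with_offsets(text, n, encoding="utf-8"):
    """Return (ngram, byte_offset) pairs; the offset is that of the first token."""
    toks = tokens_with_offsets(text, encoding)
    return [
        # Tokens are re-joined with single spaces, so the n-gram text may
        # differ from the original span if the input had other whitespace.
        (" ".join(tok for tok, _ in toks[i:i + n]), toks[i][1])
        for i in range(len(toks) - n + 1)
    ]


def sentences_with_offsets(text, encoding="utf-8"):
    """Naive splitter (assumption): a sentence runs up to '.', '!' or '?'."""
    pattern = r"[^\s.!?][^.!?]*[.!?]*"
    return [
        (m.group(), byte_offset(text, m.start(), encoding))
        for m in re.finditer(pattern, text)
    ]


if __name__ == "__main__":
    for token, offset in tokens_with_offsets("This is a string."):
        print(token, offset)
```

For plain ASCII input the byte offsets equal the character offsets; they only diverge when the text contains multi-byte characters. The sentence rule here will mis-split abbreviations such as "e.g.", so a real sentence splitter would be needed for anything beyond simple text.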


thanks



More information about the Python-list mailing list