Byte Offsets of Tokens, Ngrams and Sentences?

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Fri Aug 6 05:49:33 EDT 2010


On Fri, 06 Aug 2010 06:07:32 -0300, Muhammad Adeel <nawabadeel at gmail.com>  
wrote:

> Does anyone know how to tokenize a string in Python so that it returns
> the tokens and their byte offsets? Also, is there a sentence splitter
> that returns the sentences and their byte offsets? Finally, I need
> n-grams returned with byte offsets.
>
> Input:
> This is a string.
>
> Output:
> This  0
> is      5
> a       8
> string.   10

Like this?

py> import re
py> s = "This is a string."
py> for g in re.finditer(r"\S+", s):
...   print(g.group(), g.start())
...
This 0
is 5
a 8
string. 10
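
The same finditer idea can probably be stretched to cover the n-gram and
sentence parts of the question. What follows is only a rough sketch,
assuming Python 3; the helper names (tokens, ngrams, sentences,
byte_offset) are made up for illustration, and the sentence pattern is a
naive one that will break on abbreviations such as "e.g.". Note that the
offsets re reports are indexes into the string that was searched: for a
byte string they are byte offsets, but for a Unicode string they are
character offsets, so byte_offset() converts them under an assumed UTF-8
encoding.

```python
import re

def tokens(s):
    """Yield (token, start) for each run of non-whitespace characters."""
    for m in re.finditer(r"\S+", s):
        yield m.group(), m.start()

def ngrams(s, n):
    """Yield (ngram_text, start) for each run of n consecutive tokens."""
    spans = [m.span() for m in re.finditer(r"\S+", s)]
    for i in range(len(spans) - n + 1):
        start, end = spans[i][0], spans[i + n - 1][1]
        # Slice the original string so the inner whitespace is preserved.
        yield s[start:end], start

def sentences(s):
    """Naive splitter: a sentence runs up to '.', '!' or '?' (or end of input)."""
    for m in re.finditer(r"[^\s.!?][^.!?]*[.!?]*", s):
        yield m.group(), m.start()

def byte_offset(s, char_offset, encoding="utf-8"):
    """Convert a character offset in s to a byte offset in its encoded form."""
    return len(s[:char_offset].encode(encoding))

if __name__ == "__main__":
    text = "This is a string. And another one!"
    for sent, start in sentences(text):
        print(sent, start)
    for gram, start in ngrams(text, 2):
        print(gram, start)
```

For pure ASCII input, byte_offset() returns its argument unchanged; it
only matters once the text contains characters that take more than one
byte in the chosen encoding.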

-- 
Gabriel Genellina



