best split tokens?

MonkeeSage MonkeeSage at gmail.com
Fri Sep 8 20:15:42 EDT 2006


John Machin wrote:
> Not picking on Tim in particular; try the following with *all*
> suggestions so far:
>
> textbox = "He was wont to be alarmed/amused by answers that won't work"

Not perfect, but splitting on runs of punctuation and whitespace works for many cases:

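import re

s = "He was wont to be alarmed/amused by answers that won't work"
# Hyphen goes last so it is a literal, not the range '!' to ':' (which swallowed the apostrophe).
r = r'[()\[\]<>{}.,@#$%^&*?!:;\\/_"\s-]+'
l = [tok for tok in re.split(r, s) if tok]  # drop the empty strings re.split can leave at the ends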
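
For comparison, here is a minimal sketch of the opposite approach: rather than listing every separator, describe what a token looks like and pull those out with re.findall. The pattern and the names TOKEN and tokenize are my own assumptions for illustration, not anything proposed earlier in the thread. It keeps an apostrophe only when it sits between alphanumerics, so "won't" survives as one token while "alarmed/amused" still splits in two.

```python
import re

# Hypothetical token pattern (an assumption, not from the thread):
# a run of ASCII letters/digits, optionally continued by
# apostrophe-joined runs, so "won't" stays in one piece.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*")

def tokenize(text):
    """Return the word tokens found in text, in order of appearance."""
    return TOKEN.findall(text)

tokens = tokenize("He was wont to be alarmed/amused by answers that won't work")
print(tokens)
```

Like the split version, this is still not perfect: it is ASCII-only and will break up hyphenated words, so treat it as a starting point.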

Check out this short paper from the Natural Language Toolkit folks on
some problems / strategies for tokenization:
http://nltk.sourceforge.net/lite/doc/en/tokenize.html



