best split tokens?
MonkeeSage
MonkeeSage at gmail.com
Fri Sep 8 20:15:42 EDT 2006
John Machin wrote:
> Not picking on Tim in particular; try the following with *all*
> suggestions so far:
>
> textbox = "He was wont to be alarmed/amused by answers that won't work"
Not perfect, but would work for many cases:
import re
s = "He was wont to be alarmed/amused by answers that won't work"
# '-' goes last so it is a literal hyphen; '!-:' would be a range that includes "'"
r = r'[()\[\]<>{}.,@#$%^&*?!:;\\/_"\s-]+'
l = [t for t in re.split(r, s) if t]  # drop empty strings
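Another way to go, as a rough sketch only: match the tokens themselves instead of listing every separator. The pattern and the name token_re below are just an illustration, not something from the thread. It assumes ASCII letters only, so digits and accented words would be dropped, and it keeps an apostrophe only when it sits between letters.

```python
import re

s = "He was wont to be alarmed/amused by answers that won't work"

# Hypothetical token pattern: a run of ASCII letters, optionally followed by
# one or more groups of an apostrophe plus more letters (e.g. "won't").
token_re = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

# findall returns every non-overlapping match, so no empty strings to filter.
tokens = token_re.findall(s)
print(tokens)
```

This keeps "wont" and "won't" as separate, distinct tokens and still breaks "alarmed/amused" in two, which is the case the test sentence was aimed at.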
Check out this short paper from the Natural Language Toolkit folks on
some problems / strategies for tokenization:
http://nltk.sourceforge.net/lite/doc/en/tokenize.html