best split tokens?

John Machin sjmachin at lexicon.net
Fri Sep 8 19:02:11 EDT 2006


Tim Chase wrote:
> > py> import re
> > py> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
> > py> [s for s in rgx.split(astr) if s]
> > ['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers',
> > 'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did',
> > 'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to',
> > 'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily',
> > 'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the',
> > 'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for',
> > 'all', 'William', 'Shakespear']
>
> This regexp could be shortened to just
>
> 	rgx = re.compile(r'\W+')
>
> if you don't mind numbers included in your text (in the event you
> have things like "fatal1ty", "thing2", or "pdf2txt"), which is
> often the case...they should be considered part of the word.
>
> If that's a problem, you should be able to use
>
> 	rgx = re.compile('[^a-zA-Z]+')
>
> This is a bit Euro-centric...

I'd call it half-asscii :-)
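
To make the half-asscii point concrete, here is a minimal sketch, assuming a current Python 3 where str patterns are Unicode-aware by default (not the 2.4 discussed in this thread). The sample string and variable names are my own; `[^\W\d_]+` is one common workaround for "letters only", not something re provides as a named class.

```python
import re

# Sample text is made up for illustration: two accented words plus a
# word with a leading digit.
sample = "na\u00efve caf\u00e9 2day"

# Splitting on anything outside a-z/A-Z treats accented letters as
# separators, so the words get chopped up.
ascii_only = [s for s in re.split(r'[^a-zA-Z]+', sample) if s]

# "Word characters that are neither digits nor underscore", i.e. letters
# in any script when the pattern is a str in Python 3.
unicode_letters = re.findall(r'[^\W\d_]+', sample)

print(ascii_only)
print(unicode_letters)
```

The first list loses the accented characters entirely; the second keeps both words whole but still drops the digit from "2day", which may or may not be what you want.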

> ideally Python regexps would support
> Posix character classes, so one could use
>
> 	rgx = re.compile('[^[:alpha:]]+')
>
>
> or something of the sort...however, that fails on my Python 2.4 here.
>
> -tkc

Not picking on Tim in particular; try the following with *all*
suggestions so far:

textbox = "He was wont to be alarmed/amused by answers that won't work"
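
For anyone who wants to run that test, a rough sketch follows. The three patterns are the ones quoted above; the `tokens` helper and the `patterns` dict are names of my own, added only to drive the comparison.

```python
import re

textbox = "He was wont to be alarmed/amused by answers that won't work"

# The three patterns suggested earlier in the thread.
patterns = {
    "punctuation list": r'(?:\s+)|[()\[\].,?;-]+',
    "non-word": r'\W+',
    "non-ASCII-letter": r'[^a-zA-Z]+',
}

def tokens(pattern, text):
    """Split text on pattern and drop the empty strings."""
    return [s for s in re.split(pattern, text) if s]

for name, pattern in patterns.items():
    print(name, tokens(pattern, textbox))
```

The first pattern leaves "alarmed/amused" glued together as one token, because "/" is not in its punctuation list. The other two separate those words but also break "won't" into "won" and "t", so it can no longer be told apart from a mangled "wont".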

The short answer to the OP's question is that there is no short answer.
This blog note (and the papers it cites) may help ...

http://blogs.msdn.com/correcteurorthographiqueoffice/archive/2005/12/07/500807.aspx

Cheers,
John



