best split tokens?

James Stroud jstroud at mbi.ucla.edu
Fri Sep 8 16:57:54 EDT 2006


Jay wrote:
> Let's say, for instance, that one was programming a spell checker or
> some other function where the contents of a string from a text-editor's
> text box needed to be split so that the resulting array has each word
> as an element.  Is there a shortcut to do this and, if not, what's the
> best and most efficient token group for the split function to achieve
> this?
> 

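I'm sure this is not perfect (the character class leaves out '!', ':' and
quotation marks, for example), but it gives the general idea: split on runs
of whitespace or common punctuation, then discard the empty strings.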

py> import re
py> rgx = re.compile(r'\s+|[()\[\].,?;-]+')
py> print astr

              Four score and seven years ago, our
              forefathers, who art in heaven (hallowed be their names),
              did forthwith declare that all men are created
              to shed their mortal coils and to be given daily
              bread, even in the best of times and the worst of times.

              With liberty and justice for all.

              -William Shakespear

py> [s for s in rgx.split(astr) if s]
['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers', 
'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did', 
'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to', 
'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily', 
'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the', 
'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for', 
'all', 'William', 'Shakespear']
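
An alternative worth considering for a spell checker is to match the words
themselves rather than list every separator. The sketch below is only an
illustration, not the one right answer: the names WORD_RE and split_words are
made up here, and it assumes a "word" is a run of ASCII letters that may
contain internal apostrophes or hyphens.

```python
import re

# Assumption: a word is a run of ASCII letters, optionally joined by
# single internal apostrophes or hyphens ("don't", "well-known").
# WORD_RE and split_words are hypothetical names for this sketch.
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def split_words(text):
    """Return the words found in text, in order of appearance."""
    # findall returns every non-overlapping match, so whitespace and
    # punctuation are skipped and no empty strings need filtering out.
    return WORD_RE.findall(text)

sample = "Four score and seven years ago, our forefathers (hallowed be their names)."
print(split_words(sample))
```

Matching words this way means unlisted punctuation such as '!' or ':' is
dropped without having to be named, at the cost of ignoring digits and
non-ASCII letters unless the pattern is widened.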


James

-- 
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/


