best split tokens?

Fri Sep 8 22:02:54 EDT 2006

>> 	rgx = re.compile(r'\W+')
>>
>> if you don't mind numbers included in your text (in the event you
>> have things like "fatal1ty", "thing2", or "pdf2txt") which is
>> often the case...they should be considered part of the word.
>>
>> If that's a problem, you should be able to use
>>
>> 	rgx = re.compile('[^a-zA-Z]+')
>>
>> This is a bit Euro-centric...
> 
> I'd call it half-asscii :-)

groan... :)

Given the link you provided, I correct my statement to 
"Anglo-centric", as there are clearly oddball cases in languages 
such as French.
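
To make the difference between the two splitting patterns concrete, 
here is a rough sketch. It assumes Python 3, where str patterns are 
Unicode-aware by default, and the sample string is made up purely for 
illustration:

```python
import re

# Made-up sample: digits inside "words" plus one accented letter.
text = "fatal1ty ran pdf2txt on a café menu"

# \W+ splits on runs of non-word characters, so digits (and, for str
# patterns in Python 3, accented letters) stay inside the token.
nonword_tokens = re.compile(r"\W+").split(text)

# [^a-zA-Z]+ splits on anything that is not an ASCII letter, so digits
# and the accented letter both act as separators.
letter_tokens = re.compile(r"[^a-zA-Z]+").split(text)

print(nonword_tokens)
print(letter_tokens)
```

The second pattern chops "café" down to "caf", which is the sort of 
oddball case I had in mind.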

> textbox = "He was wont to be alarmed/amused by answers that won't work"

Well, one could do something like

 >>> import re
 >>> s = "He was wont to be alarmed/amused by answers that won't work"
 >>> s2 = "The two-faced liar--a real joker--can't tell the truth"
 >>> r = re.compile("(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")
 >>> r.findall(s), r.findall(s2)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by', 
'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar', 
'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'])

which parses your example the way I would want it to be parsed, 
and breaks the strange string I came up with (to try similar 
cases) into "words" the way I would expect...

I had a hard time comin' up with any words I'd want to call 
"words" where the additional non-word glyph (apostrophe, dash, 
etc) wasn't 'round the middle of the word. :)
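
That said, a quick sketch of where the pattern gives way: glyphs at 
the edge of a word get dropped, and a single letter between two 
hyphens splits the word. The test string and the "chained" variant 
below are my own assumptions for illustration, not something from 
earlier in the thread:

```python
import re

# Pattern from the post: an apostrophe or hyphen only counts when a
# letter sits directly on both sides of it.
original = re.compile(r"(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")

# Hypothetical variant: runs of letters joined by single apostrophes
# or hyphens, so one-letter segments can chain.
chained = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)*")

# Made-up test string with edge-of-word apostrophes and a chained hyphen.
tricky = "I was comin' 'round with rock-n-roll"

original_words = original.findall(tricky)
chained_words = chained.findall(tricky)

print(original_words)
print(chained_words)
```

Both patterns strip the apostrophes from "comin'" and "'round". The 
original splits "rock-n-roll" into "rock-n" and "roll", while the 
variant keeps it whole.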

Any more crazy examples? :)

-tkc