best split tokens?

Tim Chase python.list at tim.thechases.com
Sat Sep 9 11:53:36 EDT 2006


>> Any more crazy examples? :)
> 
> 'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?

I said "crazy"...not "pathological" :)

If one really wants such a case, one has to omit the standard 
practice of nesting quotes:

	John replied "Dad told me 'you can't go' but let Judy"

However, if you don't have such situations and want to make 
'enry and 'orace 'appy, you can change the regexp to


 >>> import re
 >>> s="He was wont to be alarmed/amused by answers that won't work"
 >>> s2="The two-faced liar--a real joker--can't tell the truth"
 >>> s3="'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?"

 >>> r = re.compile("(?:(?:[a-zA-Z][-'])|(?:[-'][a-zA-Z])|[a-zA-Z])+")

Note that it will choke on double-dashes, gluing the neighboring 
words together:

 >>> r.findall(s), r.findall(s2), r.findall(s3)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by', 
'answers', 'that', "won't", 'work'], ['The', 'two-faced', 
'liar--a', 'real', "joker--can't", 'tell', 'the', 'truth'], 
["'ey", "'alf", 'a', 'mo', 'wot', 'about', 'when', "'enry", "'n", 
"'orace", 'drop', 'their', 'aitches'])

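To see why the double-dash case sticks together, here is a rough 
sketch that splits a word into the chunks each alternative of the 
pattern consumes. The names "piece" and "pieces" are made up for this 
illustration, and the chunking only mirrors the real match for 
strings the full pattern matches in one go, such as "liar--a":

```python
import re

# Same pattern as above: letter+[-'], [-']+letter, or a bare letter, repeated.
r = re.compile("(?:(?:[a-zA-Z][-'])|(?:[-'][a-zA-Z])|[a-zA-Z])+")

# Hypothetical helper pattern: one alternative at a time, same order.
piece = re.compile("[a-zA-Z][-']|[-'][a-zA-Z]|[a-zA-Z]")

def pieces(word):
    """Return the chunks the alternation consumes, left to right."""
    return piece.findall(word)

print(r.findall("liar--a"))   # the whole thing comes back as one token
print(pieces("liar--a"))      # chunks: l, i, a, r-, -a
```

The first dash is eaten as the tail of "r-" and the second as the 
head of "-a", so nothing ever forces a break between the two words.
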
Or, to allow only infix dashes but still allow apostrophes 
anywhere in the word, including the front or back, one could use:

 >>> r = re.compile("(?:(?:[a-zA-Z]')|(?:'[a-zA-Z])|(?:[a-zA-Z]-[a-zA-Z])|[a-zA-Z])+")
 >>> r.findall(s), r.findall(s2), r.findall(s3)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by', 
'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar', 
'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'], ["'ey", 
"'alf", 'a', 'mo', 'wot', 'about', 'when', "'enry", "'n", 
"'orace", 'drop', 'their', 'aitches'])

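For readability, the combined pattern can also be spelled out with 
re.VERBOSE. This is only a sketch of the same regexp with one 
alternative per line; the name "word" is an illustrative choice:

```python
import re

# The combined pattern, one alternative per line.
# (The name `word` is illustrative, not something from the thread.)
word = re.compile(r"""
    (?:
        [a-zA-Z]'            # letter followed by an apostrophe
      | '[a-zA-Z]            # apostrophe followed by a letter
      | [a-zA-Z]-[a-zA-Z]    # a dash only between two letters
      | [a-zA-Z]             # a bare letter
    )+
""", re.VERBOSE)

s2 = "The two-faced liar--a real joker--can't tell the truth"
print(word.findall(s2))
```

Because the dash alternative consumes the letter on each side, a 
single letter sitting between two dashes (as in "a-b-c") ends the 
match early, which may or may not matter for your text.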

Now your spell-checker has to have the "dropped initial or 
terminal letter" locale... :)

-tkc






