[Tutor] regular expression question

D Elliott debe at comp.leeds.ac.uk
Thu Apr 7 14:01:39 CEST 2005


I wonder if anyone can help me with an RE. I also wonder if there is an RE 
mailing list anywhere - I haven't managed to find one.

I'm trying to use this regular expression to delete particular strings 
from a file before tokenising it.

I want to delete every string that contains a full stop (period) in the 
middle of a word, that is, not at the beginning or end of the word and not 
followed by a closing bracket. This should remove file names (e.g. 
fileX.doc) and websites (when www/http is not given), but not bare file 
extensions (e.g. "this is in .jpg format"). I also don't want to delete 
the last word of a sentence just because it precedes a full stop, or a 
word whose full stop is followed by a closing bracket.

fullstopRe = re.compile (r'\S+\.[^)}]]+')

I've also tried 
fullstopRe = re.compile (r'\S+[.][^)}]]+')


I understand this to mean: one or more non-whitespace characters, a full 
stop (made literal either with the backslash or by putting it in a 
character class), then one or more characters that are not any kind of 
closing bracket.

If I forget about the bracket exceptions, the following works:
fullstopRe = re.compile (r'\S+[.]\S+')

But the expressions with the bracket exception above do not delete e.g. 
bbc.co.uk
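
A minimal sketch that reproduces the behaviour described above. The sample 
text and the variable names are made up for illustration and are not part 
of the original script. One thing it makes visible: in the bracketed 
patterns the character class appears to end at the first ']', so '[^)}]]+' 
is read as "one character that is not ')' or '}', followed by one or more 
literal ']' characters".

```python
import re

# Hypothetical sample text, chosen only to exercise the cases in the post.
sample = "see bbc.co.uk or fileX.doc (saved as .jpg). The end."

# Pattern as posted: the class [^)}] closes at the first ']', so the
# trailing ']+' requires one or more literal ']' characters.
as_posted = re.compile(r'\S+\.[^)}]]+')

# Simpler pattern from the post, without the bracket exception.
no_brackets = re.compile(r'\S+[.]\S+')

posted_hits = as_posted.findall(sample)      # nothing: the sample has no ']'
simple_hits = no_brackets.findall(sample)    # file name and website are found
literal_hits = as_posted.findall("a.b]]")    # matches only when ']' follows

print(posted_hits)
print(simple_hits)
print(literal_hits)
```

Running this on the sample shows the posted pattern finding nothing, while 
it does match a string such as 'a.b]]' that happens to end in closing 
square brackets.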

Can anyone enlighten me?
Thanks
Debbie


-- 
***************************************************
Debbie Elliott
Computer Vision and Language Research Group,
School of Computing,
University of Leeds,
Leeds LS2 9JT
United Kingdom.
Tel: 0113 3437288
Email: debe at comp.leeds.ac.uk
***************************************************

