[Tutor] about a program

Bob Gailer bgailer@alum.rpi.edu
Wed Mar 19 13:08:01 2003



At 06:05 AM 3/19/2003 -0800, Abdirizak abdi wrote:
>buf = re.compile("[a-zA-Z]+\s+")
>
>this was to match the following string:
>
>str = 'Data sparseness is an inherent problem in statistical methods for 
>natural language processing.'
>
>Result: ['Data', 'sparseness', 'is', 'an', 'inherent', 'problem', 'in',
>'statistical', 'methods', 'for', 'natural', 'language']
>
>the result is that it gets all the tokens except the last one, 
>'processing' plus the dot (the full stop at the end)

The problem is that \s+ requires whitespace after each word, and there is no 
whitespace after 'processing'. You should also put the pattern in a raw 
string; otherwise some backslash sequences (\b, for example) will be 
interpreted as special characters.
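
A minimal sketch of the behaviour described above, for anyone who wants to reproduce it. The variable name text is my own choice (the original post used str, which shadows the built-in type), and the pattern is written as a raw string.

```python
import re

text = ("Data sparseness is an inherent problem in statistical methods for "
        "natural language processing.")

# The original pattern: every word must be followed by whitespace.
buf = re.compile(r"[a-zA-Z]+\s+")
tokens = buf.findall(text)

# Each match keeps its trailing whitespace, and 'processing' is never
# matched because it is followed by '.', not by whitespace.
print(tokens[-1] == "language ")   # True
print(len(tokens))                 # 12

# Why the raw string matters: without r"", "\b" is a backspace character.
print(len("\b"), len(r"\b"))       # 1 2
```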

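One solution is to specify whitespace OR end of string: 
buf = re.compile(r"[a-zA-Z]+(?:\s+|$)"). \s+|$ says whitespace OR end of string. 
I put that in () because of the precedence of |, and added ?: to make it a 
"non-grouping version of regular parentheses." Note that in your string 
'processing' is followed by a full stop, not by the end of the string, so 
this pattern alone still skips it unless you strip or allow for the 
punctuation.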
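
Here is a small sketch of the whitespace-or-end pattern. The two sample strings are shortened versions of the original sentence that I made up for illustration; one ends in a word and the other in a full stop. The last lines show why the ?: is needed when the pattern is used with findall.

```python
import re

buf = re.compile(r"[a-zA-Z]+(?:\s+|$)")

no_stop = "natural language processing"
with_stop = "natural language processing."

# The final word is picked up when it really is at the end of the string.
print(buf.findall(no_stop))     # ['natural ', 'language ', 'processing']

# A trailing full stop is neither whitespace nor end of string,
# so the last word is still skipped.
print(buf.findall(with_stop))   # ['natural ', 'language ']

# With ordinary (capturing) parentheses, findall returns only the group,
# i.e. the whitespace, instead of the whole match.
capturing = re.compile(r"[a-zA-Z]+(\s+|$)")
print(capturing.findall(no_stop))   # [' ', ' ', '']
```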

A completely different approach is to use \b to match start or end of word: 
buf = re.compile(r"\b[a-zA-Z]+\b").
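
The \b version can be tried as follows. This is only a sketch; text is my own variable name for the sentence from the original post.

```python
import re

text = ("Data sparseness is an inherent problem in statistical methods for "
        "natural language processing.")

# \b matches the empty position between a word character and a
# non-word character, so the full stop after 'processing' is no obstacle.
buf = re.compile(r"\b[a-zA-Z]+\b")
words = buf.findall(text)

print(len(words))    # 13
print(words[-1])     # processing
```

Unlike the \s+ patterns, the matches here carry no trailing whitespace.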

If you just want to create a list of space-separated words, use str.split().
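
For comparison, a sketch of the split() approach. Again text is my own name for the string, and the set of punctuation characters passed to strip() is an assumption chosen for this example, not something from the original post.

```python
text = ("Data sparseness is an inherent problem in statistical methods for "
        "natural language processing.")

# split() with no argument splits on any run of whitespace.
words = text.split()
print(len(words))    # 13
print(words[-1])     # processing.

# The full stop stays attached to the last word; strip it if unwanted.
cleaned = [w.strip(".,;:!?") for w in words]
print(cleaned[-1])   # processing
```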

Bob Gailer
PLEASE NOTE NEW EMAIL ADDRESS bgailer@alum.rpi.edu
303 442 2625
