[Tutor] Textparsing, a better way?

Michael Janssen Janssen@rz.uni-frankfurt.de
Tue May 6 15:10:01 2003


On Tue, 6 May 2003, Zak Arntson wrote:

> I'm working on my text adventure text parser (think Zork), and have
> created the following code to turn a sentence into a list of words and
> punctuation. E.g.: "Sailor, throw me the bottle. Get bottle" ->
> ['sailor',',','throw','me','the','bottle','.','get','bottle']
>
> Here's my current code, but I can't help thinking there are areas for
> improvement. Any suggestions/comments? I couldn't find a way for a regular
> expression to create a list of all of its matches. I'd love to do

The expression is OK, but you need to use re.findall instead of re.search
(and drop the parentheses: with more than one group, findall returns
tuples instead of strings). The disadvantage of findall (though it is
enough for many uses) is that it only returns a list of the matched
strings, not match objects with their nice features.
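
To make the difference concrete, here is a small sketch. The sample text
and the ungrouped pattern are my own choices for illustration, and
re.finditer is an addition the thread does not discuss; it is one way to
get match objects back if you need them.

```python
import re

text = "Sailor, throw me the bottle."

# re.search stops at the first match and returns a single match object.
first = re.search(r"\w+|[.,:;]", text)
print(first.group())

# re.findall returns every non-overlapping match as a plain string.
tokens = re.findall(r"\w+|[.,:;]", text)
print(tokens)

# re.finditer yields match objects, so positions are available as well.
spans = [(m.group(), m.start()) for m in re.finditer(r"\w+|[.,:;]", text)]
print(spans)

# With two capturing groups, findall returns one tuple per match,
# with an empty string for the group that did not take part.
grouped = re.findall(r"(\w+)|([.,:;])", "a,")
print(grouped)
```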

> something like re.compile ('(\w+)|([\.,:;])') and have that drive
> something to make a list of all occurring blocks of that reg exp.
>
> ###
> def textparse (rawSentence):
>     sentence = []
>
>     reWord = re.compile (r'([\.,:;])')

"compile" is only an optimisation, worthwhile when you compile once and
reuse the pattern for many operations. You can easily do that in the
global namespace, so that all calls of "textparse" share it (OTOH the
global namespace is not the best place to put all kinds of stuff). Or you
can simply use the re.search([non-precompiled-expression], [string])
syntax.
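
A rough sketch of the two options, assuming a words-or-punctuation
pattern. TOKEN_RE, textparse_compiled and textparse_inline are made-up
names for this example, and I use findall rather than search because that
is what the tokenizer ends up needing.

```python
import re

# Option 1: compile once at module level and reuse it in every call.
# TOKEN_RE is a hypothetical name, not taken from the original code.
TOKEN_RE = re.compile(r"\w+|[.,:;]")

def textparse_compiled(raw_sentence):
    return TOKEN_RE.findall(raw_sentence.lower())

# Option 2: pass the pattern string to the module-level function.
# The re module keeps its own cache of recently used patterns, so this
# is usually fast enough for a handful of expressions.
def textparse_inline(raw_sentence):
    return re.findall(r"\w+|[.,:;]", raw_sentence.lower())

print(textparse_compiled("Get bottle"))
print(textparse_inline("Get bottle"))
```

Both functions return the same list; the choice is mostly about whether
you want the pattern to live next to the function or inside it.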

>     # first get rid of whitespace
>     for chunk in re.compile (r'\s').split (rawSentence.strip ().lower ()):
>         # now separate punctuation from words
>         for word in reWord.split (chunk):
>             if word:
>                 sentence.append (word)

I don't understand everything you do here (it seems you are trying to
achieve the behaviour of findall, which you didn't know about, via split -
interesting), but if reWord is compiled from a words-or-punctuation
expression such as r'\w+|[.,:;]' instead of the punctuation-only one, then:
sentence = reWord.findall(rawSentence)

is possibly equivalent. Or maybe:
sentence = reWord.findall(rawSentence.lower())

Whitespace isn't in the results, because reWord matches no whitespace.
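
To check the "possibly equivalent" part, here is a sketch that puts both
versions side by side. The function names are hypothetical, and the
findall pattern is my assumption about what the combined expression
should look like.

```python
import re

def textparse_split(raw_sentence):
    # The approach from the original post: split on whitespace first,
    # then split each chunk on punctuation, keeping the punctuation.
    punct = re.compile(r"([.,:;])")
    sentence = []
    for chunk in re.split(r"\s", raw_sentence.strip().lower()):
        for word in punct.split(chunk):
            if word:  # split leaves empty strings around delimiters
                sentence.append(word)
    return sentence

def textparse_findall(raw_sentence):
    # The findall approach: collect words and punctuation directly.
    return re.findall(r"\w+|[.,:;]", raw_sentence.lower())

example = "Sailor, throw me the bottle. Get bottle"
print(textparse_split(example))
print(textparse_split(example) == textparse_findall(example))

# The two differ on characters that are neither word characters nor
# listed punctuation: split keeps them, findall silently drops them.
print(textparse_split("don't"))
print(textparse_findall("don't"))
```

For the example sentence both give the same list. They are not identical
in general, though: an apostrophe stays inside the word with the split
version and vanishes with findall, so which one you want depends on how
the game should treat such input.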

Michael
>
>     return sentence
> ###
>
> --
> Zak Arntson
> www.harlekin-maus.com - Games - Lots of 'em