Regular Expression AND matching

Win Carus wcarus at comcast.net
Thu Mar 25 10:48:52 EST 2004


michael at foord.net (Fuzzyman) wrote in message news:<8089854e.0403220021.db39215 at posting.google.com>...
> "Robert Brewer" <fumanchu at amor.org> wrote in message news:<mailman.194.1079807509.742.python-list at python.org>...
> > Fuzzyman wrote:
> > > Jeff Epler <jepler at unpythonic.net> wrote in message 
> > > news:<mailman.161.1079716498.742.python-list at python.org>...
> > > > Regular expressions are not a good tool for this purpose.
> > > 
> > > Hmm... I'm not sure if I've been helped or not :-)
> > > Thanks anyway.....
> > > 
> > > Odd that you can't do this easily with regular expressions - I suppose
> > > it doesn't compile down to a neat test.... but then it's hardly a
> > > complex search... OTOH I have *no idea* how regular expressions
> > > actually work (and no need to find out)...
> > 
> > From one Fu.*man to another ;) you do have a need to find out, even if
> > you don't recognize it. Start with A.M. Kuchling's excellent,
> > Python-based tutorial at: http://www.amk.ca/python/howto/regex/
> > 
> > At the least, you should understand why a regex is not an all-in-one
> > solution to your issue. It basically comes down to the fact that a regex
> > is geared to do its analysis in a single pass over your text. As it
> > finds partial matches, it may backtrack to try to find a complete match,
> > but in general, it moves forward. Therefore, if you want to find three
> > words in a *declared* order in your text, a single regex can do it
> > easily. If you want to find three words in *any* order, the simplest
> > solution using regexes is to perform three separate searches. There are
> > ways to get around this within a regex, but they're neither as simple
> > nor as maintainable as letting Python do the iteration:
> > 
> > >>> import re
> > >>> text = 'Some aa text cc with bb search terms.'
> > >>> search_terms = ['aa', 'bb', 'cc']
> > >>> [re.findall(re.escape(word), text) for word in search_terms]
> > [['aa'], ['bb'], ['cc']]
> > 
> > or, for your case:
> > 
> > >>> def has_all_terms(content, terms):
> > ... 	for word in terms:
> > ... 		if not re.search(re.escape(word), content):
> > ... 			return False
> > ... 	return True
> > ... 
> > >>> has_all_terms(text, search_terms)
> > True
> > >>> has_all_terms('A is for aardvark.', search_terms)
> > False
> > 
> 
> Thanks for your reply, it was both helpful and interesting.
> Unfortunately it only confirmed my suspicion that what I wanted to do
> wasn't possible :-(
> 
> The database 'module' I'm using (which is basically fine for my other
> purposes) does searches through its records using regular
> expressions. A full search through 1800 records (each record can be a
> couple of k of text) takes about 0.2 seconds - which is fine.... but
> doing 4 or 5 searches and comparing results *isn't* (0.2 seconds delay
> is ok - 0.8-1.0 seconds isn't).... So I'm currently just searching for
> the longest word - KirbyBase then returns the full *text* of each song
> containing that word... and I'm just checking each song to see if it
> has the other words :-)
> 
> This works fine, isn't noticeably slow, but isn't as elegant as I'd
> hoped...
> 
> Regards,
> 
> 
> Fuzzy
> 
> http://www.voidspace.org.uk/atlantibots/pythonutils.html
> 
> > 
> > HTCYTIRABM!
> > 
> > Robert Brewer
> > MIS
> > Amor Ministries
> > fumanchu at amor.org

Hi Fuzzyman.

The heuristic solution you've come up with (i.e., searching using the
longest term and then filtering/ranking the results against the
remaining terms) is in fact quite well founded in practice.  Many
years ago, Miller and Newman determined that there are two
length-frequency statistical patterns in written English, one for
content words and another for function words. The content-word pattern
is regular and demonstrates a very strong inverse relationship between
word length and word frequency (i.e., the longer the word, the less
frequent). There are, of course, many exceptions, but the pattern is
very strong, and there are interesting information-theoretic arguments
for why this makes sense (if you accept that human languages operate
to minimize the costs of communication). In the absence of statistics
collected on your specific body of text, this is a very useful crutch.

Miller, George A. and Edwin B. Newman. 1958. "Tests of a Statistical
Explanation of the Rank-Frequency Relation for Words in Written
English." American Journal of Psychology (71): 209-258.

Miller, George A., E. B. Newman, and E. A. Friedman. 1958.
"Length-Frequency Statistics for Written English." Information and
Control (1): 370-389.
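
For what it's worth, the heuristic fits in a few lines. Here is a
rough sketch (the function name and the 'records' list of song texts
are mine, not anything from KirbyBase; you would substitute the
module's own search call for the first pass):

import re

def longest_first_search(records, terms):
    # Lead with the longest term: per Miller and Newman it is
    # statistically likely to be the rarest, so it prunes the most
    # records in the single expensive pass.
    ordered = sorted(terms, key=len, reverse=True)
    first = re.compile(re.escape(ordered[0]))
    candidates = [r for r in records if first.search(r)]
    # Plain substring tests on the much smaller candidate set.
    results = []
    for record in candidates:
        for term in ordered[1:]:
            if term not in record:
                break
        else:
            results.append(record)
    return results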

On the other hand, it is also possible to (a) identify the full set of
candidate search terms available (this would allow you to fail quickly
on search terms that can't be found); and (b) determine their
frequencies (this would allow you to search for terms in inverse
frequency order, if that's what you really want). Creating an inverted
index is, of course, a possibility, but brute-force text search is
often fast enough, and without an index you avoid the complexities of
creating and maintaining the index itself.  Wu and Manber, the
developers of the 'agrep' tool (which uses some really fancy and
powerful partial matching techniques and is used by the Glimpse and
Webglimpse search engines), have argued precisely this point.
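
Here is a rough sketch of that idea, assuming you are willing to make
one counting pass over the corpus up front. (One approximation to be
aware of: the counts are per whitespace-delimited token, while the
matching is by substring, so the fail-fast test in (a) is not exact.)

def build_frequencies(records):
    # One counting pass over the corpus; crude whitespace
    # tokenization, but enough to rank the search terms.
    freq = {}
    for record in records:
        for word in record.split():
            freq[word] = freq.get(word, 0) + 1
    return freq

def search_rarest_first(records, terms, freq):
    # (a) Fail fast: a term that occurs nowhere in the corpus
    #     means no record can possibly match.
    for term in terms:
        if term not in freq:
            return []
    # (b) Test terms rarest-first, so non-matching records are
    #     rejected as early as possible.
    ordered = sorted(terms, key=lambda t: freq[t])
    results = []
    for record in records:
        for term in ordered:
            if term not in record:
                break
        else:
            results.append(record)
    return results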

Sun Wu and Udi Manber: Fast Text Searching Allowing Errors.
Communications of the ACM, Vol. 35, No. 10, October 1992, pp. 83-91.

Sun Wu and Udi Manber: AGREP - A Fast Approximate Pattern-Matching
Tool. Proceedings of the Winter 1992 USENIX Conference, San Francisco,
CA, January 20-24, 1992, pp. 153-162.

Udi Manber and Sun Wu: Approximate Pattern Matching. BYTE,
November 1992, pp. 281-292.

There is a Python port of agrep available as a module called 'agrepy'
at the following site:

http://www.bio.cam.ac.uk/~mw263/pyagrep.html

As noted in an earlier post, regexes alone can't do what you want,
but there are other techniques that might; for example, Aho-Corasick
(and related) automata. An Aho-Corasick automaton 'parallelizes'
pattern matching so that multiple strings can be matched against a
text in a single pass, and it is easy to implement in both its
deterministic and non-deterministic forms in Python.

Aho, A., and Corasick, M.: Efficient String Matching: An Aid to
Bibliographic Search. Communications of the ACM, Vol. 18, No. 6,
June 1975, pp. 333-340.

http://www-sr.informatik.uni-tuebingen.de/~buehler/AC/html/Automaton.html
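
To make that concrete, here is a small self-contained sketch of the
deterministic construction (just the textbook algorithm, not any
particular library):

from collections import deque

class AhoCorasick:
    # Matches many patterns against a text in one left-to-right pass.

    def __init__(self, patterns):
        # Build a trie: goto[s] maps a character to the next state;
        # out[s] lists the patterns that end at state s.
        self.goto, self.out, self.fail = [{}], [[]], [0]
        for pat in patterns:
            s = 0
            for ch in pat:
                if ch not in self.goto[s]:
                    self.goto.append({})
                    self.out.append([])
                    self.fail.append(0)
                    self.goto[s][ch] = len(self.goto) - 1
                s = self.goto[s][ch]
            self.out[s].append(pat)
        # Breadth-first pass to set failure links: fail[s] is the
        # state for the longest proper suffix of s's prefix that is
        # itself a prefix of some pattern.
        queue = deque(self.goto[0].values())
        while queue:
            s = queue.popleft()
            for ch, t in self.goto[s].items():
                queue.append(t)
                f = self.fail[s]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[t] = self.goto[f].get(ch, 0)
                # Any match ending at fail[t] also ends at t.
                self.out[t].extend(self.out[self.fail[t]])

    def search(self, text):
        # Returns (start_position, pattern) for every occurrence.
        s, hits = 0, []
        for i, ch in enumerate(text):
            while s and ch not in self.goto[s]:
                s = self.fail[s]
            s = self.goto[s].get(ch, 0)
            for pat in self.out[s]:
                hits.append((i - len(pat) + 1, pat))
        return hits

For your AND query you would build the automaton from the search terms
and check that every term turned up at least once:

>>> ac = AhoCorasick(['aa', 'bb', 'cc'])
>>> text = 'Some aa text cc with bb search terms.'
>>> set(pat for pos, pat in ac.search(text)) == set(['aa', 'bb', 'cc'])
True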

Good luck,

Win


