Looking for a regexp generator based on a set of known strings representative of a string set

vbfoobar at gmail.com vbfoobar at gmail.com
Sat Sep 9 02:59:26 EDT 2006


James Stroud wrote:

> vbfoobar at gmail.com wrote:
> > Hello
> >
> > I am looking for python code that takes as input a list of strings
> > (mostly similar, but not necessarily, and rather short: say no longer
> > than 50 chars) and that computes and outputs the python regular
> > expression that matches these string values (not necessarily strictly;
> > perhaps the code is able to determine patterns, i.e. families of
> > strings...).
> >
> > Thanks for any idea
> >
>
> I'm not sure about your application, but Genomicists and Proteomicists have
> found that Hidden Markov Models can be very powerful for developing
> pattern models. Perhaps have a look at "Biological Sequence Analysis" by
> Durbin et al.
>
> Also, a very cool regex based algorithm was developed at IBM:
>
>     http://cbcsrv.watson.ibm.com/Tspd.html

Indeed, this seems cool! Thanks for the suggestion.

I have tried their online Text-symbol Pattern Discovery
with these input values:

cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3

>
> But I think HMMs are the way to go. Check out HMMER at WUSTL by Sean
> Eddy and colleagues:
>
>      http://hmmer.janelia.org/
>
>      http://selab.janelia.org/people/eddys/

I will look at that more closely, but at first glance
it seems more specialized and less accessible
to mere mortals...
>
> James

Thanks. This may help me.

I am also still looking for other ideas, mainly
because I want code that I can modify myself,
and it has to be pure Python code.
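
One possible pure-Python starting point that is easy to modify is a prefix
tree (trie) turned into a regular expression. The sketch below is only an
assumption about what such code could look like: it matches strictly the
input strings and nothing else, so it does no pattern discovery at all.
`build_trie` and `trie_to_regex` are hypothetical names, not part of any
library.

```python
import re

samples = [
    "cpkg-30000", "cpkg-31008", "cpkg-3000A", "cpkg-30006",
    "nsug-300AB", "nsug-300A2", "cpdg-30001", "nsug-300A3",
]

def build_trie(strings):
    """Store the strings in a nested dict; the key "" marks the end of a string."""
    trie = {}
    for s in strings:
        node = trie
        for ch in s:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Recursively turn a trie node into a regex for all suffixes below it."""
    ends_here = "" in node
    branches = [re.escape(ch) + trie_to_regex(node[ch])
                for ch in sorted(node) if ch != ""]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    # Several alternatives, or an optional tail, need a non-capturing group.
    body = "(?:" + "|".join(branches) + ")"
    return body + "?" if ends_here else body

strict_pattern = trie_to_regex(build_trie(samples))
print(strict_pattern)
print(all(re.fullmatch(strict_pattern, s) for s in samples))
```

Shared prefixes such as "cpkg-3" are factored out once, which keeps the
expression shorter than a plain alternation of all the inputs.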


>
> --
> James Stroud
> UCLA-DOE Institute for Genomics and Proteomics
> Box 951570
> Los Angeles, CA 90095
> 
> http://www.jamesstroud.com/



