Looking for a regexp generator based on a set of known string representative of a string set

James Stroud jstroud at mbi.ucla.edu
Sat Sep 9 19:08:35 EDT 2006


vbfoobar at gmail.com wrote:
> I have tried their online Text-symbol Pattern Discovery
> with these input values:
> 
> cpkg-30000
> cpkg-31008
> cpkg-3000A
> cpkg-30006
> nsug-300AB
> nsug-300A2
> cpdg-30001
> nsug-300A3

Well, in the realm of sequence analysis, it is trivial to devise a regex 
for these values because they are already aligned and of fixed length. 
This is a couple of more lines than it needs to be, only so its easier 
to follow the logic. This uses throw-away groups to avoid bracketed 
sets, becuase you have dashes in your items. You might need to tweak the 
following code if you have characters special to regex in your sequences 
or if you want to use bracketed sets. The place to do this is in the 
_joiner() function.


values = """
          cpkg-30000
          cpkg-31008
          cpkg-3000A
          cpkg-30006
          nsug-300AB
          nsug-300A2
          cpdg-30001
          nsug-300A3
          """.split()

# python 2.5 has new list comp features to shorten this test, but
# the resulting list comp can begin to look ugly if the alternatives
# are complicated
def _joiner(c):
   if len(c) == 1:
     # will raise KeyError for empty column
     return c.pop()
   else:
     return "(?:%s)" % '|'.join(c)

columns = [set(c) for c in zip(*values)]
col_strs = [_joiner(c) for c in columns]
rgx_str = "".join(col_strs)
exact_rgx_str = "^%s$" % rgx_str

# '(?:c|n)(?:p|s)(?:k|u|d)g-3(?:1|0)0(?:A|0)(?:A|B|1|0|3|2|6|8)'
print rgx_str


But, if you get much more complicated that this, you will definitely 
want to check out hmmer, especially if you can align your sequences.

James


-- 
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/



More information about the Python-list mailing list