Looking for a regexp generator based on a set of known string representative of a string set
James Stroud
jstroud at mbi.ucla.edu
Sat Sep 9 19:08:35 EDT 2006
vbfoobar at gmail.com wrote:
> I have tried their online Text-symbol Pattern Discovery
> with these input values:
>
> cpkg-30000
> cpkg-31008
> cpkg-3000A
> cpkg-30006
> nsug-300AB
> nsug-300A2
> cpdg-30001
> nsug-300A3
Well, in the realm of sequence analysis, it is trivial to devise a regex
for these values because they are already aligned and of fixed length.
This is a couple of more lines than it needs to be, only so its easier
to follow the logic. This uses throw-away groups to avoid bracketed
sets, becuase you have dashes in your items. You might need to tweak the
following code if you have characters special to regex in your sequences
or if you want to use bracketed sets. The place to do this is in the
_joiner() function.
values = """
cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3
""".split()
# python 2.5 has new list comp features to shorten this test, but
# the resulting list comp can begin to look ugly if the alternatives
# are complicated
def _joiner(c):
if len(c) == 1:
# will raise KeyError for empty column
return c.pop()
else:
return "(?:%s)" % '|'.join(c)
columns = [set(c) for c in zip(*values)]
col_strs = [_joiner(c) for c in columns]
rgx_str = "".join(col_strs)
exact_rgx_str = "^%s$" % rgx_str
# '(?:c|n)(?:p|s)(?:k|u|d)g-3(?:1|0)0(?:A|0)(?:A|B|1|0|3|2|6|8)'
print rgx_str
But, if you get much more complicated that this, you will definitely
want to check out hmmer, especially if you can align your sequences.
James
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
http://www.jamesstroud.com/
More information about the Python-list
mailing list