Wildcard String Comparisons: Set Pattern to a Wildcard Source

MRAB python at mrabarnett.plus.com
Tue Oct 5 15:38:54 EDT 2010


On 05/10/2010 20:03, chaoticcranium at gmail.com wrote:
> So, I have a rather tricky string comparison problem: I want to search
> for a set pattern in a variable source.
>
> To give you the context, I am searching for set primer sequences
> within a variable gene sequence. In addition to the non-degenerate A/G/
> C/T, the gene sequence could have degenerate bases that could encode
> for more than one base (for example, R means A or G, N means A or G or
> C or T). One brute force way to do it would be to generate every
> single non-degenerate sequence the degenerate sequence could mean and
> do my comparison with all of those, but that would of course be very
> space and time inefficient.
>
> For the sake of simplicity, let's say I replace each degenerate base
> with a single wildcard character "?". We can do this because there are
> so many more non-degenerate bases that the probability of a degenerate
> mismatch is low if the nondegenerates in a primer match up.
>
> So, my goal is to search for a small, set pattern (the primer) inside
> a large source with single wildcard characters (my degenerate gene).
>
> The first thing that comes to my mind is regular expressions, but I'm
> rather n00bish when it comes to using them and I've only been able to
> find help online where the smaller search pattern has wildcards and
> the source is constant, such as here:
> http://www.velocityreviews.com/forums/t337057-efficient-string-lookup.html
>
> Of course, that's the reverse of my situation and the proposed
> solutions there won't work for me. So, could you help me out, oh great
> Python masters? *bows*

Stand back, I'm going to try regex. :-)

"A", "R" and "N" in the variable sequence should all match "A" in the
primer sequence, so "A" in the primer sequence should be replaced by
the character set "[ARN]". The other bases should be replaced similarly.
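
As a rough sketch of that replacement rule, the character sets could be derived from a table of degenerate codes instead of being typed by hand. The names DEGENERATE and char_set are made up for illustration, and the table only holds the two codes mentioned in this thread (R and N); any other codes would have to be added the same way.

```python
# Hypothetical table: each degenerate code and the bases it can stand for.
# Only R and N come from this thread; further codes are an assumption
# and would be added as extra entries.
DEGENERATE = {"R": "AG", "N": "AGCT"}


def char_set(base, degenerate=DEGENERATE):
    """Return a regex character set of every source symbol that can mean `base`."""
    # Collect the degenerate codes whose expansion includes this base.
    codes = "".join(code for code, bases in degenerate.items() if base in bases)
    return "[" + base + codes + "]"


# One character set per non-degenerate primer base.
derived = {base: char_set(base) for base in "AGCT"}
print(derived)
```

With only R and N in the table, this should produce the same four character sets as a hand-written lookup for those two codes.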

Use a simple dict lookup:

wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}

and create the regex for the primer sequence:

import re
primer_pattern = re.compile("".join(wildcards[c] for c in primer))

Would that work?
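
Here is a minimal end-to-end sketch of the idea, to make it easier to try out. The function name find_primer and the sample sequences are invented for illustration, and it assumes the primer contains only A, G, C and T (anything else raises a KeyError from the dict lookup).

```python
import re

# Each primer base maps to the set of source symbols that may encode it.
wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}


def find_primer(primer, source):
    """Return the start offsets where the primer could match the source."""
    # Build one character set per primer base and join them into a regex.
    pattern = re.compile("".join(wildcards[c] for c in primer))
    # finditer yields non-overlapping matches, scanning left to right.
    return [m.start() for m in pattern.finditer(source)]


# Made-up source: an exact "AGC" at offset 2 and a degenerate "RNC" at offset 7.
source = "TTAGCTTRNCTT"
hits = find_primer("AGC", source)
print(hits)
```

Note that finditer skips overlapping matches, so if primer sites can overlap in the source, the pattern would need to be wrapped in a lookahead or the search restarted one position after each hit.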


