Wildcard String Comparisons: Set Pattern to a Wildcard Source

Tue Oct 5 16:06:37 EDT 2010

On Oct 5, 3:38 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
> On 05/10/2010 20:03, chaoticcran... at gmail.com wrote:
>
> > So, I have a rather tricky string comparison problem: I want to search
> > for a set pattern in a variable source.
>
> > To give you the context, I am searching for set primer sequences
> > within a variable gene sequence. In addition to the non-degenerate A/G/
> > C/T, the gene sequence could have degenerate bases that could encode
> > for more than one base (for example, R means A or G, N means A or G or
> > C or T). One brute force way to do it would be to generate every
> > single non-degenerate sequence the degenerate sequence could mean and
> > do my comparison with all of those, but that would of course be very
> > space and time inefficient.
>
> > For the sake of simplicity, let's say I replace each degenerate base
> > with a single wildcard character "?". We can do this because there are
> > so many more non-degenerate bases that the probability of a degenerate
> > mismatch is low if the nondegenerates in a primer match up.
>
> > So, my goal is to search for a small, set pattern (the primer) inside
> > a large source with single wildcard characters (my degenerate gene).
>
> > The first thing that comes to my mind are regular expressions, but I'm
> > rather n00bish when it comes to using them and I've only been able to
> > find help online where the smaller search pattern has wildcards and
> > the source is constant, such as here:
> > http://www.velocityreviews.com/forums/t337057-efficient-string-lookup...
>
> > Of course, that's the reverse of my situation and the proposed
> > solutions there won't work for me. So, could you help me out, oh great
> > Python masters? *bows*
>
> Stand back, I'm going to try regex. :-)
>
> Both "A" and "R" in the variable sequence should match "A" in the
> primer sequence, so "A" in the primer sequence should be replaced by
> the character set "[AR]". The other bases should be replaced similarly.
>
> Use a simple dict lookup:
>
> wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}
>
> and create the regex for the primer sequence:
>
> import re
> primer_pattern = re.compile("".join(wildcards[c] for c in primer))
>
> Would that work?

Thank you for your response, MRAB.

That's a rather clever way to do this sort of matching, but I actually
forgot one other crucial thing in my problem description (and I'm
hitting myself on the head for forgetting it!) - I need to know at
what position in my gene the primer was found.

As far as I know (and I'm a regex n00b, so please tell me if I'm
wrong), you can't use string's find() on a regex and regex's match()
does not return a position in the regex. I understand there are
elements in regular expressions that expand to variable numbers of
characters so a "position number" in a regular expression is often a
meaningless concept. Here, however, my regular expression has a 1 to 1
correspondence since each degenerate base should occupy only one
wildcard slot. In this particular case, a position number is
meaningful AND I need to know it for my program.

Now... is there anything we can do about that?
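
For what it's worth, here is a minimal sketch of one way the position could
be recovered, assuming MRAB's dict is used as-is. A compiled pattern's
search() returns a match object, and its start() method gives the offset in
the string that was searched (here, the gene), not in the regex. The helper
name find_primer and the example sequences are made up for illustration, and
the dict only covers the degenerate codes R and N, so any other codes would
have to be added to it.

```python
import re

# Mapping taken from MRAB's reply; it only covers the degenerate codes R and N.
wildcards = {"A": "[ARN]", "G": "[GRN]", "C": "[CN]", "T": "[TN]"}


def find_primer(primer, gene):
    """Return every 0-based offset in gene where primer could match.

    find_primer is a hypothetical helper, not part of any library.
    """
    # Each primer base becomes one single-character class, so a match
    # always spans exactly len(primer) characters of the gene.
    pattern = re.compile("".join(wildcards[c] for c in primer))
    positions = []
    start = 0
    while True:
        m = pattern.search(gene, start)
        if m is None:
            break
        positions.append(m.start())
        # Resume one character after this match's start so that
        # overlapping matches are reported too.
        start = m.start() + 1
    return positions


gene = "TTARNCGTAGCT"   # made-up gene containing degenerate R and N
primer = "AGC"          # made-up primer
print(find_primer(primer, gene))  # [2, 3, 8]
```

Because every primer base maps to exactly one character of the gene, the
match ends at start + len(primer), so the start offset alone is enough to
locate the primer.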