regex multiple patterns in order

Mon Jan 20 09:52:59 EST 2014

In article <mailman.5748.1390216721.18130.python-list at python.org>,
 Ben Finney <ben+python at benfinney.id.au> wrote:

> With a little experimenting I get:
> 
>     >>> p = re.compile('((?:CAA)+)?((?:TCT)+)?((?:TA)+)?')
>     >>> p.findall('CAACAACAATCTTCTTCTTCTTATATA')
>     [('CAACAACAA', 'TCTTCTTCTTCT', 'TATATA'), ('', '', '')]

Perhaps a matter of style, but I would have left off the ?: markers and 
done this:

import re

p = re.compile('((CAA)+)((TCT)+)((TA)+)')
m = p.match('CAACAACAATCTTCTTCTTCTTATATA')
print(m.groups())

$ python r.py
('CAACAACAA', 'CAA', 'TCTTCTTCTTCT', 'TCT', 'TATATA', 'TA')

The ?: says, "match this group, but don't save it".  The advantage of 
that is you don't get unwanted groups in your match object.  The 
disadvantage is they make the pattern more difficult to read.  My 
personal opinion is I'd rather make the pattern easier to read and just 
ignore the extra groups in the output (in this case, I want elements 0, 
2, and 4 of m.groups(), i.e. groups 1, 3, and 5).
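
A minimal sketch of picking out just the wanted groups, using the same 
pattern and input as above.  The variable name "runs" is mine, purely 
for illustration; Match.group() accepts several group numbers at once 
and returns them as a tuple.

```python
import re

# Same capturing pattern as above: each run is wrapped in an outer group,
# and each repeated unit in an inner group.
p = re.compile('((CAA)+)((TCT)+)((TA)+)')
m = p.match('CAACAACAATCTTCTTCTTCTTATATA')

# Group 0 is the whole match; groups 1, 3 and 5 are the three full runs.
runs = m.group(1, 3, 5)
print(runs)  # ('CAACAACAA', 'TCTTCTTCTTCT', 'TATATA')

# Equivalent: take every other element of m.groups().
print(m.groups()[::2] == runs)  # True
```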

I also left off the outer ?s, because I think this better represents the 
intent.  The pattern '((CAA)+)?((TCT)+)?((TA)+)?' matches, for example, 
an empty string; I suspect that's not what was intended.
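
To make the difference concrete, here is a small sketch comparing the 
two patterns on an empty string.  The names "optional" and "required" 
are just labels I picked.  The same empty match is also where the 
trailing ('', '', '') in the quoted findall() output comes from: once 
the real match is consumed, the all-optional pattern matches once more, 
with zero length, at the end of the string.

```python
import re

optional = re.compile('((CAA)+)?((TCT)+)?((TA)+)?')
required = re.compile('((CAA)+)((TCT)+)((TA)+)')

# Every part of the first pattern is optional, so it matches nothing at all.
empty_optional = optional.match('')
print(empty_optional is not None)  # True
print(empty_optional.groups())     # all six groups are None

# Without the outer ?s, each run must appear at least once.
empty_required = required.match('')
print(empty_required)              # None
```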

> Be aware that regex is not the solution to all parsing problems; for
> many parsing problems it is an attractive but inappropriate tool. You
> may need to construct a more specific parser for your needs. Even if
> it's possible with regex, the resulting pattern may be so complex that
> it's better to write it out more explicitly.

Oh, posh.

You are correct; regex is not the solution to all parsing problems, but 
it is a powerful tool which people should be encouraged to learn.  For 
some problems, it is indeed the correct tool, and this seems like one of 
them.  Discouraging people from learning about regexes is an educational 
anti-pattern which I see distressingly often on this newsgroup.

Several lives ago, I worked in a molecular biology lab writing programs 
to analyze DNA sequences.  Here's a common problem: "Find all the places 
where GACGTC or TTCGAA (or any of a similar set of 100 or so short 
patterns) appear".  I can't think of an easier way to represent that in 
code than a regex.

Sure, it'll be a huge regex, which may take a long time to compile, but 
one of the nice things about these sorts of problems is that the 
patterns you are looking for tend not to change very often.  For 
example, the problem I mention in the preceding paragraph is finding 
restriction sites, i.e. the locations where restriction enzymes will cut 
a strand of DNA.  There's a finite set of commercially available 
restriction enzymes, and that list doesn't change from month to month 
(at this point, maybe even from year to year).
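
As a rough sketch of the idea with the stock re module (this is not the 
special-purpose compiler described below), the site list can be joined 
into one alternation and compiled once.  SITES holds only the two 
sequences mentioned above as a stand-in for the full list; SITE_RE, 
find_sites and the test sequence are names and data I made up for 
illustration.

```python
import re

# Stand-in for the real list of 100 or so recognition sequences.
SITES = ['GACGTC', 'TTCGAA']

# Built once at import time; the compile cost is paid a single time
# because the list rarely changes.
SITE_RE = re.compile('|'.join(re.escape(site) for site in SITES))

def find_sites(sequence):
    """Return (offset, site) pairs for each hit, scanning left to right."""
    return [(m.start(), m.group(0)) for m in SITE_RE.finditer(sequence)]

hits = find_sites('AAGACGTCTTTTCGAAGG')
print(hits)  # [(2, 'GACGTC'), (10, 'TTCGAA')]
```

Note that finditer() reports non-overlapping matches only; if sites can 
overlap, the pattern would need something like a lookahead wrapper.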

For more details, see 
http://bioinformatics.oxfordjournals.org/content/4/4/459.abstract

Executive summary: I wrote my own regex compiler which was optimized for 
the types of patterns this problem required.