regex multiple patterns in order
Roy Smith
roy at panix.com
Mon Jan 20 09:52:59 EST 2014
In article <mailman.5748.1390216721.18130.python-list at python.org>,
Ben Finney <ben+python at benfinney.id.au> wrote:
> With a little experimenting I get:
>
> >>> p = re.compile('((?:CAA)+)?((?:TCT)+)?((?:TA)+)?')
> >>> p.findall('CAACAACAATCTTCTTCTTCTTATATA')
> [('CAACAACAA', 'TCTTCTTCTTCT', 'TATATA'), ('', '', '')]
Perhaps a matter of style, but I would have left off the ?: markers and
done this:
import re

p = re.compile('((CAA)+)((TCT)+)((TA)+)')
m = p.match('CAACAACAATCTTCTTCTTCTTATATA')
print(m.groups())
$ python r.py
('CAACAACAA', 'CAA', 'TCTTCTTCTTCT', 'TCT', 'TATATA', 'TA')
The ?: says, "match this group, but don't save it". The advantage of
that is you don't get unwanted groups in your match object. The
disadvantage is they make the pattern more difficult to read. My
personal opinion is I'd rather make the pattern easier to read and just
ignore the extra matches in the output (in this case, I want items 0,
2, and 4 of the groups() tuple, i.e. m.group(1), m.group(3), and
m.group(5)).
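To make the trade-off concrete, here is a minimal sketch comparing the two styles on the same input. The variable names are only for illustration; the patterns are the ones discussed above, minus the outer ?s.

```python
import re

seq = 'CAACAACAATCTTCTTCTTCTTATATA'

# Capturing inner groups: each repeated unit is saved alongside its run.
capturing = re.compile(r'((CAA)+)((TCT)+)((TA)+)')
# Non-capturing inner groups: only the three runs are saved.
non_capturing = re.compile(r'((?:CAA)+)((?:TCT)+)((?:TA)+)')

all_groups = capturing.match(seq).groups()
runs_only = non_capturing.match(seq).groups()

# Every other item of the capturing result is one of the runs,
# so slicing with a step of 2 gives the same tuple as the ?: version.
print(all_groups[::2])
print(runs_only)
```

Either way you end up with the same three runs; the only question is whether you pay for it in the pattern or in the code that reads the result.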
I also left off the outer ?s, because I think this better represents the
intent. The pattern '((CAA)+)?((TCT)+)?((TA)+)?' matches, for example,
an empty string; I suspect that's not what was intended.
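A quick sketch of the difference, assuming nothing beyond the two patterns already shown. It also accounts for the trailing ('', '', '') in the quoted findall() output: once the real match has consumed the whole string, the all-optional pattern matches the empty string left at the end.

```python
import re

seq = 'CAACAACAATCTTCTTCTTCTTATATA'

optional = re.compile(r'((?:CAA)+)?((?:TCT)+)?((?:TA)+)?')
required = re.compile(r'((?:CAA)+)((?:TCT)+)((?:TA)+)')

# With every group optional, the pattern matches an empty string,
# and it "matches" (with zero length) text that has none of the repeats.
empty_ok = optional.match('') is not None
junk_match = optional.match('GGGG').group(0)

# Without the outer ?s, the empty string is rejected.
empty_rejected = required.match('') is None

# findall() reports the real match and then the empty match at the end.
optional_hits = optional.findall(seq)
required_hits = required.findall(seq)

print(empty_ok, repr(junk_match), empty_rejected)
print(optional_hits)
print(required_hits)
```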
> Be aware that regex is not the solution to all parsing problems; for
> many parsing problems it is an attractive but inappropriate tool. You
> may need to construct a more specific parser for your needs. Even if
> it's possible with regex, the resulting pattern may be so complex that
> it's better to write it out more explicitly.
Oh, posh.
You are correct; regex is not the solution to all parsing problems, but
it is a powerful tool which people should be encouraged to learn. For
some problems, it is indeed the correct tool, and this seems like one of
them. Discouraging people from learning about regexes is an educational
anti-pattern which I see distressingly often on this newsgroup.
Several lives ago, I worked in a molecular biology lab writing programs
to analyze DNA sequences. Here's a common problem: "Find all the places
where GACGTC or TTCGAA (or any of a similar set of 100 or so short
patterns) appear". I can't think of an easier way to represent that in
code than a regex.
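As a rough sketch of what that looks like with the standard re module (this is not the custom compiler from the paper mentioned in this message): join the patterns with | and scan with finditer(). The two-entry list, the find_sites name, and the sample sequence are made up for illustration; a real list would hold the 100 or so sites.

```python
import re

# Hypothetical short list standing in for the full set of sites.
sites = ['GACGTC', 'TTCGAA']

# One alternation covering every site; re.escape is a precaution in
# case a pattern ever contains a regex metacharacter.
pattern = re.compile('|'.join(re.escape(s) for s in sites))

def find_sites(dna):
    """Return (start, matched_text) for each non-overlapping hit."""
    return [(m.start(), m.group(0)) for m in pattern.finditer(dna)]

hits = find_sites('AAGACGTCTTTTCGAAGG')
print(hits)
```

Note that finditer() only reports non-overlapping matches; if two sites can overlap in the sequence, wrapping the alternation in a lookahead, (?=(...)), is one way to catch both.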
Sure, it'll be a huge regex, which may take a long time to compile, but
one of the nice things about these sorts of problems is that the
patterns you are looking for tend not to change very often. For
example, the problem I mention in the preceding paragraph is finding
restriction sites, i.e. the locations where restriction enzymes will cut
a strand of DNA. There's a finite set of commercially available
restriction enzymes, and that list doesn't change from month to month
(at this point, maybe even from year to year).
For more details, see
http://bioinformatics.oxfordjournals.org/content/4/4/459.abstract
Executive summary: I wrote my own regex compiler which was optimized for
the types of patterns this problem required.