[Tutor] regular expressions

Wed Dec 7 18:22:00 CET 2005

On Wed, 7 Dec 2005, ps python wrote:

>  I am a new python learner. i am trying to parse a file using regular
> expressions.

Hello,

Just as an aside: parsing GenBank flat files by hand like this is not such
a good idea, because you can get the GenBank XML files instead.  For
example, your locus NM_005417 has a perfectly good XML representation
(using the GBSeq XML format):

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&qty=1&c_start=1&list_uids=38202215&dopt=gbx&dispmax=5&sendto=

This format contains the same content as the human-readable text report,
but structured in a way that makes it easier to extract elements if we use
an XML parser like ElementTree.

    http://effbot.org/zone/element-index.htm
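
As a rough sketch of what that extraction might look like: the XML below is
a tiny hand-written stand-in for a GBSeq record, not real NCBI output, and
the definition text in it is made up.  The element names (GBSet, GBSeq,
GBSeq_locus, GBSeq_definition) are my assumption about the GBSeq layout, so
check them against an actual download.  Recent Python versions ship
ElementTree in the standard library as xml.etree.ElementTree, which is what
this uses.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a GBSeq record (assumed layout, made-up text).
# A real download has many more fields.
SAMPLE = """<GBSet>
  <GBSeq>
    <GBSeq_locus>NM_005417</GBSeq_locus>
    <GBSeq_definition>example definition text</GBSeq_definition>
  </GBSeq>
</GBSet>"""


def extract_records(xml_text):
    """Return a list of (locus, definition) pairs, one per GBSeq element."""
    root = ET.fromstring(xml_text)
    records = []
    for seq in root.iter("GBSeq"):
        # findtext() returns None if the child element is missing.
        locus = seq.findtext("GBSeq_locus")
        definition = seq.findtext("GBSeq_definition")
        records.append((locus, definition))
    return records


for locus, definition in extract_records(SAMPLE):
    print(locus, definition)
```

The point is that each field is looked up by name, so there are no
column positions or line prefixes to match with a pattern.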

And even if that weren't available, we might also consider using the
parsers that come with the BioPython project:

http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/genbank/
http://www.biopython.org/docs/tutorial/Tutorial004.html#toc13

I guess I'm trying to say: you might not want to reinvent the wheel,
since it's been done several times already.  *grin*

If you're doing this to learn regular expressions, that's fine too.  Just
be aware that those other modules are out there.

Let's look at the code.

> for line in dat:
>      a = pat1.match(line)
>      b = pat2.match(line)
>      c = pat3.match(line)
>      d = pat4.match(line)

Use the search() method, not the match() method.  match() only succeeds if
the pattern matches at the very beginning of the string (here, the start
of each line), so it'll miss things if your pattern is in the middle
somewhere.  search() scans the whole line for the first place the pattern
matches.
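
Here's a small illustration of the difference.  The pattern and the sample
line are made up for this example; they are not taken from your program.

```python
import re

# Hypothetical pattern and input line, for illustration only.
pat = re.compile(r"NM_\d+")
line = "VERSION     NM_005417.3  GI:38202215"

# match() only tries at position 0, and the line starts with "VERSION".
print(pat.match(line))            # None

# search() scans forward until it finds the pattern.
print(pat.search(line).group())   # NM_005417
```

Both methods return None when nothing is found, so check the result before
calling group() on it.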

There's a discussion about this in the Regex HOWTO:

http://www.amk.ca/python/howto/regex/
http://www.amk.ca/python/howto/regex/regex.html#SECTION000720000000000000000

If you have more questions, please feel free to ask.