[Tutor] FASTA parsing, biological sequence analysis

Mon Mar 24 19:58:48 CET 2014

Hi Jumana,

Following up.  Let's change the subject line.  This makes it much
easier for folks to see that this is a new topic of conversation.

[Apologies to the others on the list for my last reply: I didn't
notice that the subject line was wrong, or that I had left in the long
quoted digest.  I'll try to be more careful next time.]

Jumana, I would strongly suggest separating string parsing issues from
computational issues.  The reason for suggesting Biopython is twofold:
not only do you avoid writing a FASTA parser, but it also puts you in
the right mindset of processing _multiple_ sequences.

You are encountering this problem, as your comment suggests:

> I wrote a program close to what Denis suggested , however it works only if I
> have one sequence (one header and one sequence), I can not modify it to work
> if I have several sequences (like above).

You want the structure of your program to do an analysis on each
biological sequence, rather than just on each character of a single
sequence.

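###
### pseudocode below:
###
import sys

from Bio import SeqIO

def doAnalysis(record):
    # Called once per FASTA record; record.seq holds one whole sequence.
    print("I see:", record.id, record.seq)
    ## fill me in

# SeqIO.parse yields the records from standard input one at a time.
for record in SeqIO.parse(sys.stdin, 'fasta'):
    doAnalysis(record)
###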

And you can fill in the details of doAnalysis() so that it does the
nucleotide counting and only needs to worry about the contents of the
record's single sequence.
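
For instance, here is a rough sketch of how doAnalysis() might be
filled in.  I am assuming that "nucleotide counting" means tallying A,
C, G and T while ignoring case; count_nucleotides() is a helper name I
made up for illustration, so adjust it to whatever your assignment
actually asks for.

```python
from collections import Counter

def count_nucleotides(seq):
    """Tally A, C, G and T in a single sequence, ignoring case.

    Assumption: only these four bases are of interest; any other
    symbol (N, gaps, ...) is left out of the result.
    """
    counts = Counter(str(seq).upper())
    return {base: counts[base] for base in "ACGT"}

def doAnalysis(record):
    # record is expected to have .id and .seq attributes, as the
    # records produced by Bio.SeqIO.parse do.
    print(record.id, count_nucleotides(record.seq))
```

Note that doAnalysis() never needs to know how many sequences the file
holds; the loop over the records takes care of that.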

In bioinformatics contexts, you must either manage memory consumption
yourself or use libraries that naturally lend themselves to doing
things in a memory-careful way; otherwise your computer will start
swapping to disk.  That holds unless your data sets are trivial, which
I am guessing is not the case.
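
To make "memory-careful" a bit more concrete, here is a minimal sketch
that keeps running totals while consuming records one at a time,
instead of first collecting every record in a list.  The function name
total_base_counts() is my own invention, and it assumes only that each
record has a .seq attribute.

```python
from collections import Counter

def total_base_counts(records):
    """Sum symbol counts over an iterable of records, one at a time.

    Only the running totals are kept, so memory use does not grow
    with the number of records, provided `records` is itself a lazy
    iterator rather than a list.
    """
    totals = Counter()
    n_records = 0
    for record in records:
        totals.update(str(record.seq).upper())
        n_records += 1
    return n_records, totals
```

Since SeqIO.parse() returns an iterator, calling
total_base_counts(SeqIO.parse(sys.stdin, 'fasta')) should read the
file record by record.  Wrapping the parse call in list() first would
load every record into memory and defeat the purpose.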

In short, please use the Biopython library.  It will handle a lot of
issues that you are not considering, including memory consumption and
correct, stream-oriented parsing of FASTA.