[Tutor] FASTA parsing, biological sequence analysis
Danny Yoo
dyoo at hashcollision.org
Tue Apr 1 05:04:39 CEST 2014
On Tue, Mar 25, 2014 at 8:36 AM, Sydney Shall <s.shall at virginmedia.com> wrote:
> I did not know about biopython, but then I am a debutant.
> I tried to import biopython and I get the message that the name is unknown.
No problem. It is an external library; I hope that you were able to
find it! I just want to make sure no one else tries to write yet
another FASTA parser badly. It's all too easy to code something
quick-and-dirty that almost solves the issue. The devil's in the
details.
It might be instructive to look at source code. You can look at:
https://github.com/biopython/biopython/blob/master/Bio/SeqIO/FastaIO.py
and see all the implementation details the Biopython community has had
to consider in the real world.
These include things like skipping crazy garbage at the beginning of files,
https://github.com/biopython/biopython/blob/master/Bio/SeqIO/FastaIO.py#L40-L45
and providing a stream-like interface by using generators (using the
"yield" statement), so records are produced one at a time instead of
loading the whole file into memory:
https://github.com/biopython/biopython/blob/master/Bio/SeqIO/FastaIO.py#L65
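
To make those two ideas concrete, here is a rough sketch of my own. It
is not Biopython's code, and the name iter_fasta is made up for
illustration. It assumes a text handle, ignores everything before the
first ">" line, and yields one (title, sequence) pair at a time:

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) tuples from a FASTA-formatted text handle.

    Hypothetical helper for illustration only; not Biopython's parser.
    """
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts here; emit the previous one, if any.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:].strip()
            chunks = []
        elif title is not None:
            # Sequence data may be wrapped across several lines.
            chunks.append(line.strip())
        # Lines seen before the first ">" are junk and are skipped.
    if title is not None:
        yield title, "".join(chunks)


sample = io.StringIO("junk header\n\n>seq1 first\nACGT\nTTGA\n>seq2\nGGCC\n")
records = list(iter_fasta(sample))
```

Even this leaves out plenty that the real parser handles, which is
rather the point.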
But also consider data validation facilities. At least, the Biopython
folks have. They provide a way to declare the genomic alphabet to be
used:
https://github.com/biopython/biopython/blob/master/Bio/SeqIO/FastaIO.py#L73
https://github.com/biopython/biopython/blob/master/Bio/Alphabet/
where if the input data doesn't match the allowed alphabet, you'll get
a good warning about it ahead of time. This is checked in places
like:
https://github.com/biopython/biopython/blob/master/Bio/Alphabet/__init__.py#L375
https://github.com/biopython/biopython/blob/master/Bio/Seq.py#L336
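
The idea behind that kind of check can be sketched in a few lines of
plain Python. This is only my illustration of the concept, not how
Biopython implements it; check_alphabet and DNA_LETTERS are names I
made up, and the allowed letters are assumed to be plain unambiguous
DNA:

```python
# Assumed alphabet for illustration: unambiguous DNA only.
DNA_LETTERS = frozenset("ACGT")


def check_alphabet(sequence, allowed=DNA_LETTERS):
    """Return sequence unchanged, or raise ValueError on unexpected letters.

    Hypothetical helper; the comparison is case-insensitive.
    """
    bad = sorted(set(sequence.upper()) - allowed)
    if bad:
        raise ValueError("unexpected letters in sequence: " + ", ".join(bad))
    return sequence


checked = check_alphabet("acgtACGT")
```

Calling check_alphabet("ACGU") raises ValueError naming the letter U,
so a stray RNA or protein sequence gets caught early rather than
silently flowing into later analysis.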
In short, in the presence of potentially messy data, the developers
have thought about these sorts of issues and have programmed for those
situations.
As the commit history demonstrates:
https://github.com/biopython/biopython/commits/master
they started work in the last century (since at least 1999-12-07),
and continue to work on it even now. So taking advantage of their
generous hard work is a good idea.