[Tutor] Taking FASTA file as an input in Python 3

Sun Oct 20 22:04:45 EDT 2019

@Mats Wichmann <mats at wichmann.us>  Thanks! The code you provided works
great. I have a few more questions.  This is how I used the code:

import sys
fasta = input('insert your fasta file name here: ')

with open (fasta , "r") as DNA_sequence:
     DNA_sequence.readline()  # throw away first line
     print ("DNA_sequence: \n")
     for line in DNA_sequence:
         x = sys.stdout.write(line)

y = str(x)
print (len(y))

This allows me to type the file name as the input (although my goal is to
be able to drag and drop the FASTA file into the shell when the input
prompt appears. Is there a way to do this?)
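
On the drag-and-drop question: many terminal windows paste the file's path
at the prompt when a file is dropped onto them, sometimes wrapped in quotes.
Whether this happens depends on the terminal or shell in use, so what follows
is only a sketch under that assumption. clean_dropped_path is a hypothetical
helper name, not part of the code above, and it does not handle other
escaping styles such as backslash-escaped spaces.

```python
def clean_dropped_path(raw):
    """Strip whitespace and one pair of surrounding quotes from a pasted path."""
    path = raw.strip()
    # Some terminals wrap a dropped path in matching single or double quotes.
    if len(path) >= 2 and path[0] == path[-1] and path[0] in ('"', "'"):
        path = path[1:-1]
    return path


# Intended use (assumption: the terminal pastes the path on drop):
#   fasta = clean_dropped_path(input('insert your fasta file name here: '))
```

The cleaned string can then be passed to open() exactly as before.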

I used sys.stdout.write() to remove any blank spaces and indentation
in the FASTA file so that I get a continuous string (although
sys.stdout.write() seems to convert the text into an integer. I don't
understand how! How can letters and characters be converted into integers
unless we are talking binary?)
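
A note on the integer: in Python 3, write() on a text stream returns the
number of characters it wrote, so the number is a count, not a converted form
of the text. write() also does not strip anything from the line; it simply
does not append the extra newline that print() adds. A small sketch, using an
in-memory io.StringIO buffer in place of sys.stdout (a substitution made here
only so the result is easy to inspect):

```python
import io

buffer = io.StringIO()            # stands in for sys.stdout
count = buffer.write("ATGGAA\n")  # write() returns how many characters it wrote

written_text = buffer.getvalue()  # the text itself is unchanged
print(count)                      # 7: six letters plus the newline
print(repr(written_text))
```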

When I run the program, it simply prints out the sequence, regardless of
whether or not I make a call to print the variable it is assigned to.

     for line in DNA_sequence:
         x = sys.stdout.write(line)

I don't think it stores the sequence in the variable "x" at all. How can I
do this?

y = str(x)
print (len(y))

Also, here the print(len(y)) did not print the length of the sequence. It
printed the number 1 instead. Why so?
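
One possible explanation for the 1: x is overwritten on every pass through
the loop, so afterwards it holds only the character count of the last line
written, and len(str(x)) counts the digits in that number. A result of 1
would then mean the last line had fewer than ten characters, for instance a
blank line at the end of the file. The sketch below uses made-up lines and an
io.StringIO buffer instead of the real file and sys.stdout:

```python
import io

lines = ["ATGGAAAAAA\n", "TGTTAATGGA\n", "\n"]  # made-up data; last line is blank
out = io.StringIO()                              # stands in for sys.stdout

for line in lines:
    x = out.write(line)  # x is replaced each time with a character count

y = str(x)               # the count as text, not the sequence
print(x, repr(y), len(y))
```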

I am just trying to play with this code here. Eventually my goal is to be
able to take the FASTA file as an input --> skip the first line -->
convert the rest of the text into a continuous string --> store this string
in a variable, so that I can use it to do other things.
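
The steps listed above could be sketched roughly as follows. This assumes the
file holds a single FASTA record (one description line followed by sequence
lines); read_fasta_sequence is a hypothetical function name chosen for
illustration.

```python
def read_fasta_sequence(path):
    """Return the sequence of a single-record FASTA file as one string."""
    with open(path, "r") as handle:
        handle.readline()  # skip the description line
        # strip() removes the trailing newline and any surrounding whitespace,
        # and join() glues the pieces into one continuous string
        return "".join(line.strip() for line in handle)


# Intended use:
#   sequence = read_fasta_sequence(fasta)
#   print(len(sequence))
```

Because the function returns the string rather than writing it out, the
result can be stored in a variable and its length taken directly.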

On Sun, Oct 20, 2019 at 12:51 PM Mats Wichmann <mats at wichmann.us> wrote:

> On 10/20/19 11:00 AM, Mihir Kharate wrote:
> > Hello,
> >
> > I want my python program to ask for an input that accepts the FASTA
> > files. FASTA files are a type of text files that we use in
> > bioinformatics. The first line in a FASTA file is a description about
> > the gene it is encoding. The data starts with the second line. An
> > example of the fasta format would be:
> >
> >> NC_003423.3:c429013-426160 Schizosaccharomyces pombe chromosome II, complete sequence
> > ATGGAAAAAATAAAACTTTTAAATGTAAAAACTCCCAATCATTATACTATTATTTTCAAGGTGGTGGCAT
> > ACTACAGCGCACTTCAACCTAACCAAAACGAACTACGAAAAGTACGAATGCTTGCTGCTGAAAGTTCTAA
> > TGTTAATGGATTATTTAAATCAGTAGTTGCTGTTTTAGATTGTGATGATGAAACGGTACTATTTTGAATT
> > ATCAATTGGGTTTGCTGACTTTGTTTACCTAGAAAGAATTGTTCATTAAAAATGACGGGAAAGCTTTGAG
> > TTTTCCGTATGACTGGAAGCTGGCAACTCATGTTATATGCGATGACTTTTCCTCTCCTAATGTACAAGAA
> >
> >
> > I found the following code online and tried to print it to see
> > whether the first line is overread:
> >
> >>   DNA_sequence = open ("sequence.fasta" , "r")
> >>   DNA_sequence.readline()
> >>   print ("DNA_sequence")
> >
> > However, this prints the following statement;
> >>   <_io.TextIOWrapper name='sequence.fasta' mode='r' encoding='cp1252'>
>
> You cannot have sent us the program you are actually using, because as
> written, the output must be *exactly*
>
> DNA_sequence
>
> If you are printing it without the quote marks, then you will get what
> you have pasted: DNA_sequence is the name associated with the open file
> reference, and that's exactly what it is telling you.
>
> If you want to actually print the data being read from the file, you
> will need to save a reference to it and print that.  Maybe something
> like this?:
>
> with open ("sequence.fasta" , "r") as DNA_sequence:
>      DNA_sequence.readline()  # throw away first line
>      print ("DNA_sequence")
>      for line in DNA_sequence:
>          print(line)