[Tutor] FASTA FILE SUB-SEQUENCE EXTRACTION
syed zaidi
syedzaidi85 at hotmail.co.uk
Tue Mar 8 07:19:47 EST 2016
Well, FASTA is a file format used by biologists to store biological sequences. The format is as follows:

    >sequence information (sequence name, sequence length, etc.)
    genomic sequence
    >sequence information (sequence name, sequence length, etc.)
    genomic sequence

I want to match the name of each sequence against another list of sequence names, and extract the sub-sequence between the start and end positions provided for that sequence. So the pseudo code could be:

    if line starts with '>':
        match the header name with sequence name
        if sequence name found:
            splice from the given start and end positions of that sequence

The code I have devised so far is:

    import os

    with open('E:/scaftig.sample - Copy.scaftig','r') as f:
        header = f.readline()
        header = header.rstrip(os.linesep)
        sequence = ''
        for line in f:
            line = line.rstrip('\n')
            if line[0] == '>':
                header = header[:]
                print header
            if line[0] != '>':
                sequence += line
        print sequence, len(sequence)

I would appreciate it if you can help.

Thanks
Best Regards
Ali
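The pseudo code above could be sketched roughly as follows. This is only one possible approach, not code from the thread: the function names (read_fasta, read_coords, extract) are made up for illustration, the sample data is a toy stand-in for the real files, and it assumes that the sequence name is the first whitespace-separated word after '>' and that the start and end positions are 1-based and inclusive. The thread does not state the coordinate convention, so that last assumption in particular should be checked against the real data.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of indexing into them
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(chunks)  # emit the previous record
            fields = line[1:].split()
            # Assumption: the name is the first word after '>'
            name = fields[0] if fields else ''
            chunks = []
        elif name is not None:
            chunks.append(line)  # a sequence may span several lines
    if name is not None:
        yield name, ''.join(chunks)  # emit the last record


def read_coords(lines):
    """Map scaffold name -> list of (gene_name, start, end)."""
    coords = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines and the column-header row
        if len(fields) < 4 or not fields[2].isdigit():
            continue
        scaf, gene, start, end = fields[:4]
        coords[scaf].append((gene, int(start), int(end)))
    return coords


def extract(fasta_lines, coord_lines):
    """Return (scaffold, gene, start, end, sub-sequence) tuples."""
    coords = read_coords(coord_lines)
    results = []
    for name, seq in read_fasta(fasta_lines):
        for gene, start, end in coords.get(name, []):
            # Assumption: 1-based, inclusive positions
            results.append((name, gene, start, end, seq[start - 1:end]))
    return results


# Toy data, invented for this example only
fasta = """>scaf_A length=12
ACGTACGT
AAGG
>scaf_B length=6
TTTTCC
""".splitlines()

coords = """Scaf_name Gene_name DS_St DS_En
scaf_A gene1 3 6
scaf_B gene2 5 6
""".splitlines()

for record in extract(fasta, coords):
    print(record)
```

Both readers accept any iterable of lines, so open file objects can be passed in place of the toy lists. Collecting sequence lines in a list and joining them once per record avoids repeated string concatenation on long scaffolds.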
> Date: Tue, 8 Mar 2016 03:11:42 -0500
> Subject: Re: [Tutor] FASTA FILE SUB-SEQUENCE EXTRACTION
> From: wolfrage8765 at gmail.com
> To: syedzaidi85 at hotmail.co.uk
>
> What is FASTA? This seems very specific. Do you have any code thus far
> that is failing?
>
> On Tue, Mar 8, 2016 at 2:33 AM, syed zaidi <syedzaidi85 at hotmail.co.uk> wrote:
> > Hello all,
> > I am stuck on a problem and I hope someone can help me out. I have a FASTA file with multiple sequences and another file with the gene coordinates.
> >
> > SAMPLE FASTA FILE:
> > >EBM_revised_C2034_1 length=611
> > GCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATT
> > >EBM_revised_C2104_1 length=923
> > TCCGAGGGCGGTGGGATGTTGGTGCTGCAGCGGCTTTCGGATGCGCGGCGGTTGGGTCATCCGGTGTTGGCGGTGGTGGTCGGGTCGGCGGTTAATCAGGATGGGGCGTCGAATGGGTTGACCGCGCCTAATGGTCCTTCGCAGCAGCGGGTGGTGCGGGCGGCGTTGGCCAATGCCGGGTTGAGCGCGGCCGAGGTGGATGTGGTGGAGGGGCATGGGACCGGGACCACGTTGGGGGATCCGATTGAGGCTCAGGCGTTGTTGGCCACTTATGGGCAAGATCGGGGGGAGCCGGGAGAACCTTTGTGGTTGGGGTCGGTGAA
> > GTCGAATATGGGTCATACGCAGGCCGCGGCGGGGGTGGCCGGGGTGATCAAGATGGTGTTGGCGATGCGCCATGAGCTGTTGCCGGCGACGTTGCACGTGGATGTGCCTAGCCCGCATGTGGATTGGTCGGCGGGGGCGGTGGAGTTGTTGACCGCGCCGCGGGTGTGGCCTGCTGGTGCTCGGACGCGTCGTGCGGGGGTGTCGTCGTTTGGGATTAGTGGCACTAATGCGCATGTGATTATCGAGGCGGTGCCGGTGGTGCCGCGGCGGGAGGCTGGTTGGGCGGGGCCGGTGGTGCCGTGGGTGGTGTCGGCGAAGTCGGAGTCGGCGTTGCGGGGGCAGGCGGCTCGGTTGGCCGCGTACGTGCGTGGCGATGATGGCCTCGATGTTGCCGATGTGGGGTGGTCGTTGGCGGGTCGTTCGGTTTTTGAGCATCGGGCGGTGGTGGTTGGCGGGGACCGTGATCGGTTGTTGGCCGGGCTCGATGAGCTGGCGGGTGACCAGTTGGGCGGCTCGGTTGTTCGGGGCACGGCGACTGCGGCGGGTAAGACGGTGTTCGTCTTCCCCGGCCAAGGCTCCCAATGGCTGGGCATGGGAAT
> >
> > GENE COORD FILE
> > Scaf_name            Gene_name  DS_St  DS_En
> > EBM_revised_C2034_1  gene1_1    33     99
> > EBM_revised_C2034_1  gene1_1    55     100
> > EBM_revised_C2034_1  gene1_1    111    150
> > EBM_revised_C2104_1  gene1_1    44     70
> >
> > I want to perform the following steps:
> > 1. Compare the scaf_name with the header of each FASTA sequence.
> > 2. If the header matches, process the sequence and extract the sub-sequence between the provided start and end positions.
> >
> > I would appreciate if someone can help
> > Thanks
> > Best Regards
> >
> > Ali
> >
> >> _______________________________________________
> >> Tutor maillist - Tutor at python.org
> >> To unsubscribe or change subscription options:
> >> https://mail.python.org/mailman/listinfo/tutor