Issue values dictionary
alex23
wuwei23 at gmail.com
Tue Jun 4 23:17:03 EDT 2013
On Jun 5, 12:41 pm, claire morandin <claire.moran... at gmail.com> wrote:
> But I have a problem storing all the size lengths in the value size, as it always comes back with the last entry.
> Could anyone explain to me what I am doing wrong and how I should set the values for each dictionary?
Your code has two for loops, one that reads ERCC.txt into a dict, and
one that reads blast.txt into a dict. The first assigns to
`transcript`, the second to `blasttranscript`. When the loops are
finished, you're using the _last_ value set for both `transcript` and
`blasttranscript`. So, really, you want _three_ loops: two to load the
files into dicts, then another to compare the two of them. If the
transcripts in blast.txt are guaranteed to be a subset of ERCC.txt,
then you could get away with two loops:

# convenience function for splitting lines into values
def get_transcript_and_size(line):
    columns = line.strip().split()
    return columns[0].strip(), int(columns[1].strip())

# read in blast_file
blast_transcripts = {}
with open('transcript_blast.txt') as blast_file:
    # this is a context manager, it'll close the file when it's finished
    for line in blast_file:
        blasttranscript, blastsize = get_transcript_and_size(line)
        blast_transcripts[blasttranscript] = blastsize

# read in ERCC and compare to blast
with open('transcript_ERCC.txt') as ercc_file, \
     open('Not_sequenced_ERCC_transcript.txt', 'w') as unknown_transcript, \
     open('transcript_out.txt', 'w') as out_file:
    # this is called a _nested_ context manager, and requires 2.7+ or 3.1+
    for line in ercc_file:
        ercctranscript, erccsize = get_transcript_and_size(line)
        if ercctranscript not in blast_transcripts:
            print >> unknown_transcript, ercctranscript
        else:
            is_ninety_percent = blast_transcripts[ercctranscript] >= 0.9*erccsize
            print >> out_file, ercctranscript, is_ninety_percent

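
If the blast transcripts are not guaranteed to be a subset of the ERCC
ones, the three-loop version mentioned earlier might look roughly like
the sketch below. The names `load_sizes` and `compare` are made up for
illustration, and skipping lines with fewer than two columns is an
assumption about the data, not something taken from your files. Both
functions take any iterable of lines, so they can be tried without the
real files.

```python
def load_sizes(lines):
    """Build a {transcript: size} dict from whitespace-separated lines."""
    sizes = {}
    for line in lines:
        columns = line.split()
        if len(columns) < 2:
            continue  # assumption: blank or malformed lines are skipped
        sizes[columns[0]] = int(columns[1])
    return sizes


def compare(ercc_sizes, blast_sizes, threshold=0.9):
    """Third loop: compare the two dicts once both are fully loaded.

    Returns (results, missing_from_blast, missing_from_ercc).
    """
    results = {}
    missing_from_blast = []
    for transcript, ercc_size in sorted(ercc_sizes.items()):
        if transcript in blast_sizes:
            # True when the blast size is at least 90% of the ERCC size
            results[transcript] = blast_sizes[transcript] >= threshold * ercc_size
        else:
            missing_from_blast.append(transcript)
    # transcripts that appear only in the blast data
    missing_from_ercc = sorted(set(blast_sizes) - set(ercc_sizes))
    return results, missing_from_blast, missing_from_ercc
```

Because an open file is itself an iterable of lines, something like
`load_sizes(open('transcript_blast.txt'))` should work for the real
data, ideally inside a `with` block as in the code above.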
I've cleaned up your code a bit, using more consistent naming and the
same open/write procedures for all file access. Generally, any time
you're repeating code, you should stick it into a function and use
that instead, like the `get_transcript_and_size` function. If the
columns in your two files are separated by tabs, or always by the same
number of spaces, you can simplify this even further by using the csv
module: http://docs.python.org/2/library/csv.html
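
As a rough idea of what the csv approach could look like, here is a
minimal sketch that assumes the columns are separated by single tabs.
The function name `read_sizes` is hypothetical, and the tab delimiter
and the skipping of short rows are assumptions about your files.

```python
import csv


def read_sizes(tsv_lines):
    """Read tab-separated 'transcript<TAB>size' rows into a dict."""
    sizes = {}
    for row in csv.reader(tsv_lines, delimiter='\t'):
        if len(row) < 2:
            continue  # assumption: empty or short rows are ignored
        sizes[row[0]] = int(row[1])
    return sizes
```

`csv.reader` accepts any iterable of lines, so an open file object can
be passed straight in, e.g. `read_sizes(blast_file)` inside the `with`
block. On Python 3 the csv docs recommend opening the file with
`newline=''`.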
Hope this helps.