[Tutor] Labeling and Sorting a Test File

Fri Jul 30 12:20:19 EDT 2021

On Fri, 30 Jul 2021 10:27:26 -0400, "Stephen P. Molnar"
<s.molnar at sbcglobal.net> declaimed the following:

>First of all, let me say this is not a school exercise.
>
>I have a project that I am working on that is going to generate a large 
>number of text files, perhaps in the thousands.
>The format of these files is:
>
>Detected 8 CPUs
>Reading input ... done.
>Setting up the scoring function ... done.
>Analyzing the binding site ... done.
>Using random seed: -596016
>Performing search ... done.
>
	What application is generating files interspersing progress reports
with...

>Refining results ... done.
>
>mode |   affinity | dist from best mode
>      | (kcal/mol) | rmsd l.b.| rmsd u.b.
>-----+------------+----------+----------
>    1    -4.780168282      0.000      0.000
>    2    -4.767296818      8.730     11.993
>    3    -4.709057289     13.401     15.939
>    4    -4.677956834      8.271     11.344
>    5    -4.633563903      8.581     11.967
>    6    -4.569815226      4.730      8.233
>    7    -4.540149947      8.578     11.959
>    8    -4.515237403      8.096     10.215
>    9    -4.514233086      5.064      7.689
>Writing output ... done.

... a tabular text dump of said data?

>
>I have managed, through a combination of fruitlessly searching Google 
>and making errors, this Python code (attached):
>
>#!/usr/bin/env python3
># -*- coding: utf-8 -*-
>
>import numpy as np
>
>with open("Ligand.list") as ligands_f:
>     for line in ligands_f:
>         ligand = line.strip()
>         for num in range(1,11):
>             #print(ligand)
>             name_in = "{}.{}.log".format(ligand,num)
>             data = np.genfromtxt(name_in,skip_header=28, skip_footer=1)
>             name_s = ligand+'-BE'
>             f = open(name_s, 'a')
>             f.write(str(data[1,1])+'\n')
>             f.close()
>
	Why open and close the file for each write operation, when you are
going to write 10 lines to it, one per input file? Also, what do you expect
to happen if the output file already exists when you process the first
file? Mode 'a' appends to whatever is already there.

import numpy as np

with open("Ligand.list") as ligands_f:
    for line in ligands_f:
        ligand = line.strip()
        #"w" starts a fresh output file; it is opened once per ligand
        with open(ligand+"-BE", "w") as outfil:
            for num in range(1,11):
                name_in = "{}.{}.log".format(ligand,num)
                data = np.genfromtxt(name_in,skip_header=28, skip_footer=1)
                outfil.write(str(data[1,1])+'\n')
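
	To make the 'a' versus 'w' point concrete, here is a minimal sketch
that simulates running a script twice against the same output file. The
file names and the helpers write_run() and count_lines() are made up for
the illustration; only the open() modes matter.

```python
import os
import tempfile

def write_run(path, values, mode):
    """Write one value per line using the given open() mode ("a" or "w")."""
    with open(path, mode) as outfil:
        for value in values:
            outfil.write("%s\n" % value)

def count_lines(path):
    """Return the number of lines currently in the file."""
    with open(path) as infil:
        return sum(1 for _ in infil)

with tempfile.TemporaryDirectory() as tmp:
    appended = os.path.join(tmp, "Demo-BE-append")    # hypothetical names
    truncated = os.path.join(tmp, "Demo-BE-write")
    for _ in range(2):                                # simulate two runs
        write_run(appended, [-4.7, -4.6], "a")        # keeps old contents
        write_run(truncated, [-4.7, -4.6], "w")       # starts from empty
    appended_lines = count_lines(appended)
    truncated_lines = count_lines(truncated)

print(appended_lines, truncated_lines)
```

	With 'a' the second run doubles the file; with 'w' each run replaces
it.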

>The result of applying the code to ten files is:
>
>-4.709057289
>-4.66850894
>-4.747875776
>-4.631865671
>-4.661709186
>-4.686041874
>-4.632855261
>-4.617575733
>-4.734570162
>-4.727217506
>
	Which is somewhat meaningless without being shown the ten input
files... Are those the first data entry, the last data entry, or something
in between? I'm guessing the first entry from each file.

>This is almost the way I want the file, with two problems:
>
>1. I want the first line in the text file to be the name of the file,
>in the example 'Test'.

	Which file? Your code has an input file of, apparently, partial file
names, then a series of files using the partial file name read from that
first file, with a series of numerics appended, and an output file name?

	Where does "Test" come from?

	As for writing it -- given my quick rewrite of your main code above --
you would write that line between the "with open..." and "for num..."
statements, so that it goes out once, before any of the values.
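
	A minimal sketch of that placement, using an in-memory buffer in place
of the real output file. The helper write_labeled() is hypothetical, and
"Test" is just the label from your example; the two values are taken from
your posted results.

```python
import io

def write_labeled(outfil, label, values):
    """Write the label once, then one value per line."""
    outfil.write("%s\n" % label)        # header goes out before the loop
    for value in values:
        outfil.write("%s\n" % value)

buf = io.StringIO()                     # stands in for open(name, "w")
write_labeled(buf, "Test", [-4.709057289, -4.66850894])
labeled_text = buf.getvalue()
print(labeled_text)
```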
>
>2. I want the list sorted in the order of decreasing negative order:
>

	That is just a normal ascending numeric sort...

	You can't do that unless you create the list in memory. Since you are
processing one item from each of 10 files, and immediately writing it out,
you are stuck with the order found in the files.

	This means moving the output to OUTSIDE of the loop over the ten
files, accumulating the values in a list, and sorting that list before
writing anything.
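
	As a sketch of accumulate, sort, then write, here are the ten values
from your posted output run through that sequence. The list
per_file_values is a stand-in for "read one value from each file"; nothing
here touches the disk.

```python
# Values as posted in the original message, in the order the files gave them.
per_file_values = [
    -4.709057289, -4.66850894, -4.747875776, -4.631865671, -4.661709186,
    -4.686041874, -4.632855261, -4.617575733, -4.734570162, -4.727217506,
]

results = []
for value in per_file_values:   # stands in for the loop over the ten files
    results.append(value)       # accumulate only, do not write yet

results.sort()                  # ascending, so the most negative comes first

# Build the output once, after the loop has finished.
output_text = "".join("%s\n" % itm for itm in results)
print(output_text)
```

	The sorted list starts at -4.747875776 and ends at -4.617575733.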

>-4.747875776
>-4.734570162
>-4.727217506
>-4.709057289
>-4.686041874
>-4.66850894
>-4.661709186
>-4.632855261
>-4.631865671
>-4.617575733
>

>As my Python programming skills are limited, at best, I would appreciate 

	Spend some time with the Python Tutorial, and the library reference
manual.

	Given you seem to be interested in only the first data line reported in
each file, I'd have skipped the entire numpy overhead (you are reading in a
whole file, just to throw all but one line away).

>>> _, affinity, _, _ = "    1    -4.780168282      0.000      0.000".split()
>>> affinity
'-4.780168282'
>>> float(affinity)
-4.780168282
>>> 

	With no massive import, just basic Python...

-=-=-
with open("Ligand.list") as ligands_f:
    for line in ligands_f:
        ligand = line.strip()
        outfn = ligand+"-BE"
        results = []
        for num in range(1,11):
            with open("%s.%s.log" % (ligand, num), "r") as infil:
                while True:
                    ln = infil.readline()
                    if not ln:
                        #end of file reached without finding the table
                        break
                    if ln.startswith("---"):
                        #the line after the dashes is the first data row
                        ln = infil.readline()
                        _, affinity, _, _ = ln.split()
                        #if possibility of more than 4 "words" on line
                        #words = ln.split()
                        #affinity = words[1]
                        results.append(float(affinity))
                        break

        #all ten files read: sort, then write label and values once
        with open(outfn, "w") as outfil:
            results.sort()
            outfil.write("%s\n" % outfn) #or whatever you really desire
            outfil.write("".join(["%s\n" % itm for itm in results]))
-=-=- 
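
	The inner search can also be pulled out into a function and tried on
an in-memory sample before pointing it at thousands of files. This is only
a sketch: first_affinity() is a hypothetical helper, the sample text is cut
down from the file format quoted above, and returning None when no table is
found is my assumption about what you would want.

```python
def first_affinity(lines):
    """Return the affinity from the first row after the dashed line, or None."""
    lines = iter(lines)
    for ln in lines:
        if ln.startswith("---"):
            words = next(lines, "").split()   # row for mode 1, if present
            if len(words) >= 2:
                return float(words[1])        # second column is the affinity
            return None
    return None                               # no dashed line in the input

sample = """\
Refining results ... done.

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
   1    -4.780168282      0.000      0.000
   2    -4.767296818      8.730     11.993
Writing output ... done.
"""

best = first_affinity(sample.splitlines())
missing = first_affinity(["Performing search ... done."])
print(best, missing)
```

	An open file can be passed in place of sample.splitlines(), since it
is also an iterable of lines.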

-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
	wlfraed at ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/