[Tutor] Extract element of dtype object

Stephen P. Molnar s.molnar at sbcglobal.net
Mon Dec 7 13:10:24 EST 2020



On 12/07/2020 12:55 PM, Dennis Lee Bieber wrote:
> 	o/~ Talking to myself in public o/~
>
> On Sun, 06 Dec 2020 14:33:06 -0500, Dennis Lee Bieber
> <wlfraed at ix.netcom.com> declaimed the following:
>
>
>> 	The best I can make out from the sketchy samples (the raw file contents
>> don't do much once pandas has imported it and you've created/edited the
>> resulting dataframe; seeing the dataframe as pandas displays it is more
>> useful). If your samples ARE as pandas displays them, then your input file
>> should be preprocessed as it has superfluous content that should not be
>> part of the data frame itself. Perhaps /skiprows/ and /header/ should
>> be provided to the .read_table() call to ensure the data frame only has the
>> actual data (maybe: header=0, names=["mode", "affinity", "distance", ... ],
>> skiprows=3)
>>
>> 	
> 	Thoughts I had a few hours after that post...
>
> 	Is there any ICD/DDD* for the input file? That is, some document
> describing the detail format of the file, such that a new-hire programmer
> could write code to read/write that format, never having seen one before?
>
> 	At best, from what has been posted it appears to follow:
>
> I		The file is in readable text (?what encoding: ASCII, ISO-Latin-1,
> UTF8?)
>
> II		The file consists of a header section followed by data section
> 	A		The header section consists of three lines of text
> 		1		Line one consists of column labels delimited by |
> characters
> 		2		Line two consists of unit notations delimited by |
> characters
> 			a		Unit notations may be blank so long as there is a |
> delimiter placeholder
> 		3		Line three consists of + and - characters
> 			a		+ characters are used to mark column boundaries
> (aligned with the | characters of previous lines)
> 			b		- characters are used to span the gap between adjacent
> + characters
> 			c		The line exists merely for visual esthetics when
> viewing the file in a monospaced font
>
> 	B		The data section consists of one (is zero a viable file?) or
> more rows of numeric data delimited by TBD
> 		{the provided samples could be:
> 			1)	space filled fixed width fields: where is the width
> defined?
> 			2)	white-space delimited: does adjacent white-space collapse
> or does it indicate an empty field -- that is are:
> 				123<tab><tab>456
> 				123<tab>456
> 				to be treated as the same data, or does the first line
> contain three fields, the middle being empty?}
> 			3)	comma-separated (or some other visible delimiter -- which
> needs to be specified) {based upon the sample, this is not the case for
> this data}
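
The collapse-versus-empty-field question in item 2 can be checked directly in plain Python. This is only a small sketch using the two-tab example row from the outline above; it says nothing about which behavior the real file format intends.

```python
# The example row from the outline: two adjacent tabs between the values.
row = "123\t\t456"

# split() with no argument collapses runs of white-space into one separator.
collapsed = row.split()

# split("\t") treats every tab as a separator, so the middle field is empty.
strict = row.split("\t")

print(collapsed)  # ['123', '456']
print(strict)     # ['123', '', '456']
```

If the data rows are space-filled and right-aligned, the collapsing form is usually the one wanted, but that is an assumption until the format is documented.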
>
>
> 	Note that the provided samples would confuse any CSV/TSV reading logic
> -- which is what I understand the pandas read logic uses internally. Many
> such readers attempt to interpret the format from the first couple of rows,
> and attempt to extract column headers from the first row.  Actually,
> looking at the original post -- there is a column label that spans TWO data
> columns, so that is going to confuse things too!
>
> 	What such a reader might infer from the sample file is that the data is |
> delimited (from the first two lines), and then fail to parse the actual
> data lines, as there are no | delimiters -- the entire data line would be
> treated as a single text string in column 1.
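
The failure mode described above can be reproduced with the standard-library csv module. This is a rough sketch, not a claim about what pandas does internally: it simply forces a | delimiter on the header line and the one data row that appear in this thread.

```python
import csv
import io

# Only lines that appear in this thread: the first header line and one data row.
sample = (
    "mode |   affinity | dist from best mode\n"
    "    1    -8.538750714      0.000      0.000\n"
)

# Parse both lines as if the whole file were | delimited.
rows = list(csv.reader(io.StringIO(sample), delimiter="|"))

print(rows[0])  # header splits into three labels
print(rows[1])  # data row stays one single string
print(len(rows[0]), len(rows[1]))  # 3 1
```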
>
>
> 	Given the above hypothetical understanding of the input file, I'd:
>
> 	Manually open the input file and read the first line of the header;
> close the file.
> 	Split that line on | characters, and strip any remaining spaces to get
> column labels {this won't handle the spanned columns properly -- you'd have
> to append another label for the second "dist" column}
>
> >>> h1 = "mode |   affinity | dist from best mode"
> >>> h1.split("|")
> ['mode ', '   affinity ', ' dist from best mode']
> >>> [lbl.strip() for lbl in h1.split("|")]
> ['mode', 'affinity', 'dist from best mode']
> 	I would then pass that list of labels to the pandas read function while
> telling it to skip the first three lines of the file and to not attempt to
> parse a header. One might have to specify some other attribute to ensure
> the data rows are parsed properly if the file is not using commas or tabs
> to separate elements on each row.
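
A minimal sketch of that approach, under several assumptions: the units line and the +/- line in the sample are invented to match the layout described above (only the first header line and the data row appear in this thread), the data is assumed to be white-space delimited, and the names dist_lb and dist_ub for the spanned column are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample file. Lines 2 and 3 are made up to fit the described
# three-line header; line 1 and the data row are taken from this thread.
sample = (
    "mode |   affinity | dist from best mode\n"
    "     |            |\n"
    "-----+------------+--------------------\n"
    "    1    -8.538750714      0.000      0.000\n"
)

# Build the column labels from the first header line.
header_line = sample.splitlines()[0]
labels = [lbl.strip() for lbl in header_line.split("|")]

# The last label spans two data columns, so swap it for two labels
# (dist_lb / dist_ub are hypothetical names, not from the file).
labels = labels[:-1] + ["dist_lb", "dist_ub"]

# Skip the three header lines, do not parse a header, split on white-space.
df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",
    skiprows=3,
    header=None,
    names=labels,
)

print(df)
print(df.loc[0, "affinity"])
```

With a real file, the io.StringIO(sample) argument would be replaced by the file name; whether sep=r"\s+" is right depends on the delimiter question raised in the outline.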
>
>
>
>
>
> * Interface Control Document or Data Definition(Description) Document
>
>
Thanks for your detailed thoughts. I appreciate them.

I happen to be an Organic Chemist, and computers, along with some of their 
languages, are among the many tools that I have used in the past 60 years 
or so. As a gauge of my longevity, the first language that I managed to 
butcher was FORTRAN II. Unfortunately, a little bit of knowledge is a very 
dangerous thing and lasts for a long time.

Thanks to the suggestions that this post has received, I came up with 
this code, which seems to give me what I was after:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Mon Dec  7 06:49:04 2020

/home/comp/Apps/PythonDevelopment/ExtractData/ExtractDockData_4.py
"""
#pylint: disable = invalid-name
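# Read every line of the log file into a list.
lines = []
with open('test_1.log', 'rt') as myfile:
    for myline in myfile:
        lines.append(myline)
# Index 28 is the 29th line of the file, the row of interest.
line = lines[28]
print(line)
# Split on runs of white-space; fields[1] is the second value on the row.
fields = line.split()
print(fields[1])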

This results in:

runfile('/home/comp/Apps/PythonDevelopment/ExtractData/ExtractDockData_4.py', 
wdir='/home/comp/Apps/PythonDevelopment/ExtractData', 
current_namespace=True)
    1    -8.538750714      0.000      0.000

-- 
Stephen P. Molnar, Ph.D.
www.molecular-modeling.net
614.312.7528 (c)
Skype:  smolnar1


