[Tutor] Extract element of dtype object
Stephen P. Molnar
s.molnar at sbcglobal.net
Mon Dec 7 13:10:24 EST 2020
On 12/07/2020 12:55 PM, Dennis Lee Bieber wrote:
> o/~ Talking to myself in public o/~
>
> On Sun, 06 Dec 2020 14:33:06 -0500, Dennis Lee Bieber
> <wlfraed at ix.netcom.com> declaimed the following:
>
>
>> The best I can make out from the sketchy samples (the raw file contents
>> doesn't do much once pandas has imported it and you've created/edited the
>> resulting dataframe; seeing the dataframe as pandas displays it is more
>> useful). If your samples ARE as pandas displays them, then your input file
>> should be preprocessed as it has superfluous content that should not be
>> part of the data frame itself. Perhaps using /skiprows/ and /header/ should
>> be provided to the .read_table() call to ensure the data frame only has the
>> actual data (maybe: header=0, names=["mode", "affinity", "distance", ... ],
>> skiprows=3)
>>
>>
> Thoughts I had a few hours after that post...
>
> Is there any ICD/DDD* for the input file? That is, some document
> describing the detail format of the file, such that a new-hire programmer
> could write code to read/write that format, never having seen one before?
>
> At best, from what has been posted it appears to follow:
>
> I The file is in readable text (?what encoding: ASCII, ISO-Latin-1,
> UTF8?)
>
> II The file consists of a header section followed by data section
> A The header section consists of three lines of text
> 1 Line one consists of column labels delimited by |
> characters
> 2 Line two consists of unit notations delimited by |
> characters
> a Unit notations may be blank so long as there is a |
> delimiter placeholder
> 3 Line three consists of + and - characters
> a + characters are used to mark column boundaries
> (aligned with the | characters of previous lines)
> b - characters are used to span the gap between adjacent
> + characters
> c The line exists merely for visual esthetics when
> viewing the file in a monospaced font
>
> B The data section consists of one (is zero a viable file?) or
> more rows of numeric data delimited by TBD
> {the provided samples could be:
> 1) space filled fixed width fields: where is the width
> defined?
> 2) white-space delimited: does adjacent white-space collapse
> or does it indicate an empty field -- that is are:
> 123<tab><tab>456
> 123<tab>456
> to be treated as the same data, or does the first line
> contain three fields, the middle being empty?}
> 3) comma-separated (or some other visible delimiter -- which
> needs to be specified) {based upon the sample, this is not the case for
> this data}
>
>
> Note that the provided samples would confuse any CSV/TSV reading logic
> -- which is what I understand the pandas read logic uses internally. Many
> such readers attempt to interpret the format from the first couple of rows,
> and attempt to extract column headers from the first row. Actually,
> looking at the original post -- there is a column label that spans TWO data
> columns, so that is going to confuse things too!
>
> What such might interpret from the sample file is that the data is |
> delimited (from the first two lines), and then fail to parse the actual
> data lines, as there are no | delimiters -- the entire data line would be
> treated as a single text string in column 1.
>
>
> Given the above hypothetical understanding of the input file, I'd
>
> Manually open the input file and read the first line of the header;
> close the file.
> Split that line on | characters, and strip any remaining spaces to get
> column labels {this won't handle the spanned columns properly -- you'd have
> to append another label for the second "dist" column}
>
> >>> h1 = "mode | affinity | dist from best mode"
> >>> h1.split("|")
> ['mode ', ' affinity ', ' dist from best mode']
> >>> [lbl.strip() for lbl in h1.split("|")]
> ['mode', 'affinity', 'dist from best mode']
> I would then pass that list of labels to the pandas read function while
> telling it to skip the first three lines of the file and to not attempt to
> parse a header. One might have to specify some other attribute to ensure
> the data rows are parsed properly if the file is not using commas or tabs
> to separate elements on each row.
>
>
>
>
>
> * Interface Control Document or Data Definition(Description) Document
>
>
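The steps outlined in the quoted message can be sketched with the standard
library alone. This is only an illustration under stated assumptions: the
sample text is made up to follow the layout described above (only the first
header line and the first data row come from this thread), the data rows are
assumed to be whitespace-delimited, and the function name and the extra
label for the spanned column are hypothetical.

```python
import io

# Made-up sample in the layout described above. Only the first header line
# and the first data row are taken from this thread; the rest is invented.
SAMPLE = """\
mode | affinity | dist from best mode
     | (units)  |          |
-----+----------+----------+----------
   1 -8.538750714 0.000 0.000
   2 -8.100000000 1.500 2.250
"""


def read_table(stream, extra_labels=("dist (second column)",)):
    """Return (labels, rows) from a stream with a three-line header.

    Hypothetical helper. extra_labels supplies a name for the second data
    column that sits under the spanning "dist from best mode" label.
    """
    # Header line 1: column labels delimited by "|".
    labels = [lbl.strip() for lbl in stream.readline().split("|")]
    labels.extend(extra_labels)
    stream.readline()  # header line 2: unit notations, ignored here
    stream.readline()  # header line 3: the +/- ruler, ignored here
    rows = []
    for line in stream:
        fields = line.split()  # assumption: whitespace-delimited data
        if fields:  # skip blank lines
            rows.append(dict(zip(labels, fields)))
    return labels, rows


labels, rows = read_table(io.StringIO(SAMPLE))
print(labels)
print(rows[0]["affinity"])
```

The values stay as strings here; converting them with float() is a separate
step. The same list of labels could instead be handed to a pandas reader, as
the quoted message suggests.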
Thanks for your detailed thoughts. I appreciate them.
I happen to be an Organic Chemist, and computers, with some of their
languages, are among the many tools that I have used in the past 60 years
or so. As a gauge of my longevity, the first language that I managed to
butcher was FORTRAN II. Unfortunately, a little bit of knowledge is a very
dangerous thing and lasts for a long time.
Thanks to the suggestions this post has received, I came up with
this code, which seems to give me what I was after:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Mon Dec 7 06:49:04 2020
/home/comp/Apps/PythonDevelopment/ExtractData/ExtractDockData_4.py
"""
#pylint: disable = invalid-name
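file = []
# Read the whole log into a list of lines.
with open('test_1.log', 'rt') as myfile:
    for myline in myfile:
        file.append(myline)
line = file[28]  # index 28 is the 29th line of the log
print(line)
fields = line.split()  # split on runs of whitespace
print(fields[1])  # second field of that row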
This results in:
runfile('/home/comp/Apps/PythonDevelopment/ExtractData/ExtractDockData_4.py',
wdir='/home/comp/Apps/PythonDevelopment/ExtractData',
current_namespace=True)
1 -8.538750714 0.000 0.000
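The script above depends on the data row always being at index 28. A
variation that looks for the row instead might be sketched as follows. It
assumes the table starts right after a ruler line made only of - and +
characters, as described in the quoted message, and that the rows are
whitespace-delimited. The function name and the sample log text are
hypothetical; only the first data row is taken from the output above.

```python
import io

# Made-up log text; only the first data row comes from the output above.
SAMPLE_LOG = """\
some preamble text
another preamble line
mode | affinity | dist from best mode
-----+----------+--------------------
   1 -8.538750714 0.000 0.000
   2 -7.250000000 1.234 2.345
"""


def affinity_for_mode(stream, mode=1):
    """Return the second field of the row whose first field equals mode.

    Hypothetical helper. Returns None if no such row follows the ruler.
    """
    in_table = False
    for line in stream:
        stripped = line.strip()
        if not in_table:
            # The ruler line consists only of '-' and '+' characters.
            in_table = "+" in stripped and set(stripped) <= set("-+")
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == str(mode):
            return float(fields[1])
    return None


print(affinity_for_mode(io.StringIO(SAMPLE_LOG)))
```

With a real file, the stream would come from open('test_1.log', 'rt')
instead of io.StringIO.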
--
Stephen P. Molnar, Ph.D.
www.molecular-modeling.net
614.312.7528 (c)
Skype: smolnar1