[Tutor] Extract element of dtype object
Mats Wichmann
mats at wichmann.us
Mon Dec 7 13:18:41 EST 2020
On 12/7/20 11:10 AM, Stephen P. Molnar wrote:
>
>
> On 12/07/2020 12:55 PM, Dennis Lee Bieber wrote:
>> o/~ Talking to myself in public o/~
>>
>> On Sun, 06 Dec 2020 14:33:06 -0500, Dennis Lee Bieber
>> <wlfraed at ix.netcom.com> declaimed the following:
>>
>>
>>> The best I can make out from the sketchy samples (the raw file contents
>>> don't help much once pandas has imported the file and you've
>>> created/edited the resulting dataframe; seeing the dataframe as pandas
>>> displays it is more useful). If your samples ARE as pandas displays
>>> them, then your input file should be preprocessed, as it has superfluous
>>> content that should not be part of the data frame itself. Perhaps
>>> /skiprows/ and /header/ should be provided to the .read_table() call to
>>> ensure the data frame only has the actual data (maybe: header=0,
>>> names=["mode", "affinity", "distance", ... ], skiprows=3)
>>>
>>>
>> Thoughts I had a few hours after that post...
>>
>> Is there any ICD/DDD* for the input file? That is, some document
>> describing the detail format of the file, such that a new-hire programmer
>> could write code to read/write that format, never having seen one before?
>>
>> At best, from what has been posted it appears to follow:
>>
>> I   The file is in readable text (?what encoding: ASCII, ISO-Latin-1,
>>     UTF8?)
>>
>> II  The file consists of a header section followed by data section
>>     A   The header section consists of three lines of text
>>         1   Line one consists of column labels delimited by | characters
>>         2   Line two consists of unit notations delimited by | characters
>>             a   Unit notations may be blank so long as there is a |
>>                 delimiter placeholder
>>         3   Line three consists of + and - characters
>>             a   + characters are used to mark column boundaries
>>                 (aligned with the | characters of previous lines)
>>             b   - characters are used to span the gap between adjacent
>>                 + characters
>>             c   The line exists merely for visual esthetics when
>>                 viewing the file in a monospaced font
>>
>>     B   The data section consists of one (is zero a viable file?) or
>>         more rows of numeric data delimited by TBD
>>         {the provided samples could be:
>>         1)  space filled fixed width fields: where is the width defined?
>>         2)  white-space delimited: does adjacent white-space collapse
>>             or does it indicate an empty field -- that is are:
>>                 123<tab><tab>456
>>                 123<tab>456
>>             to be treated as the same data, or does the first line
>>             contain three fields, the middle being empty?
>>         3)  comma-separated (or some other visible delimiter -- which
>>             needs to be specified) {based upon the sample, this is not
>>             the case for this data}}
>>
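The collapse-or-empty-field question in item 2 can be seen with nothing more than str.split(). This is only a sketch using the two rows from the outline; which behaviour the real file expects is not stated anywhere in the thread.

```python
# The two example rows from the outline, with real tab characters.
row_double_tab = "123\t\t456"
row_single_tab = "123\t456"

# split() with no argument collapses any run of whitespace,
# so both rows come out as the same two fields.
collapsed_double = row_double_tab.split()
collapsed_single = row_single_tab.split()

# split("\t") treats every tab as a delimiter, so two adjacent
# tabs produce an empty field in the middle.
strict_double = row_double_tab.split("\t")
strict_single = row_single_tab.split("\t")

print(collapsed_double, collapsed_single)
print(strict_double, strict_single)
```

Under the first reading both rows are identical; under the second, the first row has three fields. A format description has to say which one applies.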
>>
>> Note that the provided samples would confuse any CSV/TSV reading logic
>> -- which is what I understand the pandas read logic uses internally. Many
>> such readers attempt to interpret the format from the first couple of
>> rows, and attempt to extract column headers from the first row. Actually,
>> looking at the original post -- there is a column label that spans TWO
>> data columns, so that is going to confuse things too!
>>
>> What such a reader might infer from the sample file is that the data is |
>> delimited (from the first two lines), and then fail to parse the actual
>> data lines, as there are no | delimiters -- the entire data line would be
>> treated as a single text string in column 1.
>>
>>
>> Given the above hypothetical understanding of the input file, I'd:
>>
>> Manually open the input file and read the first line of the header;
>> close the file.
>> Split that line on | characters, and strip any remaining spaces to get
>> column labels {this won't handle the spanned columns properly -- you'd
>> have to append another label for the second "dist" column}
>>
>> >>> h1 = "mode | affinity | dist from best mode"
>> >>> h1.split("|")
>> ['mode ', ' affinity ', ' dist from best mode']
>> >>> [lbl.strip() for lbl in h1.split("|")]
>> ['mode', 'affinity', 'dist from best mode']
>>
>> I would then pass that list of labels to the pandas read function while
>> telling it to skip the first three lines of the file and to not attempt
>> to parse a header. One might have to specify some other attribute to
>> ensure the data rows are parsed properly if the file is not using commas
>> or tabs to separate elements on each row.
>>
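The approach described above might look roughly like the following. Treat it as a sketch under assumptions: the sample text and its numbers are invented to match the outline, the labels dist_1 and dist_2 are hypothetical stand-ins for the two columns under the spanned "dist" heading, and sep=r"\s+" assumes the data rows are whitespace delimited, which the thread has not confirmed.

```python
import io

import pandas as pd

# Invented sample shaped like the described file: three header lines,
# then whitespace-separated numeric rows. The values are made up.
SAMPLE = """\
mode | affinity | dist from best mode
     | (units)  |
-----+----------+---------------------
1   -7.5   0.000   0.000
2   -7.1   1.852   2.345
"""

# Step 1: take the column labels from the first header line.
first_line = SAMPLE.splitlines()[0]
labels = [lbl.strip() for lbl in first_line.split("|")]

# The last label spans two data columns, so replace it with two
# labels (dist_1 and dist_2 are hypothetical names).
labels = labels[:-1] + ["dist_1", "dist_2"]

# Step 2: skip the three header lines, parse no header from the file,
# and supply the labels explicitly.
df = pd.read_csv(
    io.StringIO(SAMPLE),
    sep=r"\s+",      # assumption: runs of whitespace separate the fields
    skiprows=3,
    header=None,
    names=labels,
)

print(df)
```

With a real file, the io.StringIO(SAMPLE) argument would be replaced by the file's path, and the first header line would be read from that file instead.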
>>
>>
>>
>>
>> * Interface Control Document or Data Definition(Description) Document
>>
>>
> Thanks for your detailed thoughts. I appreciate them.
>
> I happen to be an Organic Chemist, and computers with some of their
> languages are among the many tools that I have used in the past 60 years
> or so. As a gauge of my longevity, the first language that I managed to
> butcher was FORTRAN II. Unfortunately, a little bit of knowledge is a very
> dangerous thing and lasts for a long time.

Me too :) Fortunately, in my case, I appear to have kept FORTRAN II (and
later Pascal) from polluting my mind.

> line = file[28]

Just as a general comment, magic numbers like 28 aren't very desirable.
To the reader (including oneself at any time more than about two days
later): "Wait... 28? What does that mean?". At the very least, give it a
name. But better is to find a pattern to look for rather than a
specific numeric index, as that makes the code more resilient if the
input doesn't quite fit the original model at some later time.
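
As a rough illustration of looking for a pattern instead of a fixed index: the sketch below finds the row made only of + and - characters and returns the line after it. The sample lines and the function name find_first_data_line are invented for the example, and I don't know what line 28 of the real file actually holds.

```python
# Invented stand-in for the lines of the file; the real contents differ.
lines = [
    "some preamble text",
    "mode | affinity | dist from best mode",
    "     | (units)  |",
    "-----+----------+---------------------",
    "1   -7.5   0.000   0.000",
    "2   -7.1   1.852   2.345",
]


def find_first_data_line(lines):
    """Return the line that follows the +/- separator row.

    Raises ValueError if no separator row is found or nothing follows it.
    """
    for index, text in enumerate(lines):
        stripped = text.strip()
        # A separator row is non-empty and made only of + and - characters.
        if stripped and set(stripped) <= {"+", "-"}:
            if index + 1 < len(lines):
                return lines[index + 1]
            break
    raise ValueError("no data line found after a +/- separator row")


line = find_first_data_line(lines)
print(line)
```

If a few extra lines of preamble show up some day, this still finds the right row, where a hard-coded index would quietly pick the wrong one.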