[Tutor] Extract element of dtype object

Mats Wichmann mats at wichmann.us
Mon Dec 7 13:18:41 EST 2020


On 12/7/20 11:10 AM, Stephen P. Molnar wrote:
> 
> 
> On 12/07/2020 12:55 PM, Dennis Lee Bieber wrote:
>>     o/~ Talking to myself in public o/~
>>
>> On Sun, 06 Dec 2020 14:33:06 -0500, Dennis Lee Bieber
>> <wlfraed at ix.netcom.com> declaimed the following:
>>
>>
>>>     The best I can make out from the sketchy samples (the raw file
>>> contents don't do much once pandas has imported it and you've
>>> created/edited the resulting dataframe; seeing the dataframe as pandas
>>> displays it is more useful). If your samples ARE as pandas displays
>>> them, then your input file should be preprocessed as it has superfluous
>>> content that should not be part of the data frame itself. Perhaps
>>> /skiprows/ and /header/ should be provided to the .read_table() call to
>>> ensure the data frame only has the actual data (maybe: header=0,
>>> names=["mode", "affinity", "distance", ... ], skiprows=3)
>>>
>>>
>>     Thoughts I had a few hours after that post...
>>
>>     Is there any ICD/DDD* for the input file? That is, some document
>> describing the detail format of the file, such that a new-hire programmer
>> could write code to read/write that format, never having seen one before?
>>
>>     At best, from what has been posted it appears to follow:
>>
>> I        The file is in readable text (?what encoding: ASCII,
>> ISO-Latin-1, UTF8?)
>>
>> II        The file consists of a header section followed by data section
>>     A        The header section consists of three lines of text
>>         1        Line one consists of column labels delimited by |
>> characters
>>         2        Line two consists of unit notations delimited by |
>> characters
>>             a        Unit notations may be blank so long as there is a |
>> delimiter placeholder
>>         3        Line three consists of + and - characters
>>             a        + characters are used to mark column boundaries
>> (aligned with the | characters of previous lines)
>>             b        - characters are used to span the gap between
>> adjacent + characters
>>             c        The line exists merely for visual esthetics when
>> viewing the file in a monospaced font
>>
>>     B        The data section consists of one (is zero a viable file?) or
>> more rows of numeric data delimited by TBD
>>         {the provided samples could be:
>>             1)    space filled fixed width fields: where is the width
>> defined?
>>             2)    white-space delimited: does adjacent white-space
>> collapse or does it indicate an empty field -- that is are:
>>                 123<tab><tab>456
>>                 123<tab>456
>>                 to be treated as the same data, or does the first line
>> contain three fields, the middle being empty?}
>>             3)    comma-separated (or some other visible delimiter --
>> which needs to be specified) {based upon the sample, this is not the case
>> for this data}
>>
>>
>>     Note that the provided samples would confuse any CSV/TSV reading
>> logic -- which is what I understand the pandas read logic uses
>> internally. Many such readers attempt to interpret the format from the
>> first couple of rows, and attempt to extract column headers from the
>> first row.  Actually, looking at the original post -- there is a column
>> label that spans TWO data columns, so that is going to confuse things
>> too!
>>
>>     What such a reader might interpret from the sample file is that the
>> data is | delimited (from the first two lines), and then fail to parse
>> the actual data lines, as there are no | delimiters -- the entire data
>> line would be treated as a single text string in column 1.
>>
>>
>>     Given the above hypothetical understanding of the input file, I'd:
>>
>>     Manually open the input file and read the first line of the header;
>> close the file.
>>     Split that line on | characters, and strip any remaining spaces to
>> get column labels {this won't handle the spanned columns properly --
>> you'd have to append another label for the second "dist" column}
>>
>>>>> h1 = "mode |   affinity | dist from best mode"
>>>>> h1.split("|")
>> ['mode ', '   affinity ', ' dist from best mode']
>>>>> [lbl.strip() for lbl in h1.split("|")]
>> ['mode', 'affinity', 'dist from best mode']
>>     I would then pass that list of labels to the pandas read function
>> while telling it to skip the first three lines of the file and to not
>> attempt to parse a header. One might have to specify some other attribute
>> to ensure the data rows are parsed properly if the file is not using
>> commas or tabs to separate elements on each row.
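
A rough sketch of the steps described above, for illustration only. The
sample text is made up to match the layout described in this thread (it is
not the poster's real file), the units line and the names "dist_lower" and
"dist_upper" are invented, and it assumes the data rows are separated by
runs of whitespace. io.StringIO stands in for opening the file on disk.

```python
import io

import pandas as pd

# Made-up sample following the layout described above; not the real file.
SAMPLE = """\
mode |   affinity | dist from best mode
     | (kcal/mol) | lower    | upper
-----+------------+----------+----------
   1         -9.1      0.000      0.000
   2         -8.7      1.532      2.104
"""


def read_labels(handle):
    """Return the stripped column labels from the first header line."""
    first_line = handle.readline()
    return [lbl.strip() for lbl in first_line.split("|")]


labels = read_labels(io.StringIO(SAMPLE))
# The last label spans two data columns, so swap it for two invented names.
labels = labels[:-1] + ["dist_lower", "dist_upper"]

frame = pd.read_csv(
    io.StringIO(SAMPLE),
    sep=r"\s+",    # assumption: runs of whitespace separate the fields
    skiprows=3,    # skip the three header lines
    header=None,   # do not try to parse a header row
    names=labels,  # use the labels taken from line one instead
)
print(frame)
```

If the real file turns out to use fixed-width fields instead, pandas also
has read_fwf(), but which one fits depends on the answers to the format
questions above.
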
>>
>>
>>
>>
>>
>> * Interface Control Document or Data Definition(Description) Document
>>
>>
> Thanks for your detailed thoughts. I appreciate them.
> 
> I happen to be an Organic Chemist, and computers with some of their
> languages are among the many tools that I have used in the past 60 years
> or so. As a gauge of my longevity, the first language that I managed to
> butcher was FORTRAN II. Unfortunately, a little bit of knowledge is a very
> dangerous thing and lasts for a long time.

me too :) fortunately, in my case, I appear to have kept FORTRAN II (and 
later Pascal) from polluting my mind.

> line = file[28]

Just as a general comment, Magic Numbers like 28 aren't very desirable. 
To the reader (including oneself at any time more than about two days 
later): "Wait... 28? What does that mean?" At the very least, give it a 
name.  But better is to find a pattern to look for rather than a 
specific numeric index, as that makes the code more resilient if a later 
file doesn't quite fit the original layout.
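
As a small sketch of what I mean, with made-up data: the thread doesn't
show what actually sits at index 28, so this assumes (purely for
illustration) that the wanted line is the first data row after the +/-
rule line that ends the header. RULE_PREFIX and first_data_line are names
I invented here.

```python
# Made-up file contents; the real file is not shown in the thread.
lines = [
    "some preamble text",
    "mode |   affinity | dist from best mode",
    "     | (kcal/mol) | lower    | upper",
    "-----+------------+----------+----------",
    "   1         -9.1      0.000      0.000",
    "   2         -8.7      1.532      2.104",
]

# Assumed marker: the rule line of - and + characters closing the header.
RULE_PREFIX = "-----+"


def first_data_line(lines):
    """Return the line following the header rule, wherever it occurs."""
    for index, text in enumerate(lines):
        if text.startswith(RULE_PREFIX) and index + 1 < len(lines):
            return lines[index + 1]
    raise ValueError("header rule line followed by data not found")


line = first_data_line(lines)
print(line.split())
```

The lookup keeps working if a few extra lines show up ahead of the table,
which a hard-coded index would not.
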



