convert script awk in python

Loris Bennett loris.bennett at fu-berlin.de
Thu Mar 25 06:16:37 EDT 2021


Peter Otten <__peter__ at web.de> writes:

> On 25/03/2021 08:14, Loris Bennett wrote:
>
>> I'm not doing that, but I am trying to replace a longish bash pipeline
>> with Python code.
>>
>> Within Emacs, I often use Org mode[1] to generate data via some bash
>> commands and then visualise the data via Python.  Thus, in a single Org
>> file I run
>>
>>    /usr/bin/sacct  -u $user -o jobid -X -S $start -E $end -s COMPLETED -n  | \
>>    xargs -I {} seff {} | grep 'Efficiency' | sed '$!N;s/\n/ /' | awk '{print $3 " " $9}' | sed 's/%//g'
>>
>> The raw numbers are formatted by Org into a table
>>
>>    | cpu_eff | mem_eff |
>>    |---------+---------|
>>    |    96.6 |   99.11 |
>>    |   93.43 |   100.0 |
>>    |    91.3 |   100.0 |
>>    |   88.71 |   100.0 |
>>    |   89.79 |   100.0 |
>>    |   84.59 |   100.0 |
>>    |   83.42 |   100.0 |
>>    |   86.09 |   100.0 |
>>    |   92.31 |   100.0 |
>>    |   90.05 |   100.0 |
>>    |   81.98 |   100.0 |
>>    |   90.76 |   100.0 |
>>    |   75.36 |   64.03 |
>>
>> I then read this into some Python code in the Org file and do something like
>>
>>    df = pd.DataFrame(eff_tab[1:], columns=eff_tab[0])
>>    cpu_data = df.loc[: , "cpu_eff"]
>>    mem_data = df.loc[: , "mem_eff"]
>>
>>    ...
>>
>>    n, bins, patches = axis[0].hist(cpu_data, bins=range(0, 110, 5))
>>    n, bins, patches = axis[1].hist(mem_data, bins=range(0, 110, 5))
>>
>> which generates nice histograms.
>>
>> I decided to rewrite the whole thing as a stand-alone Python program so
>> that I can run it as a cron job.  However, as a novice Python programmer
>> I am finding translating the bash part slightly clunky.  I am in the
>> middle of doing this and started with the following:
>>
>>          sacct = subprocess.Popen(["/usr/bin/sacct",
>>                                    "-u", user,
>>                                    "-S", period[0], "-E", period[1],
>>                                    "-o", "jobid", "-X",
>>                                    "-s", "COMPLETED", "-n"],
>>                                   stdout=subprocess.PIPE,
>>          )
>>
>>          jobids = []
>>
>>          for line in sacct.stdout:
>>              jobid = str(line.strip(), 'UTF-8')
>>              jobids.append(jobid)
>>
>>          for jobid in jobids:
>>              seff = subprocess.Popen(["/usr/bin/seff", jobid],
>>                                      stdin=sacct.stdout,
>>                                      stdout=subprocess.PIPE,
>>              )
>
> The statement above looks odd. If seff can read the jobids from stdin
> there should be no need to pass them individually, like:
>
> sacct = ...
> seff = Popen(
>   ["/usr/bin/seff"], stdin=sacct.stdout, stdout=subprocess.PIPE,
>   universal_newlines=True
> )
> for line in seff.communicate()[0].splitlines():
>     ...

Indeed, seff cannot read multiple jobids.  That's why I had 'xargs' in the
original bash code.  Initially I thought of calling 'xargs' via
Popen, but this seemed very fiddly (I didn't manage to get it working)
and anyway seemed a bit weird to me as it is really just a loop, which I
can implement perfectly well in Python.
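
Just to make that concrete, here is the sort of loop I have in mind -- a
rough, untested sketch, where the exact wording of seff's "CPU Efficiency"
and "Memory Efficiency" lines is an assumption on my part, based on the
fields my grep/awk picked out above:

    import subprocess

    def job_efficiencies(user, start, end):
        """Return (cpu_eff, mem_eff) pairs for the user's completed jobs."""
        # Same sacct call as before, but via the simpler run() interface.
        sacct = subprocess.run(
            ["/usr/bin/sacct", "-u", user, "-S", start, "-E", end,
             "-o", "jobid", "-X", "-s", "COMPLETED", "-n"],
            capture_output=True, text=True, check=True,
        )
        jobids = sacct.stdout.split()

        efficiencies = []
        for jobid in jobids:
            # One seff call per job ID -- this replaces the xargs stage.
            seff = subprocess.run(
                ["/usr/bin/seff", jobid],
                capture_output=True, text=True, check=True,
            )
            values = {}
            for line in seff.stdout.splitlines():
                # e.g. "CPU Efficiency: 96.60% of ..." (assumed format).
                if "Efficiency" in line:
                    key, _, rest = line.partition(":")
                    values[key.strip()] = float(rest.split("%")[0])
            if values:
                efficiencies.append((values.get("CPU Efficiency"),
                                     values.get("Memory Efficiency")))
        return efficiencies

From there the (cpu_eff, mem_eff) pairs could go straight into the pandas
and histogram code, with no sed/awk massaging in between.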

Cheers,

Loris


>>              seff_output = []
>>              for line in seff.stdout:
>>                  seff_output.append(str(line.strip(), "UTF-8"))
>>
>>              ...
>>
>> but compared to the bash pipeline, this all seems a bit laboured.
>>
>> Does anyone have a better approach?
>>
>> Cheers,
>>
>> Loris
>>
>>
>>> -----Original Message-----
>>> From: Cameron Simpson <cs at cskk.id.au>
>>> Sent: Wednesday, March 24, 2021 6:34 PM
>>> To: Avi Gross <avigross at verizon.net>
>>> Cc: python-list at python.org
>>> Subject: Re: convert script awk in python
>>>
>>> On 24Mar2021 12:00, Avi Gross <avigross at verizon.net> wrote:
>>>> But I wonder how much languages like AWK are still used to make new
>>>> programs, compared to the time when they were really useful.
>>>
>>> You mentioned in an adjacent post that you've not used AWK since 2000.
>>> By contrast, I still use it regularly.
>>>
>>> It's great for proof of concept at the command line or in small scripts, and
>>> as the innards of quite useful scripts. I've a trite "colsum" script which
>>> does nothing but generate and run a little awk programme to sum a column,
>>> and routinely type "blah .... | colsum 2" or the like to get a tally.
>>>
>>> I totally agree that once you're processing a lot of data, or a shell
>>> script is making long pipelines or many command invocations, then if
>>> that's a performance issue it is time to recode.
>>>
>>> Cheers,
>>> Cameron Simpson <cs at cskk.id.au>
>>
>> Footnotes:
>> [1]  https://orgmode.org/
>>
>
-- 
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de

