pattern block expression matching

Peter Otten __peter__ at web.de
Sat Jul 21 13:20:01 EDT 2018


MRAB wrote:

> On 2018-07-21 15:20, aldi.kraja at gmail.com wrote:
>> Hi,
>> I have a long text, which tells me which files from a database were
>> downloaded and which ones failed. The pattern is as follows (at the end
>> of this post). Wrote a tiny program, but still is raw. I want to find
>> term "ERROR" and go 5 lines above and get the name with suffix XPT, in
>> this first case DRXIFF_F.XPT, but it changes in other cases to some other
>> name with suffix XPT. Thanks, Aldi
>> 
>> # reading errors from a file txt
>> import re
>> with open('nohup.out', 'r') as fh:
>>    lines = fh.readlines()
>>    for line in lines:
>>        m1 = re.search("XPT", line)
>>        m2 = re.search('ERROR', line)
>>        if m1:
>>          print(line)
>>        if m2:
>>          print(line)
>> 
> Firstly, you don't need regex for something as simple as checking for
> the presence of a string.
> 
> Secondly, I think it's 4 lines above, not 5.
> 
> 'enumerate' comes in useful here:
> 
> with open('nohup.out', 'r') as fh:
>      lines = fh.readlines()
>      for i, line in enumerate(lines):
>          if 'ERROR' in line:
>              print(line)
>              print(lines[i - 4])

Here's an alternative that works when the file is huge, and reading it into 
memory is impractical:

import itertools

def get_url(line):
    return line.rsplit(None, 1)[-1]

def pairs(lines, step=4):
    a, b = itertools.tee(lines)
    return zip(a, itertools.islice(b, step, None))

with open("nohup.out") as f:
    for s, t in pairs(f, 4):
        if "ERROR" in t:
            assert "XPT" in s
            print(get_url(s))
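
To see what pairs() actually yields, here is a small sketch that runs it on an
in-memory list instead of a file. The sample lines are made up for illustration
and only mimic the shape of the log; tee() has to buffer only the lines by
which one iterator is ahead of the other, which is why the approach should
stay cheap on a huge file.

```python
import itertools

def pairs(lines, step=4):
    # Two independent iterators over the same source; tee() buffers only
    # the items by which one iterator is ahead of the other.
    a, b = itertools.tee(lines)
    # Skip the first `step` items of b, so every line is paired with the
    # line that follows it `step` positions later.
    return zip(a, itertools.islice(b, step, None))

# Hypothetical sample data, not taken from the real nohup.out.
sample = ["url-line", "resolving", "connecting", "request", "ERROR 404", "", "next"]

result = list(pairs(sample, 4))
for s, t in result:
    print(repr(s), "->", repr(t))
```

Only the first pair has "ERROR" in its second element, and its first element
is the line four positions earlier, which is the one that would carry the
file name.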

And here's yet another way that assumes that 

(1) the groups are separated by empty lines
(2) the first line always contains the file name
(3) "ERROR" may occur in any of the lines that follow

def groups(lines):
    return (
        group
        for key, group in itertools.groupby(lines, key=str.isspace)
        if not key
    )

with open("nohup.out") as f:
    for group in groups(f):
        first = next(group)
        if any("ERROR" in line for line in group):
            assert "XPT" in first
            print(get_url(first))
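
A quick way to check the grouping logic without the real log is to feed it a
list of lines. This is only a sketch: the example.invalid URLs and the
shortened log lines are invented, and every line is assumed to end with a
newline, as lines read from a file do (str.isspace() is False for a truly
empty string).

```python
import itertools

def get_url(line):
    # Last whitespace-separated token; trailing newline is dropped by split.
    return line.rsplit(None, 1)[-1]

def groups(lines):
    # Consecutive non-blank lines form one group; blank runs are skipped.
    return (
        group
        for key, group in itertools.groupby(lines, key=str.isspace)
        if not key
    )

# Hypothetical sample data with one failed and one successful download.
sample = [
    "--2018-07-14 21:26:45--  https://example.invalid/A.XPT\n",
    "HTTP request sent, awaiting response... 404 Not Found\n",
    "2018-07-14 21:26:46 ERROR 404: Not Found.\n",
    "\n",
    "--2018-07-14 21:26:47--  https://example.invalid/B.XPT\n",
    "HTTP request sent, awaiting response... 200 OK\n",
]

failed = []
for group in groups(sample):
    first = next(group)  # first line of the group holds the URL
    if any("ERROR" in line for line in group):
        failed.append(get_url(first))

print(failed)
```

Each group is an iterator tied to the underlying groupby object, so it has to
be used inside the loop body, before the next group is requested.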

 
>> --2018-07-14 21:26:45-- 
>> https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DRXIFF_F.XPT Resolving
>> wwwn.cdc.gov (wwwn.cdc.gov)... 198.246.102.39 Connecting to wwwn.cdc.gov
>> (wwwn.cdc.gov)|198.246.102.39|:443... connected. HTTP request sent,
>> awaiting response... 404 Not Found 2018-07-14 21:26:46 ERROR 404: Not
>> Found.
>> 
>> --2018-07-14 21:26:46-- 
>> https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DRXTOT_F.XPT Resolving
>> wwwn.cdc.gov (wwwn.cdc.gov)... 198.246.102.39 Connecting to wwwn.cdc.gov
>> (wwwn.cdc.gov)|198.246.102.39|:443... connected. HTTP request sent,
>> awaiting response... 404 Not Found 2018-07-14 21:26:46 ERROR 404: Not
>> Found.
>> 
>> --2018-07-14 21:26:46-- 
>> https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DRXFMT_F.XPT Resolving
>> wwwn.cdc.gov (wwwn.cdc.gov)... 198.246.102.39 Connecting to wwwn.cdc.gov
>> (wwwn.cdc.gov)|198.246.102.39|:443... connected. HTTP request sent,
>> awaiting response... 404 Not Found 2018-07-14 21:26:46 ERROR 404: Not
>> Found.
>> 
>> --2018-07-14 21:26:46-- 
>> https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DSQ1_F.XPT Resolving
>> wwwn.cdc.gov (wwwn.cdc.gov)... 198.246.102.39 Connecting to wwwn.cdc.gov
>> (wwwn.cdc.gov)|198.246.102.39|:443... connected. HTTP request sent,
>> awaiting response... 404 Not Found 2018-07-14 21:26:47 ERROR 404: Not
>> Found.
>> 
>> --2018-07-14 21:26:47-- 
>> https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DSII.XPT Resolving
>> wwwn.cdc.gov (wwwn.cdc.gov)... 198.246.102.39 Connecting to wwwn.cdc.gov
>> (wwwn.cdc.gov)|198.246.102.39|:443... connected. HTTP request sent,
>> awaiting response... 200 OK Length: 56060880 (53M)
>> [application/octet-stream] Saving to: ‘DSII.XPT’
>> 
> 




