Regex to extract multiple fields in the same line

MRAB python at mrabarnett.plus.com
Wed Jun 13 15:00:27 EDT 2018


On 2018-06-13 18:32, Ganesh Pal wrote:
> On Wed, Jun 13, 2018 at 5:59 PM, Rhodri James <rhodri at kynesim.co.uk> wrote:
> 
>> On 13/06/18 09:08, Ganesh Pal wrote:
>>
>>>   Hi Team,
>>>
>>> I wanted to parse a file and extract a few fields that are present after "="
>>> in a text file.
>>>
>>>
>>> For example, from the line below I need to extract the values present after
>>> --struct=, --loc=, --size= and --log_file=
>>>
>>> Sample input
>>>
>>> line = '06/12/2018 11:13:23 AM python toolname.py  --struct=data_block
>>> --log_file=/var/1000111/test18.log --addr=None --loc=0 --mirror=10
>>> --path=/tmp/data_block.txt size=8'
>>>
>>
>> Did you mean "--size=8" at the end?  That's what your explanation implied.
> 
> Yes James, you got it right, I meant "--size=8".
> 
> Hi Team,
> 
> I played further with Python's re.findall() and I am able to extract all
> the required fields. I have two further questions too; please advise.
> 
> Question 1:
> 
> Please let me know the mistakes in the code below, and suggest whether it can
> be optimized further with a better regex.
> 
> 
> # This code has to extract various fields from a single line (assuming the
> # line is matched here) of a log file that contains various values, and then
> # store the extracted values in a dictionary.
> 
> import re
> 
> line = ('06/12/2018 11:13:23 AM python toolname.py  --struct=data_block '
>         '--log_file=/var/1000111/test18.log --addr=None --loc=0 --mirror=10 '
>         '--path=/tmp/data_block.txt --size=8')
> 
> # loc is a number
> r_loc = r"--loc=([0-9]+)"
> r_size = r'--size=([0-9]+)'
> r_struct = r'--struct=([A-Za-z_]+)'
> r_log_file = r'--log_file=([A-Za-z0-9_/.]+)'
> 
> 
Here you're searching for each match _twice_:

> if re.findall(r_loc, line):
>     print re.findall(r_loc, line)
> 
> if re.findall(r_size, line):
>     print re.findall(r_size, line)
> 
> if re.findall(r_struct, line):
>     print re.findall(r_struct, line)
> 
> if re.findall(r_log_file, line):
>     print re.findall(r_log_file, line)
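
To avoid the double search, keep the result of each findall and test that. A minimal sketch in Python 3 syntax, reusing your patterns; the names `patterns` and `results` are only illustrative and are not from your code:

```python
import re

line = ('06/12/2018 11:13:23 AM python toolname.py  --struct=data_block '
        '--log_file=/var/1000111/test18.log --addr=None --loc=0 --mirror=10 '
        '--path=/tmp/data_block.txt --size=8')

# Illustrative mapping of field name -> pattern (same regexes as above).
patterns = {
    'loc': r'--loc=([0-9]+)',
    'size': r'--size=([0-9]+)',
    'struct': r'--struct=([A-Za-z_]+)',
    'log_file': r'--log_file=([A-Za-z0-9_/.]+)',
}

results = {}
for name, pattern in patterns.items():
    found = re.findall(pattern, line)  # search once and keep the result
    if found:
        results[name] = found
        print(found)
```

Each pattern is now scanned over the line only once, and the values end up in a dictionary as well.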
> 
> 
> o/p:
> root@X1:/Play_ground/SPECIAL_TYPES/REGEX# python regex_002.py
> ['0']
> ['8']
> ['data_block']
> ['/var/1000111/test18.log']
> 
> 
> Question 2:
> 
> I tried to see if I can use re.search with a lookbehind assertion. It
> seems to work; any comments or suggestions?
> 
> Example:
> 
> import re
> 
> line = ('06/12/2018 11:13:23 AM python toolname.py  --struct=data_block '
>         '--log_file=/var/1000111/test18.log --addr=None --loc=0 --mirror=10 '
>         '--path=/tmp/data_block.txt --size=8')
> 
> match = re.search(r'(?P<loc>(?<=--loc=)([0-9]+))', line)
> if match:
>     print match.group('loc')
> 
> 
> o/p: root@X1:/Play_ground/SPECIAL_TYPES/REGEX# python regex_002.py
> 
> 0
> 
> 
> I want to build the sub-patterns and use match.group() to get the values,
> something as shown below, but it doesn't seem to work:
> 
> 
> match = re.search(r'(?P<loc>(?<=--loc=)([0-9]+))'
>                    r'(?P<size>(?<=--size=)([0-9]+))', line)
> if match:
>     print match.group('loc')
>     print match.group('size')
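
That pattern cannot match: the two groups are concatenated, so the "size" group has to start exactly where the "loc" digits end, and at that position the preceding text is a digit, never "--size=". A small sketch in Python 3 syntax that shows this, with one search per field as a workaround (the variable names are only illustrative):

```python
import re

line = ('06/12/2018 11:13:23 AM python toolname.py  --struct=data_block '
        '--log_file=/var/1000111/test18.log --addr=None --loc=0 --mirror=10 '
        '--path=/tmp/data_block.txt --size=8')

# The concatenated pattern requires the size digits to follow the loc
# digits immediately, so the second lookbehind can never succeed.
combined = re.search(r'(?P<loc>(?<=--loc=)([0-9]+))'
                     r'(?P<size>(?<=--size=)([0-9]+))', line)
print(combined)  # None

# Searching for each field separately works.
loc = re.search(r'(?<=--loc=)[0-9]+', line)
size = re.search(r'(?<=--size=)[0-9]+', line)
if loc and size:
    print(loc.group(), size.group())
```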
> 
You can combine them into a single findall:

>>> captures = re.findall(r'--(loc=[0-9]+)|--(size=[0-9]+)|--(struct=[A-Za-z_]+)|--(log_file=[A-Za-z0-9_/.]+)', line)
>>> captures
[('', '', 'struct=data_block', ''),
 ('', '', '', 'log_file=/var/1000111/test18.log'),
 ('loc=0', '', '', ''),
 ('', 'size=8', '', '')]

Each tuple in the list contains only one match; the other items are empty
strings, so get rid of the empty ones:

>>> [c for cap in captures for c in cap if c]

Split each of the matches on the first '=':

>>> [c.split('=', 1) for cap in captures for c in cap if c]

For ease of use, pass the key/value pairs into dict:

>>> info = dict(c.split('=', 1) for cap in captures for c in cap if c)
>>> info
{'struct': 'data_block', 'log_file': '/var/1000111/test18.log', 'loc': '0', 'size': '8'}
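
Put together, the steps above could be wrapped up roughly like this. It is only a sketch: the names `FIELD_RE` and `extract_fields` are made up for illustration, and the regex is precompiled purely as a convenience.

```python
import re

# Same alternation as in the findall call above, compiled once.
FIELD_RE = re.compile(r'--(loc=[0-9]+)|--(size=[0-9]+)'
                      r'|--(struct=[A-Za-z_]+)|--(log_file=[A-Za-z0-9_/.]+)')


def extract_fields(line):
    """Return a dict of the recognised --key=value options in line."""
    captures = FIELD_RE.findall(line)
    # Drop the empty strings in each tuple, then split on the first '='.
    return dict(c.split('=', 1) for cap in captures for c in cap if c)


line = ('06/12/2018 11:13:23 AM python toolname.py  --struct=data_block '
        '--log_file=/var/1000111/test18.log --addr=None --loc=0 --mirror=10 '
        '--path=/tmp/data_block.txt --size=8')

info = extract_fields(line)
print(info['loc'], info['size'], info['struct'], info['log_file'])
```

A line with none of those options gives an empty dict, so the caller can simply test the result.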


