[Tutor] regular expression query
mhysnm1964 at gmail.com
Sat Jun 8 08:27:28 EDT 2019
Hello all,
Windows 10 OS, Python 3.6
I have a couple of queries in relation to extracting content using regular
expressions. I understand the pattern chars (.?*+) and the meta-chars \d, \D,
\w, \W and so on; the class structure [.]; the group, I believe, (.); the
repeat feature {m,n}; and the difference between the methods match, search,
findall, sub, etc. The challenge I am finding is getting a pattern to
extract specific word(s). I am trying to identify the best method to use and
how to use \1 when using a forward and backward search pattern (hoping I am
using the right term). Basically I am trying to extract specific phrases or
digits to place in a dictionary within categories. Thus if "ROYAL_BANK
123123123" is found, it is placed in a category called transfer funds. Others
might be a store name, which likewise is placed in the store category.
Note, I have found a logic error with "ROYAL_BANK 123123123", but that isn't
a concern. The extraction of the text is.
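One way to sketch the categorisation described above is a table of (pattern, category) pairs tried in order. The `RULES` table, the `categorise` helper and the exact patterns are hypothetical, chosen only for illustration; the sample lines are taken from this message.

```python
import re
from collections import defaultdict

# Hypothetical rule table: (compiled pattern, category name).
# Each pattern captures the part of the line worth keeping in group 1.
RULES = [
    (re.compile(r'^ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (?:TO|FROM) (\d+)$',
                re.IGNORECASE),
     'transfer funds'),
    (re.compile(r'^EFTPOS (.+)$', re.IGNORECASE), 'store'),
]


def categorise(lines):
    """Group the captured text of each matching line under its category."""
    categories = defaultdict(list)
    for line in lines:
        for pattern, category in RULES:
            found = pattern.search(line.strip())
            if found:
                categories[category].append(found.group(1))
                break  # first matching rule wins
    return dict(categories)


sample = [
    'Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299',
    'EFTPOS Amazon',
]
print(categorise(sample))
```

Keeping the rules in one list means a new line format needs only a new table entry rather than another branch in the function.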
Line examples:
Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299
Royal_bank M-BANKING PAYMENT TRANSFER 123456 FROM 9922992299
PAYMENT TO SARWARS-123123123
ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}
ROYAL_BANK INTERNET BANKING TRANSFER Mouth in foot
EFTPOS Amazon
PAY/SALARY FROM foo bar 123123123
PAYMENT TO Tax Man 666
And other similar structures. Below is the function I am currently using.
I am not sure whether sub, match or search is going to be the best method. The
reason I am using sub is to delete the unwanted text; search/match/findall
could do the same if I use a group. Also, I have not used any tests in the
below and logically I think I should, as the code will overwrite the result
when a later pattern is not found. If there is a more elegant way to do it
than having:
if line.startswith('text string to match'):
    regular expression
elif line.startswith('text string to match'):
    regular expression
return result
I would like to know. The different regular expressions I have used are:
# This sometimes matches and sometimes does not. I want all the text up to
# the FROM or TO to be replaced with "ROYAL_BANK", ending up with
# ROYAL_BANK 123123123.
result= re.sub(r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (TO|FROM) ',
'ROYAL_BANK ', line)
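A likely reason the substitution above matches only sometimes: the sample lines in this message spell the prefix as `Royal_bank` and one uses a lowercase `to`, while the pattern is all upper case and `re` is case-sensitive by default. A minimal sketch, assuming case should be ignored (the name `PREFIX` is hypothetical):

```python
import re

# (?:...) is a non-capturing group: TO/FROM must be present but is not kept.
PREFIX = re.compile(r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (?:TO|FROM) ',
                    re.IGNORECASE)

for line in ('Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299',
             'Royal_bank M-BANKING PAYMENT TRANSFER 123456 FROM 9922992299'):
    print(PREFIX.sub('ROYAL_BANK ', line))
```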
# The below returns the text from SARWARS onwards and it shouldn't. I should
# just get SARWARS.
result = re.match(r'PAYMENT TO (SARWARS)-\d+ ', line)
# The below (which didn't work the last time I tested it) should return the
# words captured by the (.*) group.
result = re.match(r'ROYAL_BANK INTERNET BANKING BPAY (.*) [{].*$', '\1',
line)
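For the two extraction attempts above, note that `re.match` takes a pattern and a string (plus optional flags), not a replacement, and returns a match object or `None`; the captured text comes from `.group(1)`. The first pattern also ends in a space that the sample line does not have. A hedged sketch using sample lines from this message (the `extract` helper is hypothetical):

```python
import re


def extract(pattern, line):
    """Return the text captured by group 1, or None if the line does not match."""
    found = re.match(pattern, line)
    return found.group(1) if found else None


# No trailing space after \d+, since the sample line ends with the digits.
store = extract(r'PAYMENT TO (SARWARS)-\d+', 'PAYMENT TO SARWARS-123123123')

# \{ matches a literal brace; .*? keeps the capture as short as possible.
biller = extract(r'ROYAL_BANK INTERNET BANKING BPAY (.*?) \{.*$',
                 'ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}')

print(store)
print(biller)
```

Checking for `None` before calling `.group()` avoids an `AttributeError` on lines that do not match.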
# the below patterns should remove the text at the beginning of the string
result = re.sub(r'ROYAL_BANK INTERNET BANKING FUNDS TFER TRANSFER \d+ TO ',
                'ROYAL_BANK ', line)
result = re.sub(r'ROYAL_BANK INTERNET BANKING TRANSFER ', '', line)
result = re.sub(r'EFTPOS ', '', line)
# The below does not work and I am trying to use the back or forward search
# feature. Is the syntax wrong or the pattern wrong? I cannot work it out
# from the information I have read.
result = re.sub(r'PAY/SALARY FROM (*.) \d+$', '\1', line)
result = re.sub(r'PAYMENT TO (*.) \d+', '\1', line)
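Two things stand out in the last two patterns. `(*.)` puts the `*` before anything it could repeat, so `re` raises an error; the intended group is presumably `(.*)`. And `'\1'` without the `r` prefix is the control character `\x01`, not a reference to group 1 (usually called a backreference); `r'\1'` is. A minimal sketch under those assumptions, with the pattern anchored so that only the captured name remains:

```python
import re

# r'\1' in the replacement refers back to whatever group 1 captured.
salary = re.sub(r'^PAY/SALARY FROM (.*) \d+$', r'\1',
                'PAY/SALARY FROM foo bar 123123123')
payee = re.sub(r'^PAYMENT TO (.*) \d+$', r'\1', 'PAYMENT TO Tax Man 666')

print(salary)
print(payee)
```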
Sean