[Tutor] regular expression query

Sat Jun 8 08:27:28 EDT 2019

Hello all,

Windows 10 OS, Python 3.6

I have a couple of queries about extracting content using regular
expressions. I understand the pattern characters (. ? * +), the
meta-characters \d, \D, \w, \W and so on, the class structure [...], the
group (...), which I believe I understand, the repeat feature {m,n}, and the
differences between the methods match, search, findall, sub, etc. The
challenge I am finding is getting a pattern to extract specific word(s),
identifying the best method to use, and working out how to use \1 with a
forward or backward search pattern (hoping I am using the right term).
Basically, I am trying to extract specific phrases or digits and place them
in a dictionary under categories. Thus if "ROYAL_BANK 123123123" is found, it
is placed in a category called transfer funds. Another might be a store name,
which is likewise placed in the store category.
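
The categorising idea described above could be sketched roughly as follows.
This is only an illustration under assumptions: the RULES table, the two
patterns in it, the category names and the categorise() helper are all
hypothetical and are not taken from the actual program.

```python
import re
from collections import defaultdict

# Hypothetical rule table: (category, compiled pattern with one capture group).
# The patterns and category names are assumptions for illustration only.
RULES = [
    ("transfer funds", re.compile(r"^(ROYAL_BANK \d+)$")),
    ("store", re.compile(r"^EFTPOS (.+)$")),
]


def categorise(lines):
    """Group the captured text of each line under the first rule that matches."""
    found = defaultdict(list)
    for line in lines:
        for category, pattern in RULES:
            m = pattern.search(line)
            if m:
                found[category].append(m.group(1))
                break  # first matching rule wins; later rules are not tried
    return dict(found)


categories = categorise(["ROYAL_BANK 123123123", "EFTPOS Amazon", "unrelated text"])
```

Lines that match no rule are simply skipped in this sketch; a real program
might want to collect them under an "unknown" category instead.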

Note, I have found a logic error with "ROYAL_BANK 123123123", but that isn't
a concern. The extraction of the text is.

Line examples:

Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299

Royal_bank M-BANKING PAYMENT TRANSFER 123456 FROM 9922992299

PAYMENT TO SARWARS-123123123

ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}

ROYAL_BANK INTERNET BANKING TRANSFER Mouth in foot

EFTPOS Amazon

PAY/SALARY FROM foo bar 123123123

PAYMENT TO Tax Man  666

And other similar structures. Below is the function I am currently using.
I am not sure whether sub, match or search is going to be the best method.
The reason I am using sub is to delete the unwanted text; search, match or
findall could do the same if I use a group. Also, I have not used any tests
in the below, and logically I think I should, because each later assignment
overwrites the result even when its pattern is not found. If there is a more
elegant way to do it than having:

    if line.startswith('text string to match'):
        regular expression
    elif line.startswith('text string to match'):
        regular expression
    return result

I would like to know. The different regular expressions I have used are:

# This sometimes matches and sometimes does not. I want all the text up to
# and including the FROM or TO to be replaced with "ROYAL_BANK", ending up
# with ROYAL_BANK 123123123.

    result = re.sub(r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (TO|FROM) ',
                    'ROYAL_BANK ', line)
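
One possible reason for the intermittent matching, offered as a guess rather
than a diagnosis: the sample lines are written as "Royal_bank" with a
lowercase "to", while the pattern is all upper case, and regular expressions
are case-sensitive by default. A minimal sketch using the standard
re.IGNORECASE flag, with the two sample lines from the line examples (the
variable names are only for illustration):

```python
import re

# Pattern from the post, compiled with re.IGNORECASE so that "Royal_bank"
# and a lowercase "to" also match.
pattern = re.compile(
    r"ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (TO|FROM) ", re.IGNORECASE
)

samples = [
    "Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299",
    "Royal_bank M-BANKING PAYMENT TRANSFER 123456 FROM 9922992299",
]

# Replace everything up to and including TO/FROM with the fixed bank name.
results = [pattern.sub("ROYAL_BANK ", s) for s in samples]

# Without the flag the first sample comes back unchanged, because
# "Royal_bank" is not the same text as "ROYAL_BANK".
unchanged = re.sub(
    r"ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (TO|FROM) ", "ROYAL_BANK ", samples[0]
)
```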

# The below returns everything from SARWARS onward, and it shouldn't. I
# should get just SARWARS.

    result = re.match(r'PAYMENT TO (SARWARS)-\d+ ', line)
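
Two details may matter here, sketched below with the sample line from the
line examples. re.match returns a match object (or None), not the captured
text, so the group has to be read with .group(1). Also, the pattern as posted
ends with a space that the sample line does not contain, so it would not
match that line at all. The variable names are only for illustration.

```python
import re

line = "PAYMENT TO SARWARS-123123123"

# Pattern as posted: the trailing space after \d+ has nothing to match here.
as_posted = re.match(r"PAYMENT TO (SARWARS)-\d+ ", line)

# Same pattern without the trailing space.
m = re.match(r"PAYMENT TO (SARWARS)-\d+", line)

whole = m.group(0)  # everything the pattern matched
name = m.group(1)   # only the text inside the parentheses
```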

# The below (which did not work the last time I tested it) should return
# the words captured by the (.*) group.

    result = re.match(r'ROYAL_BANK INTERNET BANKING BPAY (.*) [{].*$', '\1',
line)
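
A sketch of one way to get the words captured by the group, assuming the
sample BPAY line from the line examples. re.match takes (pattern, string,
flags), so a replacement such as '\1' is not one of its arguments; the
captured text is read from the match object instead. As an alternative,
re.sub with the raw-string replacement r'\1' produces the same text.

```python
import re

line = "ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}"
pattern = r"ROYAL_BANK INTERNET BANKING BPAY (.*) [{].*$"

# Option 1: match, then read group 1 from the match object.
m = re.match(pattern, line)
store = m.group(1) if m else None

# Option 2: replace the whole match with the text of group 1.
store_via_sub = re.sub(pattern, r"\1", line)
```

The difference between the two shows up when the line does not match:
option 1 gives None, while option 2 returns the line unchanged.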

# the below patterns should remove the text at the beginning of the string

    result = re.sub(r'ROYAL_BANK INTERNET BANKING FUNDS TFER TRANSFER \d+ TO ',
                    'ROYAL_BANK ', line)

    result = re.sub(r'ROYAL_BANK INTERNET BANKING TRANSFER ', '', line)

    result = re.sub(r'EFTPOS ', '', line)
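
Because each of these assignments replaces result, a later substitution that
finds nothing sets result back to the unmodified line. One way around that,
sketched here under assumptions, is to keep the patterns in a list and stop
at the first one that actually changes the line. PREFIX_RULES and
strip_prefix are hypothetical names, and the ^ anchors are an addition that
restricts each pattern to the start of the string.

```python
import re

# Hypothetical table of (pattern, replacement) pairs, tried in order.
PREFIX_RULES = [
    (re.compile(r"^ROYAL_BANK INTERNET BANKING FUNDS TFER TRANSFER \d+ TO "),
     "ROYAL_BANK "),
    (re.compile(r"^ROYAL_BANK INTERNET BANKING TRANSFER "), ""),
    (re.compile(r"^EFTPOS "), ""),
]


def strip_prefix(line):
    """Apply the first rule whose pattern matches; otherwise return the line as is."""
    for pattern, replacement in PREFIX_RULES:
        result, count = pattern.subn(replacement, line, count=1)
        if count:  # subn also reports how many substitutions were made
            return result
    return line
```

With this, strip_prefix("EFTPOS Amazon") gives "Amazon", and a line that
matches none of the rules comes back untouched.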

# The below does not work, and I am trying to use the back or forward search
# feature. Is the syntax wrong or the pattern wrong? I cannot work it out
# from the information I have read.

    result = re.sub(r'PAY/SALARY FROM (*.) \d+$', '\1', line)

    result = re.sub(r'PAYMENT TO (*.) \d+', '\1', line)
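
Two things stand out in these patterns, sketched below as a suggestion rather
than a definitive fix. First, (*.) puts the * before anything it could
repeat, which re rejects with a "nothing to repeat" error; (.*) is presumably
what was meant. Second, in a normal string '\1' is the single character
chr(1), so the replacement needs to be the raw string r'\1' (or '\\1') to
refer to group 1. The lazy (.*?) and the \s+ in the second pattern are
additions made here to cope with the double space in the "Tax Man" sample.

```python
import re

# r"\1" in the replacement refers to whatever group 1 captured.
salary = re.sub(r"PAY/SALARY FROM (.*) \d+$", r"\1",
                "PAY/SALARY FROM foo bar 123123123")

# (.*?) followed by \s+ keeps trailing spaces out of the captured name.
payee = re.sub(r"PAYMENT TO (.*?)\s+\d+$", r"\1", "PAYMENT TO Tax Man  666")

# The pattern as posted fails to compile because "*" has nothing to repeat.
try:
    re.compile(r"PAY/SALARY FROM (*.) \d+$")
    compiled = True
except re.error:
    compiled = False
```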

Sean