[Tutor] regular expression query

Sun Jun 9 19:35:28 EDT 2019

On 08Jun2019 22:27, Sean Murphy <mhysnm1964 at gmail.com> wrote:
>Windows 10 OS, Python 3.6

Thanks for this.

>I have a couple of  queries  in relation to extracting content using 
>regular expressions. I understand [...the regexp syntax...]
>The challenge I am finding is getting a pattern to
>extract specific word(s). Trying to identify the best method to use and how
>to use the \1 when using forward and backward search pattern (Hoping I am
>using the right term). Basically I am trying to extract specific phrases or
>digits to place in a dictionary within categories. Thus if "ROYaL_BANK
>123123123" is found, it is placed in a category called transfer funds. Other
>might be a store name which likewise is placed in the store category.

I'll tackle your specific examples lower down, and make some 
suggestions.

>Note, I have found a logic error with "ROYAL_BANK 123123123", but that 
>isn't a concern. The extraction of the text is.
>
>Line examples:
>Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299
>Royal_bank M-BANKING PAYMENT TRANSFER 123456 FROM 9922992299
>PAYMENT TO SARWARS-123123123
>ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}
>EFTPOS Amazon
>PAY/SALARY FROM foo bar 123123123
>PAYMENT TO Tax Man  666

Thanks.

Assuming the below is a cut/paste accident from some code:

  result = re.sub(r'ROYAL_BANK INTERNET BANKING FUNDS TFER TRANSFER \d+ TO ', 'ROYAL_BANK ', line)
  r'ROYAL_BANK INTERNET BANKING TRANSFER Mouth in foot

>And other similar structures. Below is the function I am currently using.
>Not sure if the sub, match or search is going to be the best method. The
>reason why I am using a sub is to delete the unwanted text. The
>searchmatch/findall  could do the same if I use a group. Also I have not
>used any tests in the below and logically I think I should. As the code will
>override the results if not found in the later tests. If there is a more
>elegant  way to do it then having:
>
>If line.startswith('text string to match'):
>    Regular expression
>el If line.startswith('text string to match'):
>    regular expression
>return result

There is. How far you take it depends on how variable your input is.  
Banking statement data I would expect to have relatively few formats 
(unless the banking/finance industry is every bit as fragmented as I 
sometimes believe, in which case the structure might be driven less by 
_your_ bank and more by whatever ad hoc junk the various other entities 
supply as the description).

>I would like to know. The different regular expressions I have used 
>are:
>
># this sometimes matches and sometimes does not. I want all the text up to
>the from or to, to be replaced with "ROYAL_BANK". Ending up with ROYAL_BANK
>123123123
>
>    result= re.sub(r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (TO|FROM) ',
>'ROYAL_BANK ', line)

Looks superficially ok. Got an example input line where it fails? Note 
that the above is case sensitive, so if "to" etc can be in lower case 
(as in your example text earlier) this will fail. See the re.I modifier.
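
To make the case point concrete, here is a minimal sketch using the first 
of your example lines. The only change from your call is the flags=re.I 
argument; the variable names are just for illustration.

```python
import re

line = 'Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299'
pattern = r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (TO|FROM) '

# Case sensitive: "Royal_bank" and "to" do not match, so the line is returned unchanged.
strict = re.sub(pattern, 'ROYAL_BANK ', line)

# Case insensitive: the leading text is replaced.
relaxed = re.sub(pattern, 'ROYAL_BANK ', line, flags=re.I)

print(strict)
print(relaxed)
```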

># the below  returns from STARWARS and it shouldn't. I should just get
>STARWARS.
>
>    result = re.match(r'PAYMENT TO (SARWARS)-\d+ ', line)

Well, STARWARS seems misspelt above. And you should get a "match" 
object, with "STARWARS" in .group(1).

So earlier you're getting a str in result, and here you're getting an 
re.match object (or None for a failed match).
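
A small sketch of pulling the name out of the match object, run against 
your sample line as pasted (so "SARWARS"). Two assumptions on my part: I 
have dropped the trailing space from your pattern, since the sample line 
ends straight after the digits, and I have used \w+ rather than a fixed 
name.

```python
import re

line = 'PAYMENT TO SARWARS-123123123'

# Assumption: no trailing space in the pattern, because the sample line as
# pasted ends straight after the digits.
m = re.match(r'PAYMENT TO (\w+)-\d+', line, re.I)
if m:
    whole = m.group(0)  # everything the pattern matched
    name = m.group(1)   # only the text inside the brackets
else:
    whole = name = None

# A line the pattern does not fit gives None, not a match object.
no_match = re.match(r'PAYMENT TO (\w+)-\d+', 'EFTPOS Amazon', re.I)

print(whole, name, no_match)
```

Printing the match object itself shows the whole matched span, which may 
be why it looked like you were getting more than the name back.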

># the below should (doesn't work the last time I tested it) should 
>return the words between the (.)
>
>    result = re.match(r'ROYAL_BANK INTERNET BANKING BPAY (.*) [{].*$', '\1', line)

"should" what? It would help to see the input line you expect this to 
match. And re.match is not an re.sub - it looks like you have these 
confused here, based on the following '\1', line parameters.
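
For comparison, a sketch of the two call shapes side by side, assuming 
the BPAY line from your examples is the intended input. re.sub takes 
(pattern, replacement, string) while re.match takes (pattern, string, 
flags), so the third positional argument to re.match is not the line.

```python
import re

line = 'ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}'
pattern = r'ROYAL_BANK INTERNET BANKING BPAY (.*) [{].*$'

# re.sub(pattern, replacement, string): returns a new str.
via_sub = re.sub(pattern, r'\1', line, flags=re.I)

# re.match(pattern, string, flags): returns a match object or None.
m = re.match(pattern, line, re.I)
via_match = m.group(1) if m else None

print(via_sub)
print(via_match)
```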

># the below patterns should remove the text at the beginning of the string
>    result = re.sub(r'ROYAL_BANK INTERNET BANKING FUNDS TFER TRANSFER \d+ TO ', 'ROYAL_BANK ', line)
>    result = re.sub(r'ROYAL_BANK INTERNET BANKING TRANSFER ', '', line)
>    result = re.sub(r'EFTPOS ', '', line)

Sure. Got an example line where this does not happen?

># The below does not work and I am trying to use the back or forward 
>search feature. Is this syntax wrong or the pattern wrong? I cannot work it out
>from the information I have read.
>
>     result = re.sub(r'PAY/SALARY FROM (*.) \d+$', '\1', line)
>    result = re.sub(r'PAYMENT TO (*.) \d+', '\1', line)

You've got "*." You probably mean ".*"
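
A short sketch of both problems in those two lines, using your PAY/SALARY 
example. Besides the "*." order, note that '\1' without the r prefix is a 
Python string escape for the single character chr(1), so the replacement 
needs to be a raw string as well.

```python
import re

line = 'PAY/SALARY FROM foo bar 123123123'

# "(*.)" is not a valid pattern: "*" has nothing before it to repeat.
try:
    re.compile(r'PAY/SALARY FROM (*.) \d+$')
    compiled_ok = True
except re.error:
    compiled_ok = False

# Without the r prefix, '\1' is the single character chr(1), not a group reference.
not_a_backreference = '\1' == chr(1)

# Corrected: ".*" in the pattern, raw string for the replacement.
fixed = re.sub(r'PAY/SALARY FROM (.*) \d+$', r'\1', line, flags=re.I)

print(compiled_ok, not_a_backreference, fixed)
```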

Main issues:

1: Your input data seems to be mixed case, but all your regexps are case 
sensitive. They will not match if the case is different eg "Royal_Bank" 
vs "ROYAL_BANK", "to" vs "TO", etc. Use the re.I modifier to make your 
regexps case insensitive.

2: You're using re.sub a lot. I'd be inclined to always use re.match and 
to pull information from the match object you get back. Untested example 
sketch:

  m = re.match(r'(ROYAL_BANK|COMMONER_CREDIT_UNION) INTERNET BANKING FUNDS TFER TRANSFER (\d+) TO (.*)', line, re.I)
  if m:
    # .group(n) returns the text matched by the nth bracketed group
    category = m.group(1)
    id_number = m.group(2)
    recipient = m.group(3)
  else:
    m = re.match(.......)
    ... more tests here ...
    ...
    ...
    else:
      ... report unmatched line for further consideration ...

3: You use ".*" a lot. This is quite prone to matching too much. You 
might find things like "\S+" better, which matches a single 
nonwhitespace "word". It depends a bit on your input.
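
Pulling the three points together, here is one possible shape for it: a 
list of (category, regexp) pairs tried in order, instead of a chain of 
if/elif tests. This is a sketch only. The category names, the RULES list 
and the classify function are my invention, and the patterns are guesses 
from your sample lines rather than anything tested on real statements.

```python
import re

# (category, compiled regexp) pairs, tried in order. Names and patterns are
# guesses based on the sample lines in this thread.
RULES = [
    ('transfer', re.compile(r'ROYAL_BANK M-BANKING PAYMENT TRANSFER \d+ (?:TO|FROM) (\d+)', re.I)),
    ('bpay', re.compile(r'ROYAL_BANK INTERNET BANKING BPAY (.*?) [{]\d+[}]', re.I)),
    ('store', re.compile(r'EFTPOS (.+)', re.I)),
    ('salary', re.compile(r'PAY/SALARY FROM (.*?)\s+\d+$', re.I)),
    ('payment', re.compile(r'PAYMENT TO (.*?)[-\s]+\d+$', re.I)),
]

def classify(line):
    """Return (category, extracted text), or (None, line) if no rule matched."""
    for category, regexp in RULES:
        m = regexp.match(line)
        if m:
            return category, m.group(1)
    return None, line

sample_lines = [
    'Royal_bank M-BANKING PAYMENT TRANSFER 123456 to 9922992299',
    'Royal_bank M-BANKING PAYMENT TRANSFER 123456 FROM 9922992299',
    'PAYMENT TO SARWARS-123123123',
    'ROYAL_BANK INTERNET BANKING BPAY Kangaroo Store {123123123}',
    'EFTPOS Amazon',
    'PAY/SALARY FROM foo bar 123123123',
    'PAYMENT TO Tax Man  666',
]

# Collect the extracted text per category; unmatched lines land under None
# so they can be reported for further consideration.
categories = {}
for line in sample_lines:
    category, text = classify(line)
    categories.setdefault(category, []).append(text)

# Point 3 in miniature, on a made-up line with two trailing numbers.
demo = 'PAYMENT TO SARWARS 123 456'
greedy = re.match(r'PAYMENT TO (.*) \d+', demo).group(1)  # runs on into "123"
word = re.match(r'PAYMENT TO (\S+) \d+', demo).group(1)   # stops at the first word

print(categories)
print(repr(greedy), repr(word))
```

The flip side of "\S+" is that it will not cover a multi-word name such as 
"Tax Man", which is why the sketch uses a non-greedy ".*?" pinned down by 
what must follow it. Which is better depends on your input.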

Cheers,
Cameron Simpson <cs at cskk.id.au>