Extracting parts of string between anchor points

Thu Feb 27 16:45:35 EST 2014

On 2014-02-27 20:07, Jignesh Sutar wrote:
> I've kind of got this working but my code is very ugly. I'm sure
> it's regular expression I need to achieve this more but not very
> familiar with use regex, particularly retaining part of the string
> that is being searched/matched for.

While I suppose this could be done with regular expressions, in this
case it seems like a bit of overkill.  Just to show that it can be
done, I give you this monstrosity:

>>> examples = ["Test1A",
... "Test2A: Test2B",
... "Test3A: Test3B -:- Test3C",
... ""]
>>> import re
>>> r = re.compile(r"^([^:]*)(?::((?:(?!-:-).)*)(?:-:-(.*))?)?")
>>> [r.match(s).groups() for s in examples]
[('Test1A', None, None), ('Test2A', ' Test2B', None), ('Test3A', '
Test3B ', ' Test3C'), ('', None, None)]

You'd still have to strip those values that are strings, but that
gets you the core of what you're seeking.

You do omit several edge cases:

  to_test = [
    "Test4A  -:- Test4D",                    # no ":"
    "Test4A : Test4B : Test4C -:- Test4D",   # 2x ":"
    "Test4A : Test4B -:- Test4C -:- Test4D", # 2x "-:-"
    ]

what should Out2 and Out3 be in those particular instances?

> Notes and code below to demonstrate what I am trying to achieve.
> Any help, much appreciated.
> 
> Examples=["Test1A",
>                   "Test2A: Test2B",
>                    "Test3A: Test3B -:- Test3C", ""]
> 
> # Out1 is just itself unless if it is empty
> # Out2 is everything left of ":" (including ":" i.e. part A) and
> right of "-:-" (excluding "-:-" i.e. part C)
>     # If text doesn't contain "-:-" then return text itself as it is
> # Out3 is everything right of "-:-" (excluding "-:-" i.e. part C)
>    # If text doesn't contain "-:-" but does contains ":" then
> return part B only
>    # If it doesn't contain ":" then return itself (unless if it
> empty then "None")

I believe you want something like

  examples = [
    ("", (None, None, None)),
    ("Test1A", ("Test1A", None, None)),
    ("Test2A: Test2B", ("Test2A", "Test2B", None)),
    ("Test3A: Test3B -:- Test3C", ("Test3A", "Test3B", "Test3C")),
    # three test-cases with no provided expectations
    ("Test4A -:- Test4B", None),
    ("Test5A : Test5B : Test5C -:- Test5D", None),
    ("Test6A : Test6B -:- Test6C -:- Test6D", None),
    ]

  def clean(t):
    return [
      s.strip() if s is not None else s
      for s in t
      ]

  for s, expected in examples:
    out1 = out2 = out3 = None
    if ":" in s:
      if "-:-" in s:
        left, _, out3 = clean(s.partition("-:-"))
        if ":" in left:
          out1, _, out2 = clean(left.partition(":"))
        else:
          out1 = left
      else:
        out1, _, out2 = clean(s.partition(":"))
    else:
      if s:
        out1 = s
    result = (out1, out2, out3)
    if expected is not None:
      if result != expected:
        print("FAIL: %r got %r, not %r" % (s, result, expected))
      else:
        print("PASS: %r got %r" % (s, result))
    else:
      print("UNKN: %r got %r" % (s, result))

which gives me

PASS: '' got (None, None, None)
PASS: 'Test1A' got ('Test1A', None, None)
PASS: 'Test2A: Test2B' got ('Test2A', 'Test2B', None)
PASS: 'Test3A: Test3B -:- Test3C' got ('Test3A', 'Test3B', 'Test3C')
UNKN: 'Test4A -:- Test4B' got ('Test4A', None, 'Test4B')
UNKN: 'Test5A : Test5B : Test5C -:- Test5D' got ('Test5A', 'Test5B : Test5C', 'Test5D')
UNKN: 'Test6A : Test6B -:- Test6C -:- Test6D' got ('Test6A', 'Test6B', 'Test6C -:- Test6D')

I find that a good bit more readable than the atrocity of that
regular expression.

-tkc