Extracting parts of string between anchor points
Tim Chase
python.list at tim.thechases.com
Thu Feb 27 16:45:35 EST 2014
On 2014-02-27 20:07, Jignesh Sutar wrote:
> I've kind of got this working but my code is very ugly. I'm sure
> it's regular expression I need to achieve this more but not very
> familiar with use regex, particularly retaining part of the string
> that is being searched/matched for.
While I suppose this could be done with regular expressions, in this
case it seems like a bit of overkill. Just to show that it can be
done, I give you this monstrosity:
>>> examples = ["Test1A",
...     "Test2A: Test2B",
...     "Test3A: Test3B -:- Test3C",
...     ""]
>>> import re
>>> r = re.compile(r"^([^:]*)(?::((?:(?!-:-).)*)(?:-:-(.*))?)?")
>>> [r.match(s).groups() for s in examples]
[('Test1A', None, None), ('Test2A', ' Test2B', None), ('Test3A', ' Test3B ', ' Test3C'), ('', None, None)]
You'd still have to strip those values that are strings, but that
gets you the core of what you're seeking.
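For what it's worth, here is a rough sketch of the same pattern spelled out with re.VERBOSE, plus the stripping step. The names "pattern" and "stripped_groups" are made up for illustration; the regular expression itself is unchanged apart from the added whitespace and comments.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so each piece can be
# commented. Whitespace and "#" comments inside the pattern are ignored.
pattern = re.compile(r"""
    ^([^:]*)              # group 1: everything before the first colon
    (?:                   # optional tail, only tried if a colon follows
        :
        ((?:(?!-:-).)*)   # group 2: characters up to, not including, -:-
        (?:-:-(.*))?      # group 3: everything after -:-, if present
    )?
""", re.VERBOSE)


def stripped_groups(s):
    """Return the three groups with whitespace stripped; None stays None."""
    # The pattern can match the empty string, so match() never returns None.
    return tuple(
        g.strip() if g is not None else None
        for g in pattern.match(s).groups()
    )


for s in ["Test1A", "Test2A: Test2B", "Test3A: Test3B -:- Test3C", ""]:
    print(repr(s), stripped_groups(s))
```

Note that the empty string still comes back as '' rather than None in the first slot, so that case would need its own check. Also, because group 1 stops at the first colon of any kind, an input such as "Test4A -:- Test4D" is split inside the "-:-" and comes back as ('Test4A -', '- Test4D', None).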
You do omit several edge cases:
to_test = [
    "Test4A -:- Test4D",  # "-:-" but no standalone ":"
    "Test4A : Test4B : Test4C -:- Test4D",  # 2x ":"
    "Test4A : Test4B -:- Test4C -:- Test4D",  # 2x "-:-"
]
What should Out2 and Out3 be in those particular instances?
> Notes and code below to demonstrate what I am trying to achieve.
> Any help, much appreciated.
>
> Examples=["Test1A",
> "Test2A: Test2B",
> "Test3A: Test3B -:- Test3C", ""]
>
> # Out1 is just itself unless if it is empty
> # Out2 is everything left of ":" (including ":" i.e. part A) and
> right of "-:-" (excluding "-:-" i.e. part C)
> # If text doesn't contain "-:-" then return text itself as it is
> # Out3 is everything right of "-:-" (excluding "-:-" i.e. part C)
> # If text doesn't contain "-:-" but does contains ":" then
> return part B only
> # If it doesn't contain ":" then return itself (unless if it
> empty then "None")
I believe you want something like
examples = [
    # (input, expected (out1, out2, out3)); None means "no expectation"
    ("", (None, None, None)),
    ("Test1A", ("Test1A", None, None)),
    ("Test2A: Test2B", ("Test2A", "Test2B", None)),
    ("Test3A: Test3B -:- Test3C", ("Test3A", "Test3B", "Test3C")),
    # three test-cases with no provided expectations
    ("Test4A -:- Test4B", None),
    ("Test5A : Test5B : Test5C -:- Test5D", None),
    ("Test6A : Test6B -:- Test6C -:- Test6D", None),
]

def clean(t):
    # strip whitespace from each string, passing None through untouched
    return [
        s.strip() if s is not None else s
        for s in t
    ]

for s, expected in examples:
    out1 = out2 = out3 = None
    if ":" in s:
        if "-:-" in s:
            # split on the first "-:-", then look for ":" in the left part
            left, _, out3 = clean(s.partition("-:-"))
            if ":" in left:
                out1, _, out2 = clean(left.partition(":"))
            else:
                out1 = left
        else:
            out1, _, out2 = clean(s.partition(":"))
    else:
        if s:
            out1 = s
    result = (out1, out2, out3)
    if expected is not None:
        if result != expected:
            print("FAIL: %r got %r, not %r" % (s, result, expected))
        else:
            print("PASS: %r got %r" % (s, result))
    else:
        print("UNKN: %r got %r" % (s, result))
which gives me
PASS: '' got (None, None, None)
PASS: 'Test1A' got ('Test1A', None, None)
PASS: 'Test2A: Test2B' got ('Test2A', 'Test2B', None)
PASS: 'Test3A: Test3B -:- Test3C' got ('Test3A', 'Test3B', 'Test3C')
UNKN: 'Test4A -:- Test4B' got ('Test4A', None, 'Test4B')
UNKN: 'Test5A : Test5B : Test5C -:- Test5D' got ('Test5A', 'Test5B : Test5C', 'Test5D')
UNKN: 'Test6A : Test6B -:- Test6C -:- Test6D' got ('Test6A', 'Test6B', 'Test6C -:- Test6D')
I find that a good bit more readable than the atrocity of that
regular expression.
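If you want the splitting separate from the test harness, it could be condensed into a function along these lines. This is only a sketch: the name "split_parts" is made up, and it leans on str.partition returning an empty separator when the delimiter is absent. It agrees with the loop above on the seven inputs shown, but it is not identical in general (for instance, an empty first part comes back as None here rather than '').

```python
def split_parts(s):
    """Split s into (out1, out2, out3) around the first "-:-" and ":"."""
    out2 = out3 = None

    # partition() returns (head, sep, tail); sep is "" if "-:-" is absent.
    left, sep, right = s.partition("-:-")
    if sep:
        out3 = right.strip()

    # Only the text left of "-:-" is searched for a plain ":".
    head, sep, tail = left.partition(":")
    if sep:
        out2 = tail.strip()

    # An empty first part is reported as None (an assumption of this sketch).
    out1 = head.strip() or None
    return out1, out2, out3


for s in ["", "Test1A", "Test2A: Test2B", "Test3A: Test3B -:- Test3C"]:
    print("%r -> %r" % (s, split_parts(s)))
```

Doing the "-:-" split first means a lone "-:-" is never mistaken for a ":" separator, which is the same ordering the loop above uses.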
-tkc