Help debugging code - Negative lookahead problem

Peter Otten __peter__ at web.de
Sun Feb 26 13:57:22 EST 2017


michael.gauthier.uni at gmail.com wrote:

> Hi MRAB,
> 
> Thanks for taking time to look at my problem!
> 
> I tried your solution:
> 
> r"\d{2}\s?(?=(?:years old\s?|yo\s?|yr old\s?|y o\s?|yrs  old\s?|year
> old\s?)(?!son|daughter|kid|child))"
> 
> but unfortunately it does not seem to work. Also, I tried adding the negative
> lookaheads after every one of the alternatives, but it does not work
> either, so the problem does not seem to be that the negative lookahead
> applies only to the last proposition... : (
> 
> Also, \d{2} will only match two single digits, and won't match the last
> two digits of 101, so at least this is fine! : )
> 
> Any other idea to improve that code? I'm starting to get desperate...

If your code becomes too complex to manage, break it into simpler parts.
In this case you can use two simple regular expressions:

>>> age = re.compile(r"\d+")
>>> child = re.compile(r"\s+kid")
>>> text = "42 bar baz foo 12 kid"
>>> for candidate in age.finditer(text):
...     if child.match(text, candidate.end()):
...         print("Kid's age:", candidate.group())
...     else:
...         print("Author's age:", candidate.group())
... 
Author's age: 42
Kid's age: 12
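
If you want to reuse that check outside the interactive session, you could 
wrap it in a function. This is only a minimal sketch; the function name 
classify_ages and the "kid"/"author" labels are made up for illustration. 
The important detail is the second argument to match(), which starts the 
child pattern exactly where the age candidate ended:

```python
import re

AGE = re.compile(r"\d+")
CHILD = re.compile(r"\s+kid")


def classify_ages(text):
    """Return (label, digits) pairs for every number found in text.

    classify_ages and its labels are hypothetical names for this sketch.
    """
    results = []
    for candidate in AGE.finditer(text):
        # Pattern.match(string, pos) anchors the child pattern at the
        # position where the age candidate ended.
        if CHILD.match(text, candidate.end()):
            label = "kid"
        else:
            label = "author"
        results.append((label, candidate.group()))
    return results


print(classify_ages("42 bar baz foo 12 kid"))
# [('author', '42'), ('kid', '12')]
```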


Applying that idea (and the principle of breaking everything into dead-easy 
parts) to your problem:

$ cat demo.py    
import re

def longest_first(text):
    return sorted(text.splitlines(), key=len, reverse=True)

YEARS = longest_first("""\
year
years
year old
years old
yo
ys o
""")

CHILDREN = longest_first("""\
son
daughter
kid
child
""")

YEARS_RE = r"\b(?P<age>\d+) ({})".format("|".join(YEARS))
re_years = re.compile(YEARS_RE)

CHILD_RE = r" ({})\b".format("|".join(CHILDREN))
re_child = re.compile(CHILD_RE)


def followed_by_child(candidate):
    return re_child.match(candidate.string, candidate.end())


CORPUS = """\
jester, 42 years old, 20 years kidding
12 years kid
engineer, 30 years
engineer, 30 years old daughter
""".splitlines()

for text in CORPUS:
    print(text)
    for m in re_years.finditer(text):
        age = m.group("age")
        if followed_by_child(m):
            print("    rejected:", age)
        else:
            print("    accepted:", age)
    
$ python3 demo.py
jester, 42 years old, 20 years kidding
    accepted: 42
    accepted: 20
12 years kid
    rejected: 12
engineer, 30 years
    accepted: 30
engineer, 30 years old daughter
    rejected: 30
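
To see why the separate second step helps, compare it with a single pattern 
that appends a negative lookahead to the alternation. This is only a sketch: 
the lists are copied from demo.py, the names single, one_step and two_step 
are made up for illustration, and I am not claiming this is exactly what 
goes wrong in your original pattern. With one pattern the regex engine is 
free to backtrack to a shorter alternative ("years" instead of "years old", 
or "year" instead of "years") until the lookahead is satisfied. With two 
patterns the first match is already fixed when the second check runs.

```python
import re

UNITS = ["years old", "year old", "years", "year", "ys o", "yo"]
KIDS = ["daughter", "child", "kid", "son"]

# Single pattern (hypothetical): alternation plus negative lookahead.
single = re.compile(
    r"\b(?P<age>\d+) (?:{})(?! (?:{})\b)".format("|".join(UNITS), "|".join(KIDS))
)

# Two patterns, as in demo.py: match the unit, then test what follows.
years = re.compile(r"\b(?P<age>\d+) (?:{})".format("|".join(UNITS)))
child = re.compile(r" (?:{})\b".format("|".join(KIDS)))


def one_step(text):
    """Ages accepted by the single pattern."""
    return [m.group("age") for m in single.finditer(text)]


def two_step(text):
    """Ages accepted when the child check runs as a separate step."""
    return [
        m.group("age")
        for m in years.finditer(text)
        if not child.match(text, m.end())
    ]


for text in ["engineer, 30 years old daughter", "12 years kid"]:
    print(text, one_step(text), two_step(text))
# engineer, 30 years old daughter ['30'] []
# 12 years kid ['12'] []
```

In the first line the single pattern gives up on "years old", falls back to 
"years", and then sees " old" instead of a child word, so the age slips 
through. In the second it falls back to "year", after which the next 
character is "s" rather than a space.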




