Help debugging code - Negative lookahead problem

michael.gauthier.uni at gmail.com
Sun Feb 26 09:13:30 EST 2017


Hi everyone,

So here is my problem. I have a bunch of tweets and various metadata that I want to analyze for sociolinguistic purposes. To do this, I'm trying to infer users' ages from, among other things, the information they provide in their bios. For that I'm using regular expressions to match a few recurring patterns in users' bios, such as a number followed by one of the various spellings of "years old", as in:

"John, 30 years old, engineer."

The reason I'm using regexes for this is that there are actually very few ways people mention their age on Twitter, so just three or four regexes would allow me to infer the age of most users in my dataset. However, in this case I also want to check what comes after "years old", as many people mention their children's ages, and I don't want these to be incorrectly associated with the user's age, as in:

"John, father of a 12 year old kid, engineer"


Cases like the one above should be ignored, so that I keep only users for whom a valid age can be inferred.
My program looks like this:


import csv
import re
 
with open("test_corpus.csv") as corpus:
    corpus_read = csv.reader(corpus, delimiter=",")
    for row in corpus_read:
        if re.findall(r"\d{2}\s?(?=years old\s?|yo\s?|yr old\s?|y o\s?|yrs  old\s?|year old\s?(?!son|daughter|kid|child))",row[5].lower()):
            age = re.findall(r"\d{2}\s?",row[5].lower())
            for i in age:
                print(i)


The program seems to work in some cases, but in the small test file I created to try it out, it incorrectly matches the age mentioned in the string "I have a 12 yo son" and returns 12 as a matched age, which I don't want. I'm guessing this has something to do with brackets or delimiters somewhere in the pattern, but I have spent a few days on it and could not find anything helpful here or on other forums, so any help would be appreciated.
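
A small reproduction may help narrow this down. The sketch below runs the pattern from the program unchanged against two lowercased bios; the variable names are made up for illustration. It suggests two things: the `|` alternation binds more loosely than the trailing `(?!...)`, so the negative lookahead belongs only to the last branch (`year old`), and even on that branch the optional `\s?` can match nothing, which leaves the lookahead facing " kid" (with a leading space) rather than "kid".

```python
import re

# The pattern from the program above, copied unchanged.
original = r"\d{2}\s?(?=years old\s?|yo\s?|yr old\s?|y o\s?|yrs  old\s?|year old\s?(?!son|daughter|kid|child))"

# Matched through the "yo\s?" branch, which carries no negative lookahead.
yo_result = re.findall(original, "i have a 12 yo son")
print(yo_result)    # ['12 ']

# Matched through the last branch: "\s?" backtracks to match nothing,
# so the negative lookahead is tested against " kid" and passes.
kid_result = re.findall(original, "john, father of a 12 year old kid, engineer")
print(kid_result)   # ['12 ']
```

Separately, the second `re.findall(r"\d{2}\s?", ...)` call in the program re-scans the whole bio for any two-digit number, so it does not reuse what the first call matched.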

Thus, the actual question is: how can I make the program not recognize 12 in "John, father of a 12 year old kid, engineer" as the age of the user, based on the program I already have?
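
One possible restructuring, offered as a minimal sketch rather than a definitive answer: group all the spellings together, apply the negative lookahead once after the whole group, and let the lookahead skip the whitespace itself with `\s*`. The names `AGE_PATTERN` and `infer_age` are hypothetical, and the sketch assumes that the double space in `yrs  old` was meant to be a single space and that only the first age found in a bio is wanted.

```python
import re

# Hypothetical names, for illustration only.
AGE_PATTERN = re.compile(
    r"\b(\d{2})\s?"                       # a two-digit number, captured
    r"(?:years? old|yrs? old|yo|y o)\b"   # the spellings used in the original pattern
    r"(?!\s*(?:son|daughter|kid|child))"  # rejected if a child word follows
)

def infer_age(bio):
    """Return the first age mentioned in bio as an int, or None if there is none."""
    match = AGE_PATTERN.search(bio.lower())
    return int(match.group(1)) if match else None

for bio in ("John, 30 years old, engineer.",
            "I have a 12 yo son",
            "John, father of a 12 year old kid, engineer"):
    print(bio, "->", infer_age(bio))
```

Because the number is captured in a group, there is no need for a second `findall` over the bio. Note that the child words are matched as prefixes, as in the original pattern, so a bio such as "30 years old songwriter" would also be rejected; a stricter word list would be needed to avoid that.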

I am somewhat new to programming, so apologies if I forgot to mention something important. Do not hesitate to tell me if you need more details.

Thanks in advance for any help you could provide!



More information about the Python-list mailing list