[Tutor] Iterating through a list of replacement regex patterns

Sat Sep 4 04:02:08 CEST 2010

On Fri, Sep 3, 2010 at 9:57 PM, David Hutto <smokefloat at gmail.com> wrote:
> First of all, I'll respond more thoroughly tomorrow, when I can review
> what you said more clearly, but for now I'll clarify.
>
> Here is the whole code that I'm using:
>
> http://pastebin.com/Ak8DFjrb
>
> On Fri, Sep 3, 2010 at 9:12 PM, Steven D'Aprano <steve at pearwood.info> wrote:
>> On Fri, 3 Sep 2010 12:24:00 pm David Hutto wrote:
>>> In the below function I'm trying to iterate over the lines in a
>>> textfile, and try to match with a regex pattern that iterates over
>>> the lines in a dictionary(not {}, but turns a text list of
>>> alphabetical words into a list using readlines()).
>>>
>>> def regexfiles(filename):
>> [...]
>>
>> Your function does too much. It:
>>
>> 1 manages a file opened for input and output;
>> 2 manages a dictionary file;
>> 3 handles multiple searches at once (findall and search);
>> 4 handles print output;
>>
>> and you want it to do more:
>>
>> 5 handle multiple "select" terms.
>>
>>
>> Your function is also poorly defined -- I've read your description
>> repeatedly, and studied the code you give, and your three follow up
>> posts, and I still don't have the foggiest clue of what it's supposed
>> to accomplish!
>
> This is supposed to recreate a thought experiment I've heard about, in which,
> if you have an infinite number of monkeys with an infinite number of
> typewriters, they'll eventually spit out Shakespeare.
>
> So I create a file of random length, filled with random characters, then
> regex it against each of the dictionary terms in turn. The regex is also
> needed later on, to explore the thought experiment further: if dictionary
> words are matched, the file then needs further regexes for grammatically
> proper structures. These don't have to make sense (take Mad Libs, for
> example); they just have to be grammatically correct in randomly
> produced files.
>
> So the bulldozer is needed.

I forgot: the randomly generated files, which have random lengths, are regexed
for words, then sorted into a range file by the length of the words they
contain, then regexed for grammatical structure and sorted again.

The latter steps have not been put in yet. It only goes up to finding the
length of the real words in the file; the grammar regex is next on my list.
So it's more practice with regex than using a bulldozer to dig a fire pit.
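
A rough sketch of that pipeline, up to the step that exists so far (finding
real words and grouping them by length), might look like the following. The
names random_text and words_by_length, the alphabet, the seed and the tiny
word list are all assumptions made for illustration; none of them come from
the pastebin code. It uses plain substring tests instead of regexes, and
calls print() as a function so that it runs on Python 3.

```python
import random
import string


def random_text(length, rng):
    """Return a string of `length` random lowercase letters and spaces."""
    alphabet = string.ascii_lowercase + " "
    return "".join(rng.choice(alphabet) for _ in range(length))


def words_by_length(text, words):
    """Group the words that occur somewhere in text by their length."""
    found = {}
    for word in words:
        if word in text:
            found.setdefault(len(word), []).append(word)
    return found


rng = random.Random(2010)                 # fixed seed, so runs are repeatable
text = random_text(rng.randint(100, 1000), rng)
sample_words = ["a", "at", "cat", "dog"]  # stand-in for the dictionary file
print(words_by_length(text, sample_words))
```

The grammar step would then work on the grouped words rather than on the raw
text.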
>
>
>> You do two searches on each iteration, but other than
>> printing the results, you don't do anything with them.
>
> I had to print the results in order to understand why using 'apple'
> in a variable yielded something different than when I iterated over the
> text file. The end result was that the list of dictionary words
> ['a\n', 'b\n'] had \n, which was the extra character in the iteration I
> was referring to, and thanks to printing it out I was able to further
> isolate the problem through len().
>
> So rstrip() removed '\n' from the iterated term in the text file,
> yielding just the ['term'], and not ['term\n'].
>
> Print helps you see the info first hand.
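
The trailing-newline effect described above can be reproduced in a few lines.
This is only an illustration: the word list and the sample text are made up,
and print() is called as a function so that it runs on Python 3.

```python
import re

# Two entries as readlines() would return them from a one-word-per-line file.
lines = ["apple\n", "pear\n"]
text = "an apple a day"

raw = lines[0]             # 'apple\n', newline still attached
clean = raw.rstrip("\n")   # 'apple'

print(len(raw))                  # → 6
print(len(clean))                # → 5
print(re.findall(raw, text))     # → []  (the pattern demands a newline)
print(re.findall(clean, text))   # → ['apple']
```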
>
>>
>> What is the *meaning* of the function? "regexfiles" is a meaningless
>> name, and your description "I'm trying to iterate over the lines in a
>> textfile, and try to match with a regex pattern that iterates over the
>> lines in a dictionary" is confusing -- the first part is fine, but what
>> do you mean by a regex that iterates over the lines in a dictionary?
>>
>> What is the purpose of a numeric variable called "search"? It looks like
>> a counter to me, not a search, it is incremented each time through the
>> loop. The right way to do that is with enumerate(), not a manual loop
>> variable.
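
As a small illustration of the enumerate() suggestion, the two loops below
build the same (counter, word) pairs. The word list is made up, and the
hand-incremented counter is only a guess at what the original loop does with
its "search" variable.

```python
words = ["ape", "cat", "dog"]

# Hand-incremented counter.
manual = []
search = 0
for word in words:
    manual.append((search, word))
    search += 1

# enumerate() hands back the counter and the item together.
automatic = []
for i, word in enumerate(words):
    automatic.append((i, word))

print(manual == automatic)   # → True
```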
>>
>> Why do you call readlines() instead of read()? This makes no sense to
>> me. You then convert a list of strings into a single string like this:
>>
>> readlines() returns ['a\n', 'b\n', 'c\n']
>> calling str() returns "['a\n', 'b\n', 'c\n']"
>>
>> but the result includes a lot of noise that wasn't in the original
>> file: open and close square brackets, commas, quotation marks.
>>
>> I think perhaps you want ''.join(readlines()) instead, but even that is
>> silly, because you should just call read() and get 'a\nb\nc\n'.
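
The difference between the three approaches can be checked directly. In this
sketch io.StringIO stands in for the real dictionary file (an assumption made
so that the example needs no file on disk), and it is written for Python 3.

```python
import io


def fake_file():
    """Return a fresh file-like object holding three one-letter lines."""
    return io.StringIO("a\nb\nc\n")


lines = fake_file().readlines()   # list of lines, newlines kept
as_str = str(lines)               # repr of the list, brackets and all
joined = "".join(lines)           # the file contents, rebuilt the long way
whole = fake_file().read()        # the file contents in one call

print(lines)             # → ['a\n', 'b\n', 'c\n']
print(len(as_str))       # → 21
print(len(whole))        # → 6
print(joined == whole)   # → True
```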
>>
>> You should split this up into multiple functions. It will make
>> comprehension, readability, debugging and testing all much, much
>> easier. Code should be self-documenting -- ideally you will never need
>> to write a comment, because what the code does should be obvious from
>> your choice of function and variable names.
>>
>> I don't understand why you're using regex here. If I'm guessing
>> correctly, the entries in the dictionary are all ordinary (non-regex)
>> strings, like:
>>
>> ape
>> cat
>> dog
>>
>> only many, many more words :)
>>
>> Given that, using the re module is like using a bulldozer to crack a
>> peanut, and about as efficient. Instead of re.search(target, text) you
>> should just use text.find(target). There's no built-in equivalent to
>> re.findall, but it's not hard to write one:
>>
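>> def findall(text, target):
>>    results = []
>>    start = 0
>>    p = text.find(target, start)
>>    while p != -1:
>>        results.append(p)
>>        # Resume one character past the last hit; without a start
>>        # argument, find() would return the same position forever.
>>        p = text.find(target, p + 1)
>>    return results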
>>
>> (This returns a list of starting positions rather than words. It's easy
>> to modify to do the other -- just append target instead of p.)
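
A sketch of that word-returning variant follows, compared against re.findall
on a made-up sentence. The name findall_words is hypothetical. One difference
to keep in mind: resuming the search at p + 1 also finds overlapping
occurrences, which re.findall does not, so the two only agree when the
matches don't overlap.

```python
import re


def findall_words(text, target):
    """Return one copy of target for each place it occurs in text."""
    results = []
    p = text.find(target)
    while p != -1:
        results.append(target)
        # Continue one character past the previous hit.
        p = text.find(target, p + 1)
    return results


text = "the cat sat on the mat"
print(findall_words(text, "at"))           # → ['at', 'at', 'at']
print(re.findall(re.escape("at"), text))   # → ['at', 'at', 'at']
```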
>>
>>
>> Anyway, here's my attempt to re-write your function. I've stuck to
>> regexes just in case, and there's lots of guess-work here, because I
>> don't understand what you're trying to accomplish, but here goes
>> nothing:
>
> The above should explain a little more, and tomorrow, I'll thoroughly
> review your post.
>>
>>
>> import re
>>
>> # Change this to use real English  *wink*
>> DEFAULT_DICT =  '/var/www/main/american-english'
>>
>> def get_dict_words(filename=''):
>>    """Return a list of words from the given dictionary file.
>>
>>    If not given, a default dictionary is used.
>>    The format of the file should be one word per line.
>>    """
>>    if not filename:
>>        filename = DEFAULT_DICT
>>    # Don't use file, use open.
>>    dictionary = open(filename)
>>    words = dictionary.readlines()
>>    # Best practice is to explicitly close files when done.
>>    dictionary.close()
>>    return words
>>
>>
>> def search(target, text):
>>    """Return the result of two different searches for target in text.
>>
>>    target should be a regular expression or regex object; text should
>>    be a string.
>>
>>    Result returned is a two-tuple:
>>    (list of matches, regex match object or None)
>>    """
>>    if isinstance(target, str):
>>        target = re.compile(target)
>>    a = target.findall(text)
>>    b = target.search(text)
>>    return (a, b)
>>
>>
>> def print_stuff(i, target, text, a, b):
>>    # Boring helper function to do printing.
>>    print "counter = %d" % i
>>    print "target = %s" % target
>>    print "text = %s" % text
>>    print "findall = %s" % a
>>    print "search = %s" % b
>>    print
>>
>>
>> def main(text):
>>    """Print the results of many regex searches on text."""
>>    for i, word in enumerate(get_dict_words()):
>>        a, b = search(word, text)
>>        print_stuff(i, word, text, a, b)
>>
>>
>> # Now call it (filename must hold the path of the text file to search):
>> fp = open(filename)
>> text = fp.read()
>> fp.close()
>> main(text)
>>
>>
>>
>> I *think* this should give the same results you got.
>>
>>
>> --
>> Steven D'Aprano