Bug in RE's?

Fri Oct 18 19:50:20 EDT 2002

On Fri, 18 Oct 2002 at 18:14 GMT, Robin Siebler <robin.siebler at corp.palm.com> wrote:
> Robin Munn <rmunn at pobox.com> wrote in message 
> 
>>What's the RE you're searching against? 
> I have about 30 words that I am searching for.  Here is the code I am
> using to generate the RE:
> 
> pattern = '\w+word\w+|\w+word|word\w+|word'
> temp_pattern = ''
> for word in ExcludeWords: 
>     temp_pattern = pattern.replace('word', word.lower()) + '|' + temp_pattern
> s_word = re.compile(temp_pattern[0:-1], re.IGNORECASE)  # Strip the last '|'
> 
> The length of temp_pattern is 1024

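Why are you making that RE needlessly complicated? Any line that one of
the other sections matches will also be matched by the final '|word'
section, so if all you need to know is whether a line contains the word,
why not just say:

    pattern = 'word'

And then, while you're at it, this whole section of code could become:

    s_word = re.compile('|'.join(ExcludeWords), re.IGNORECASE)

There -- isn't that simpler? It should be faster, too, since the RE
engine no longer has to try the three \w+ alternatives before it gets to
the plain word.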
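
Here is a rough sketch of that simplification, checked against the original four-alternative pattern. The three-word ExcludeWords list and the sample lines are made up for illustration, and the re.escape() call is an addition that is not in the original code; it only matters if a word contains RE metacharacters.

```python
import re

# Hypothetical stand-in for the real list of about 30 words.
ExcludeWords = ["foo", "bar", "c++"]

# The original approach: four alternatives per word, joined with '|'.
template = r'\w+word\w+|\w+word|word\w+|word'
long_pattern = '|'.join(
    template.replace('word', re.escape(w.lower())) for w in ExcludeWords
)
s_long = re.compile(long_pattern, re.IGNORECASE)

# The simplified approach: one alternative per word.
# re.escape() is an assumption added here so that "c++" is taken literally.
s_word = re.compile('|'.join(re.escape(w) for w in ExcludeWords), re.IGNORECASE)

lines = ["a Foobar here", "nothing to see", "uses C++ daily", "rebar"]

# Both REs agree on which lines contain one of the words.
for line in lines:
    assert bool(s_long.search(line)) == bool(s_word.search(line))

hits = [line for line in lines if s_word.search(line)]
```

The two REs agree on whether a line matches, but not on what they capture: on "rebar" the long pattern returns the whole word, while the short one returns only "bar". If the surrounding word is needed, the short pattern alone is not a drop-in replacement.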

I have another comment on your code, too -- this one's a comment on the
code you put up at the URL listed below. I think this is a simple typo
you made when copying the FunkyList class:

----- Code snippet below -----
class FunkyList(Base):
    """
        Source: Python Cookbook
            Credit: Alex Martelli
    """
    def __init__(self, initlist=None):
        Base.__init__(self, initlist)
        self._dict_ok = 0
def __contains__(self, item):
     if not self.dict_ok:
         self._dict = dict(zip(self,self))
         self.dict_ok = 1
     return item in self._dict
----- End code snippet -----

Two things here: first, the __contains__ function should be indented to
the same level as the __init__ function if (as I assume) it's supposed
to be a member function of the FunkyList class. Remember, indentation is
significant in Python. The way you've got it written now, __contains__()
is considered a top-level function instead of being a member function of
the FunkyList class.

Second, you left out the leading underscore on two of the three
instances of _dict_ok.
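
With both fixes applied, the class might look like the sketch below. The posted snippet does not show what Base is, so using collections.UserList for it is an assumption, and the two-word list at the end is only there to exercise the code.

```python
# Assumption: Base is a UserList-style class; the snippet does not say.
from collections import UserList as Base


class FunkyList(Base):
    """
        Source: Python Cookbook
            Credit: Alex Martelli
    """
    def __init__(self, initlist=None):
        Base.__init__(self, initlist)
        self._dict_ok = 0

    # Indented to the same level as __init__, so it is a method of the class.
    def __contains__(self, item):
        # Build the lookup dict on the first membership test only.
        if not self._dict_ok:
            self._dict = dict(zip(self, self))
            self._dict_ok = 1
        return item in self._dict


# Hypothetical usage.
words = FunkyList(["spam", "eggs"])
found = "spam" in words
missing = "ham" in words
```

Note that nothing in this sketch resets _dict_ok when the list changes, so it is only safe for a list that is not modified after the first membership test.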

> 
>> When you say the RE "stops working in the same place and starts
>> working again in the same place" do you mean the same place in the RE
>> or the same place in the file you're searching?
> 
> The RE always stops working on the exact same line in the exact same
> file.  When it starts working again, it starts working on the exact
> same line in the exact same file.  Sometimes it stops and starts
> working within the same file and sometimes it stops working in one
> file and doesn't start working until the next file, however, this is
> always consistent as well.

Consistent is good. Consistent bugs are bugs that can be reproduced,
quickly nailed down, and fixed. It's the INconsistent ones that are a
pain in the anatomy.

>  
>> If the RE and/or the example file(s) are short, post them here and we
>> can try to figure out what's going on. If they're long, you might put
>> them up somewhere and post a URL. It would also help if you would quote
>> the relevant line(s) from the file(s) where the RE starts and/or stops
>> working.
>  
> I ran the good version of my script against a bunch of .py files and I
> ran the bad version against the same files.  I copied the logs
> generated by the scripts, the scripts, and all of the files the
> scripts need to
> https://secure.americasnet.com/321.net/files/robinsiebler/public/. 
> You will need to install Optik (http://optik.sourceforge.net/) to run
> my scripts.

I still can't reproduce your bug because I still don't have everything I
need to do so. To reproduce your bug, I would need:

1. A copy of the data file(s) that are triggering the bug
2. A clear explanation of what the expected behavior was, and how and
   when the observed behavior differs from the expected behavior

Remember: the person who's looking at your code doesn't know the code,
and doesn't have the same data files that you have. Think about exactly
what he'll need to be able to reproduce your bug, and give him that.

Beyond the subject of reproducing this bug, I have one further comment.
Looking at your code, it looks like there is a LOT of code devoted to
handling special cases. Special-case code, in my experience, is where
bugs like to hide about 95% of the time. I would suggest that you do a
complete re-write of your search code, starting with only the basics.
Read a line at a time, search for a simple RE, print matches. That's it;
no bells and whistles yet. Then once you're satisfied that that code is
always working right, start adding the special cases, ONE AT A TIME. And
test thoroughly after each one.
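
As a minimal sketch of that starting point, something like the following would do. The function name find_matches, the sample lines, and the 'foo|bar' pattern are all hypothetical; an open file object can be passed in place of the list.

```python
import re


def find_matches(lines, pattern):
    """Yield (line_number, line) for every line the pattern matches."""
    regex = re.compile(pattern, re.IGNORECASE)
    for lineno, line in enumerate(lines, start=1):
        if regex.search(line):
            yield lineno, line.rstrip("\n")


# Hypothetical sample data standing in for a real file.
sample = ["import os\n", "x = FOO + 1\n", "print(x)\n"]

for lineno, line in find_matches(sample, "foo|bar"):
    print("%d: %s" % (lineno, line))
```

Keeping the search in one small function with no special cases makes it easy to confirm that the basic behavior is right before anything else is layered on top.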

What I'm describing here is unit testing. Break your code down into
comprehensible units and test each one until you're sure there are no
bugs in that function. Then move on to the next. Look at the unittest
module for an easy way to write your unit tests.
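
A small unittest sketch along those lines is shown below. The helper line_matches and the test cases are invented for illustration and are not taken from the scripts under discussion.

```python
import re
import unittest


def line_matches(pattern, line):
    """Return True if the pattern is found anywhere in the line."""
    return re.search(pattern, line, re.IGNORECASE) is not None


class LineMatchesTest(unittest.TestCase):
    def test_plain_word(self):
        self.assertTrue(line_matches("foo", "a foo here"))

    def test_case_is_ignored(self):
        self.assertTrue(line_matches("foo", "a FOO here"))

    def test_word_inside_longer_word(self):
        self.assertTrue(line_matches("foo", "xfoobar"))

    def test_no_match(self):
        self.assertFalse(line_matches("foo", "nothing to see"))
```

Saved in a file, these tests can be run with "python -m unittest" followed by the module name. Each new special case then gets its own test method before the code for it is written.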

-- 
Robin Munn <rmunn at pobox.com>
http://www.rmunn.com/
PGP key ID: 0x6AFB6838    50FF 2478 CFFB 081A 8338  54F7 845D ACFD 6AFB 6838