Programming challenge: wildcard exclusion in cartesian products

Thu Mar 23 12:26:15 EST 2006

Dirk Thierbach wrote:
> Dinko Tenev <dinko.tenev at gmail.com> wrote:
> > Dirk Thierbach wrote:
> >> If more time during preprocessing is allowed, another idea is to
> >> treat the wildcard expressions as regular expressions, convert
> >> each into a finite state machine, construct the "intersection" of
> >> all these state machines, minimize it and then swap final and non-final
> >> states.
>
> > Given the requirements, did you mean taking the *union* and swapping
> > states?  Or maybe swapping states first, and then taking the
> > intersection?
>
> Whatever the requirements were. Take your pick. :-)

OK :)

> >> Then you can use the resulting automaton to efficiently
> >> enumerate S^n - W. In the above case, the resulting FSM would have just
> >> three states.
>
> > I don't see immediately how exactly this is going to work.  Unless I'm
> > very much mistaken, a FSA in the classical sense will accept or reject
> > only after the whole sequence has been consumed, and this spells
> > exponential times.
>
> Exponential in respect to what?

With respect to sequence length.  In the example above (S = {a, b}, W =
{*a*b, *b*a}) you have to enumerate |S^n| potential sequences to get to
the couple (a^n, b^n) that are of particular interest.

> You just recursively walk the
> automaton for every word of length n, so at most you'll have to check
> every word in S^n once.

That's right -- the sky is the limit ;)

> Hence, the complexity is not worse than the
> "generate and test" approach (it's in fact better, since testing is
> trivial).

Testing per sequence will be faster, yes, but the overall running time
will still be a disaster, just like with the "generate and test"
solutions.  The Python program tries to do better than that, and it
succeeds some of the time.

> However, if the result set is simple (as in the example), then the
> result FSM will have states that won't have transitions for every
> letters, so I guess the average case will be a lot better.

I don't believe this case should ever arise out of the current
definition of the problem: labels are specified explicitly only for the
excluded subset, and you have to accept everything else by default.

> > For improved asymptotic complexity in this case,
> > you need to be able to at least reject in mid-sequence,
>
> One can do that if there is no valid transition out of some state.

One possibly can, but such states should never be encountered (see
above.)

Looking at the minimal DFA for the above example, it may be more
appropriate to detect being in a state from which there's no path to a
final state:

S: a -> A, b -> B
A: a -> A, b -> F
B: a -> F, b -> B
F: a -> F, b -> F

Here, S is the starting state, {S, A, B} are final, and F is non-final
(after swapping.)  Obviously, the smart thing to do is to bail out as
soon as you're in F.  The point is, though, are things guaranteed to be
as simple in the general case? :)

> I guess one could even optimize on that: In the minimal automaton,
> every state has some transition sequence that will end up in a final
> state. Annotate each state with the minimum number of steps for
> such a sequence. While walking the automaton, keep the maximum
> number of letters to produce in a variable. If this number is small
> then the number in the annotation, stop and backtrace.

I don't think this could cut efficiently for patterns like *a*b.

Cheers,

Dinko