Algorithm Question
Andrew McLean
andrew-news at andros.org.uk
Sun Sep 10 12:56:56 EDT 2006
This really an algorithm question more that a Python question, but it
would be implemented in Python....
I have a list of strings, A. I want to find a set of strings B such that
for any "a in A" there exists "b in B" such that b is a sub-string of a.
But I also want to minimise T = sum_j t_j where
t_j = count of the number of elements in A which have b[j] as a sub-string
My guess is that finding the smallest possible T satisfying the
constraint would be hard. However, for my application just keeping it
reasonably small would help.
In my case the list A contains over two million addresses.
The (top down) heuristic approach I am tempted to employ is to start by
dividing the entries in A into sets of tokens, then take the union of
all these sets as a starting point for B. Then I would try to trim B by
1. looking for elements that I could remove while still satisfying the
constraint
2. replacing two elements by a common sub-string if that reduced T
Anyway. It occurred to me that this might be a known problem. Any
pointers gratefully received.
- Andrew
More information about the Python-list
mailing list