Algorithm Question

Andrew McLean andrew-news at andros.org.uk
Sun Sep 10 12:56:56 EDT 2006


This really an algorithm question more that a Python question, but it 
would be implemented in Python....

I have a list of strings, A. I want to find a set of strings B such that 
for any "a in A" there exists "b in B" such that b is a sub-string of a. 
But I also want to minimise T = sum_j t_j where

t_j = count of the number of elements in A which have b[j] as a sub-string

My guess is that finding the smallest possible T satisfying the 
constraint would be hard. However, for my application just keeping it 
reasonably small would help.

In my case the list A contains over two million addresses.

The (top down) heuristic approach I am tempted to employ is to start by 
dividing the entries in A into sets of tokens, then take the union of 
all these sets as a starting point for B. Then I would try to trim B by

1. looking for elements that I could remove while still satisfying the 
constraint

2. replacing two elements by a common sub-string if that reduced T

Anyway. It occurred to me that this might be a known problem. Any 
pointers gratefully received.

- Andrew




More information about the Python-list mailing list