reduction

Tue May 31 11:16:43 EDT 2016

> My problem. I have lists of substrings associated to values:
> 
> ['a','b','c','g'] => 1
> ['a','b','c','h'] => 1
> ['a','b','c','i'] => 1
> ['a','b','c','j'] => 1
> ['a','b','c','k'] => 1
> ['a','b','c','l'] => 0  # <- Black sheep!!!
> ['a','b','c','m'] => 1
> ['a','b','c','n'] => 1
> ['a','b','c','o'] => 1
> ['a','b','c','p'] => 1
> 
> I can check a bit of data against elements in this list and determine whether
> the value to be associated to the data is 1 or 0.
> 
> I would like to make my matching algorithm smarter so I can reduce the total
> number of lists:
> 
> ['a','b','c','l'] => 0  # If "l" is in my data I have a zero
> ['a','b','c'] => 1      # or a more generic match will do the job
> 
> I am trying to think of a way to perform this "reduction", but I have a feeling I
> am reinventing the wheel.
> 

Well, you are right about it being vague <grin>.

Ultimately, this would depend on knowing more about what you are trying to achieve and what types of data you are actually using, as well as what the ranges of potential values are.

So, for your simple example, I would do exactly what you suggested (check for the "l" and call it zero, otherwise check for the prefix and call it one.

but, if there are lots of potential different black sheep, and/or other potential answers, this would get complex pretty quickly.  Ian Kelly noted using a trie structure, allowing searching by prefix's, that could make sense if your lists always differ at the end as in your example, but might not make sense if the data is more random.   

There are lots of searching, sorting, and organizing modules and data structures out there, but without a bit more info, I am not sure that any response is going to be better than any other.

Some things to consider are, 
- what is predictable, and what is not... and how do you determine the value from the predictable parts.
     (your example was very predictable; hence it was very easy)
- how is the determination handled?  (Math?  String matching?)
    (if the method of determining it can be handled with math, you may be able to craft a formula to figure it out faster)
- how many of each object type are you looking at?
  (if you are comparing < 100 objects, sometimes brute force is much easier anyway, if > 1,000,000, using a database or other method may be required)

And some options to look at:
- trie structure (as Ian noted)
   (good if the data varies based on how "deep" in the tree you are.)
- regex search/match
    (good for more text based comparisons than you "seem" to be suggesting)
- build a custom class that has the intelligence to determine the value for each object itself.
   (good for more complex issues, but probably increases the size of each object)
- create an index or caching structure of some sort as you find the objects,
  (good for saving computation time if the determination is "hard" and you hit the same ones again and again)
- create a dictionary structure instead of a list.
  (good for fast lookups with known data)
- using something like pandas or another data analysis library (SciPy, NumPy, iPython Notebook).
  (especially good if your data is more numbers based than you seem to be indicating)
- writing the data to a database and using that for matching.
  (good If you have LOTS of data to compare).

Some of these are probably overkill, some probably won't work at all for what you are trying to achieve.

Dan