[Tutor] regarding minhash and lsh

lokesh kumar lokesh.20111994 at gmail.com
Mon Feb 11 04:13:49 EST 2019


Hi there,
I want to write a program that compares DNA sequences so that I can find
the similarity between them. There are millions of files and the sequences
are long. I tried to write such a program but did not succeed. I think
MinHash and LSH may be able to solve my problem.
I need a program that is easy to use.

from scipy.spatial.distance import cosine
from random import randint
import numpy as np
N = 128                # number of hash functions (signature length)
max_val = (2**32)-1

# one (a, b) coefficient pair per hash function h(x) = (a*x + b) % prime
perms = [(randint(0, max_val), randint(0, max_val)) for _ in range(N)]

def minhash(s, prime=4294967311):
  '''
  Given a set `s`, pass each member of the set through all permutation
  functions, and set the `ith` position of `vec` to the `ith` permutation
  function's output if that output is smaller than `vec[i]`.
  '''

  vec = [float('inf') for i in range(N)]

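  for val in s:
    # map non-integer members (e.g. strings) to integers first
    if not isinstance(val, int):
      val = hash(val)

    # apply every hash function and keep the smallest output per position
    for perm_idx, (a, b) in enumerate(perms):
      output = (a * val + b) % prime
      if vec[perm_idx] > output:
        vec[perm_idx] = output

  # return only after every member of the set has been processed
  return vec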
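
For reference, here is a minimal, self-contained sketch of how signatures like the ones above might be used to compare DNA sequences. It is one possible arrangement, not a definitive implementation. Everything in it that does not appear in the code above is an assumption made for illustration: the helper names (`kmers`, `signature`, `estimate_jaccard`, `lsh_candidates`), the k-mer length of 5, the 32 bands, the fixed seed, and the example sequences. It also swaps the built-in `hash()` for `zlib.crc32`, because `hash()` on strings changes between Python runs unless `PYTHONHASHSEED` is set.

```python
import random
import zlib
from collections import defaultdict

N = 128                    # signature length (number of hash functions)
PRIME = 4294967311         # same prime as in the code above
MAX_VAL = 2**32 - 1

rng = random.Random(42)    # fixed seed (assumption) so signatures are repeatable
PERMS = [(rng.randint(1, MAX_VAL), rng.randint(0, MAX_VAL)) for _ in range(N)]

def kmers(seq, k=5):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def signature(items):
    """MinHash signature: the minimum of each hash function over the set."""
    sig = [float('inf')] * N
    for item in items:
        # crc32 gives the same integer on every run, unlike hash() on str
        val = zlib.crc32(item.encode())
        for i, (a, b) in enumerate(PERMS):
            out = (a * val + b) % PRIME
            if out < sig[i]:
                sig[i] = out
    return sig

def estimate_jaccard(sig1, sig2):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

def lsh_candidates(signatures, bands=32):
    """Return pairs of names whose signatures agree on at least one band."""
    rows = N // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        ordered = sorted(names)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs

# made-up example sequences, for illustration only
seqs = {
    "seq1": "ACGTACGTTAGCCGATAGCTAGCTAGGATCCA",
    "seq2": "ACGTACGTTAGCCGATAGCTAGCTAGGATCCA",  # identical to seq1
    "seq3": "TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAA",
}
sigs = {name: signature(kmers(s)) for name, s in seqs.items()}
print(estimate_jaccard(sigs["seq1"], sigs["seq2"]))  # identical sets give 1.0
print(lsh_candidates(sigs))
```

The banding step means only sequences that land in the same bucket need a full comparison, which is what makes the approach practical for millions of files. More bands with fewer rows each catch less similar pairs, at the cost of more false candidates.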

