[Tutor] regarding minhash and lsh
lokesh kumar
lokesh.20111994 at gmail.com
Mon Feb 11 04:13:49 EST 2019
Hi there,
I want to write code that compares a number of DNA sequences so that I can
find similarities between them. There are millions of files and the
sequences are long. I tried developing a program, but it failed. I think
MinHash and LSH may be able to solve my problem.
I need a program that is easy to handle.
from scipy.spatial.distance import cosine
from random import randint
import numpy as np

N = 128                    # number of hash functions (signature length)
max_val = (2**32) - 1
# Each (a, b) pair defines one hash function h(x) = (a*x + b) % prime.
# a starts at 1 because a == 0 would map every value to the same output.
perms = [(randint(1, max_val), randint(0, max_val)) for i in range(N)]

def minhash(s, prime=4294967311):
    '''
    Given a set `s`, pass each member of the set through all permutation
    functions, and set the `ith` position of `vec` to the `ith` permutation
    function's output if that output is smaller than `vec[i]`.
    '''
    vec = [float('inf') for i in range(N)]
    for val in s:
        # Non-integer members (e.g. strings) are mapped to integers first.
        if not isinstance(val, int):
            val = hash(val)
        for perm_idx, perm_vals in enumerate(perms):
            a, b = perm_vals
            output = (a * val + b) % prime
            # Keep the smallest output seen so far for this hash function.
            if vec[perm_idx] > output:
                vec[perm_idx] = output
    return vec
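
A minimal sketch of how a function like the one above might be applied to DNA data, under several assumptions that are not in the original post: each sequence is split into overlapping k-mers (k=5 here, chosen arbitrarily), strings are hashed with zlib.crc32 rather than the built-in hash() (which is salted per process for strings, so signatures would not be comparable between runs), and LSH is done by simple banding with 32 bands of 4 rows. The names kmers, signature, estimate_jaccard and lsh_candidates, the seed, and the three example sequences are all made up for illustration.

```python
import random
import zlib
from collections import defaultdict

N = 128                  # signature length, as in the code above
PRIME = 4294967311       # same prime as above (just over 2**32)
MAX_VAL = (2**32) - 1

# Fixed seed (an assumption) so every run uses the same hash functions.
rng = random.Random(42)
PERMS = [(rng.randint(1, MAX_VAL), rng.randint(0, MAX_VAL)) for _ in range(N)]

def kmers(seq, k=5):
    """Return the set of overlapping k-mers (shingles) of a DNA string."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def signature(shingles):
    """MinHash signature of a set of strings."""
    sig = [float('inf')] * N
    for sh in shingles:
        # crc32 gives the same 32-bit value in every process.
        val = zlib.crc32(sh.encode('ascii'))
        for i, (a, b) in enumerate(PERMS):
            out = (a * val + b) % PRIME
            if out < sig[i]:
                sig[i] = out
    return sig

def estimate_jaccard(sig1, sig2):
    """Fraction of signature positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)

def lsh_candidates(signatures, bands=32):
    """Return pairs of ids whose signatures agree on at least one whole band."""
    rows = N // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            # Sequences land in the same bucket only if this band is identical.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        ordered = sorted(names)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs

# Made-up example data: seq_a and seq_b differ only in the last base.
seqs = {
    'seq_a': 'ACGTACGTTAGCTAGCTAGGATCCGATCGATCGTTAGC',
    'seq_b': 'ACGTACGTTAGCTAGCTAGGATCCGATCGATCGTTAGG',
    'seq_c': 'TTTTTTTTTTGGGGGGGGGGCCCCCCCCCCAAAAAAAAAA',
}
sigs = {name: signature(kmers(s)) for name, s in seqs.items()}
print(estimate_jaccard(sigs['seq_a'], sigs['seq_b']))
print(lsh_candidates(sigs))
```

With millions of sequences, the point of the banding step is that only the pairs returned by lsh_candidates need a full comparison, instead of every possible pair. The values of k, N and the number of bands trade accuracy against speed and would need tuning on real data.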