[Tutor] regarding minhash and lsh

Alan Gauld alan.gauld at yahoo.co.uk
Mon Feb 11 05:10:36 EST 2019


On 11/02/2019 09:13, lokesh kumar wrote:

> I want to write code to run over a few DNA sequences so that I can find
> similarities between them. The files number in the millions and the
> sequences are large, so I tried developing a program but it fails. I think
> minhash and LSH may be able to solve my problem.

Bear in mind that this is a general programming forum and
relatively few of us are from a scientific background and
even fewer work with DNA sequences. So I'm glad you have
an idea about your solution, but I have no idea what minhash
or LSH are, let alone whether they will help you.

Also you may find more people who understand your
work area on the scipy forums.

> I need a kind of program that will be easy to handle.

What does that mean? Easy to operate? Easy to maintain?
Easy to distribute to others? All of the above?
Or something else?

As for your code, I'm not sure what it is for.
Do you want us to critique it?
Or is there a problem?
If so, you will need to describe the issue and include
any error messages.

At the moment it just defines a function which is
never called...

> from scipy.spatial.distance import cosine
> from random import randint
> import numpy as np
> N = 128
> max_val = (2**32)-1
> 
> perms = [ (randint(0,max_val), randint(0,max_val)) for i in range(N)]
> vec = [float('inf') for i in range(N)]
> 
> def minhash(s, prime=4294967311):
>   '''
>   Given a set `s`, pass each member of the set through all permutation
>   functions, and set the `ith` position of `vec` to the `ith` permutation
>   function's output if that output is smaller than `vec[i]`.
>   '''
>   vec = [float('inf') for i in range(N)]
> 
>   for val in s:
>     if not isinstance(val, int): val = hash(val)
> 
>     for perm_idx, perm_vals in enumerate(perms):
>       a, b = perm_vals
>       output = (a * val + b) % prime
>        if vec[perm_idx] > output:
>         vec[perm_idx] = output
> 
>     return vec

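Notice that the last if statement (the line starting
"if vec[perm_idx] > output:") is indented one space further
than the line above it, which Python will reject with an
IndentationError. But that may just be an email glitch...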
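
For what it's worth, here is a minimal sketch of how the quoted
function might look once the indentation is consistent and the
function is actually called. It is not a tested solution for
DNA data. Note that, as quoted, "return vec" sits inside the
"for val in s:" loop, so the function would return after the
first member of the set; the sketch moves it out. The helpers
kmers() and estimate_similarity(), the fixed seed, and the use
of zlib.crc32 in place of hash() (whose result for strings
changes between Python runs) are my own assumptions, not part
of the original code. The unused scipy and numpy imports are
dropped.

```python
import random
import zlib

N = 128                    # number of hash functions (signature length)
MAX_VAL = (2**32) - 1
PRIME = 4294967311         # the prime from the quoted code, just above 2**32

rng = random.Random(42)    # fixed seed (an assumption) so runs are repeatable
# 'a' starts at 1 so no hash function collapses to a constant
PERMS = [(rng.randint(1, MAX_VAL), rng.randint(0, MAX_VAL)) for _ in range(N)]


def kmers(seq, k=5):
    """Hypothetical helper: the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash(s, perms=PERMS, prime=PRIME):
    """Return the minhash signature of the set s, one value per hash function."""
    vec = [float('inf')] * len(perms)
    for val in s:
        if not isinstance(val, int):
            # crc32 gives the same integer on every run, unlike hash()
            val = zlib.crc32(str(val).encode())
        for perm_idx, (a, b) in enumerate(perms):
            output = (a * val + b) % prime
            if vec[perm_idx] > output:
                vec[perm_idx] = output
    return vec             # after the loop, once every member has been seen


def estimate_similarity(sig1, sig2):
    """Hypothetical helper: fraction of positions where two signatures agree."""
    matches = sum(x == y for x, y in zip(sig1, sig2))
    return matches / len(sig1)


if __name__ == "__main__":
    # Two made-up sequences that differ in two letters
    a = kmers("ACGTACGTTAGCCGATAGCTAGCTAGGATCCA")
    b = kmers("ACGTACGTTAGCCGATAGCTAGCTAGGTTCCA")
    print(estimate_similarity(minhash(a), minhash(b)))
```

The printed number is only an estimate of how much the two
sets of substrings overlap, and it depends on the random
values in PERMS. The value of k is an arbitrary choice here.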

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



