How to pick out the same titles.

duncan smith duncan at invalid.invalid
Sun Oct 16 14:19:57 EDT 2016


On 16/10/16 16:16, Seymore4Head wrote:
> How to pick out the same titles.
> 
> I have a long text file that has movie titles in it and I would like
> to find dupes.
> 
> The thing is that sometimes I have one called "The Killing Fields" and
> it could also be listed as "Killing Fields".  Sometimes the title will
> have the date a year off.
> 
> What I would like to do is output to another file that shows those two
> as a match.
> 
> I don't know the best way to tackle this.  I would think you would
> have to pair the titles with the most consecutive letters in a row.
> 
> Anyone want this as a practice exercise?  I don't really use
> programming enough to remember how.
> 

Tokenize, generate (token) set similarity scores and cluster on
similarity score.


>>> import tokenization
>>> bigrams1 = tokenization.n_grams("The Killing Fields".lower(), 2, pad=True)
>>> bigrams1
['_t', 'th', 'he', 'e ', ' k', 'ki', 'il', 'll', 'li', 'in', 'ng', 'g ', ' f', 'fi', 'ie', 'el', 'ld', 'ds', 's_']
>>> bigrams2 = tokenization.n_grams("Killing Fields".lower(), 2, pad=True)
>>> import pseudo
>>> pseudo.Jaccard(bigrams1, bigrams2)
0.7
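
FWIW the standard library's difflib will also score string similarity
directly; SequenceMatcher.ratio() is roughly in the spirit of the "most
consecutive letters in a row" idea in the original post. A minimal
sketch, not the approach I used above:


from difflib import SequenceMatcher

# ratio() is based on matching blocks of characters rather than token
# sets, so there is no tokenization step.
score = SequenceMatcher(None, "the killing fields", "killing fields").ratio()
print(score)    # roughly 0.88 for this pair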


You could probably just generate token sets, then iterate through all
title pairs and manually review those with similarity scores above a
suitable threshold. The code I used above is very simple (and pasted below).


def n_grams(s, n, pad=False):
    # n >= 1
    # returns a list of n-grams
    # or an empty list if n > len(s)
    if pad:
        s = '_' * (n-1) + s + '_' * (n-1)
    return [s[i:i+n] for i in range(len(s)-n+1)]

def Jaccard(tokens1, tokens2):
    # returns exact Jaccard
    # similarity measure for
    # two token sets
    tokens1 = set(tokens1)
    tokens2 = set(tokens2)
    return len(tokens1&tokens2) / len(tokens1|tokens2)
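

If the titles are one per line in a file, a rough sketch of the pairwise
scan might look like the following. The filenames, the 0.6 threshold and
the year stripping are just guesses to adapt, and candidate_matches is my
name for it, not anything tested against real data.


import re
from itertools import combinations

def candidate_matches(infile, outfile, threshold=0.6):
    # Read titles (one per line), score every pair with the n_grams
    # and Jaccard functions above, and write likely duplicates to
    # another file for manual review.
    with open(infile) as f:
        titles = [line.strip() for line in f if line.strip()]
    token_sets = {}
    for title in titles:
        # Optionally drop a trailing "(1984)" style year before tokenizing,
        # so titles that differ only in the year still score highly.
        key = re.sub(r'\s*\(\d{4}\)\s*$', '', title.lower())
        token_sets[title] = set(n_grams(key, 2, pad=True))
    with open(outfile, 'w') as out:
        for t1, t2 in combinations(titles, 2):
            score = Jaccard(token_sets[t1], token_sets[t2])
            if score >= threshold:
                out.write('%.2f\t%s\t%s\n' % (score, t1, t2))


Something like candidate_matches('titles.txt', 'dupes.txt') would then
leave a tab separated file of suspect pairs to eyeball; the threshold
will need experimenting with.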


Duncan




