Duplicate content filter

Abandoned besturk at gmail.com
Wed Oct 3 12:22:35 EDT 2007


Hi,
I'm working on a search engine project and I have a problem with
duplicate content.
I can compute the percentage of similarity between two pages, but my
index holds 5 million pages, so to find one duplicate I have to compare
against all 5 million page contents :(
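
To make the problem concrete, here is a rough sketch of the kind of
full scan described above. It is only an illustration: the function
names, the 90% threshold, and the use of difflib.SequenceMatcher as the
similarity measure are assumptions for this example, not the actual
code of the project.

```python
from difflib import SequenceMatcher


def similarity_percent(a, b):
    """Return the similarity of two page texts as a percentage (0-100)."""
    # SequenceMatcher.ratio() returns a float in [0, 1].
    return SequenceMatcher(None, a, b).ratio() * 100.0


def find_duplicates(new_page, indexed_pages, threshold=90.0):
    """Compare new_page against every indexed page.

    indexed_pages is a hypothetical dict mapping page id -> page text.
    Each lookup walks the whole index, which is what becomes too slow
    with millions of pages.
    """
    duplicates = []
    for page_id, text in indexed_pages.items():
        if similarity_percent(new_page, text) >= threshold:
            duplicates.append(page_id)
    return duplicates
```

Every lookup costs one similarity computation per indexed page, so a
single check against 5 million pages means 5 million comparisons.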

I'm looking for an idea for how to find duplicate pages quickly.

Please help me, and sorry for my bad English.
Kind regards,




More information about the Python-list mailing list