Duplicate content filter

Michael mogmios at mlug.missouri.edu
Fri Oct 5 05:20:00 EDT 2007


Markov chains, maybe? You could probably build a pretty fast purpose-built
index of word-to-page relationships, or a normal database could do it quickly
enough. Break each page into words, remove common words (prepositions, etc.)
and keep a database of word:page pairs. Then simply go through the words on a
new page and look up other pages containing the same words. If any other page
scores too high, there you go. I'd probably go with the simple word:page
index, as it's a much lighter solution and could probably be made pretty
quick.
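Here's a minimal sketch of that idea in Python (the stop-word list, the
threshold, and the function names are just illustrative assumptions):

import re
from collections import defaultdict

# Illustrative stop-word list; a real one would be much larger.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "is", "on"}

word_to_pages = defaultdict(set)   # word -> ids of pages containing it

def tokenize(text):
    # Break the page into lowercase words and drop the common ones.
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS}

def check_and_index(page_id, text, threshold=0.8):
    # Count how many of the new page's words each known page shares.
    words = tokenize(text)
    scores = defaultdict(int)
    for w in words:
        for other in word_to_pages[w]:
            scores[other] += 1
    # Pages sharing too high a fraction of the words are likely duplicates.
    dupes = [p for p, hits in scores.items()
             if hits >= threshold * len(words)]
    # Index the new page so later pages get checked against it too.
    for w in words:
        word_to_pages[w].add(page_id)
    return dupes

Since each lookup only touches pages that share at least one uncommon word
with the new page, you never compare against all 5 million documents
directly.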

On 10/3/07, Abandoned <besturk at gmail.com> wrote:
>
> Hi..
> I'm working on a search engine project now, and I have a problem:
> duplicate contents.
> I can find the percentage of similarity between two pages, but I have an
> index of 5 million pages, and searching all 5 million page contents to
> find one duplicate is too slow :(
>
> I'd like an idea for how I can find duplicate pages quickly.
>
> Please help me; I'm sorry for my bad English.
> Kind regards..
>