Duplicate content filter

Lawrence D'Oliveiro ldo at geek-central.gen.new_zealand
Fri Oct 5 00:45:37 EDT 2007


In message <1191428555.278268.253700 at g4g2000hsf.googlegroups.com>, Abandoned
wrote:

> I want an idea for how I can find duplicate pages quickly.

Compute a hash based on a canonicalized version of the content? Disregard
white space, line wrapping, upper/lower case, and possibly even punctuation,
so that you get the same hash in spite of these differences.
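A minimal sketch of that idea in Python, assuming the page text has already
been extracted as a string (the function name and the exact normalisation
choices here are just illustrative):

import hashlib
import string

def content_fingerprint(text):
    # Lower-case, strip punctuation, and collapse all runs of
    # whitespace so trivial formatting differences disappear.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    canonical = " ".join(text.split())
    # Hash the canonical form; equal fingerprints mean the pages
    # are (near-certainly) duplicates.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

Two pages are then duplicates when their fingerprints match, so the check
becomes a set/dict lookup per page instead of a pairwise comparison of all
pages.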


