Duplicate content filter..

Lawrence D'Oliveiro ldo at geek-central.gen.new_zealand
Fri Oct 5 00:45:37 EDT 2007

Previous message (by thread): Duplicate content filter..
Next message (by thread): Strange generator problem
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

In message <1191428555.278268.253700 at g4g2000hsf.googlegroups.com>, Abandoned
wrote:

> I want an idea for how I can find duplicate pages quickly.

Compute a hash of a canonicalized version of the content? Disregard
whitespace, line wrapping, upper/lower case, and possibly even punctuation,
so that pages differing only in those respects get the same hash.
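
A minimal sketch of that idea, under some assumptions of mine: the names
canonicalize, content_hash and find_duplicates are hypothetical, SHA-256 is
just one reasonable choice of hash, and only ASCII punctuation is stripped.
It catches pages that are identical after normalization, not pages that are
merely similar.

```python
import hashlib
import string

# Translation table that deletes ASCII punctuation. Which characters count
# as punctuation is an assumption of this sketch, not part of the suggestion.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def canonicalize(text, strip_punctuation=True):
    """Return text with case, punctuation and whitespace differences removed."""
    text = text.lower()
    if strip_punctuation:
        text = text.translate(_PUNCT_TABLE)
    # split() with no arguments breaks on any run of whitespace, including
    # newlines, so line wrapping and extra spaces disappear here.
    return " ".join(text.split())


def content_hash(text, strip_punctuation=True):
    """Hash the canonical form so equivalent pages get the same digest."""
    canonical = canonicalize(text, strip_punctuation)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Group page names by content hash.

    pages maps a page name to its text. Returns a list of name lists,
    one per group of pages that share a hash.
    """
    groups = {}
    for name, text in pages.items():
        groups.setdefault(content_hash(text), []).append(name)
    return [names for names in groups.values() if len(names) > 1]
```

Each page is hashed once and grouped through a dict, so the work grows
linearly with the total amount of text rather than with the number of page
pairs.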


More information about the Python-list mailing list