Using filepath method to identify an .html page

Dave Angel d at davea.name
Tue Jan 22 14:00:16 EST 2013


On 01/22/2013 01:26 PM, Ferrous Cranus wrote:
>
>> <snip>
>
> sub hashit {
>     my $url=shift;
>     my @ltrs=split(//,$url);
>     my $hash = 0;
>
>     foreach my $ltr(@ltrs){
>          $hash = ( $hash + ord($ltr)) %10000;
>     }
>     printf "%s: %0.4d\n",$url,$hash
>
> }
>
>
> which yields:
> $ perl testMD5.pl
> /index.html: 1066
> /about/time.html: 1547
>

If you use that algorithm to get a 4-digit number, it'll look good for 
the first few files.  But if you try 100 files, you've got almost a 40% 
chance of a collision (and that's assuming the values spread evenly over 
the 10000 possibilities), and if you try 10001, you've got a 100% chance.
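
Those figures come from the usual birthday-problem calculation.  Here is 
a minimal Python 3 sketch of it, assuming every file lands on one of the 
10000 values with equal probability.  The name collision_probability is 
made up for illustration.

```python
def collision_probability(num_files, num_slots=10000):
    """Chance that at least two of num_files share a value, assuming
    each value in range(num_slots) is equally likely (an idealization)."""
    if num_files > num_slots:
        return 1.0  # pigeonhole: more files than possible values
    p_all_distinct = 1.0
    for k in range(num_files):
        # The (k+1)-th file must avoid the k values already taken.
        p_all_distinct *= (num_slots - k) / num_slots
    return 1.0 - p_all_distinct


for n in (10, 100, 10001):
    print(n, round(collision_probability(n), 3))
```

The real figure for the summing hash is likely worse, since the sums of 
character codes for short paths bunch together instead of spreading 
evenly over 0 to 9999.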


So is it really okay to reuse the same integer for different files?
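
To make the question concrete, here is a rough Python 3 port of the 
quoted Perl routine (my translation, so treat it as a sketch).  Because 
it only adds up character codes, any two paths made of the same 
characters in a different order get the same integer.  The paths 
/ab.html and /ba.html are hypothetical examples.

```python
def hashit(url):
    """Sum of the character codes of url, modulo 10000."""
    total = 0
    for ch in url:
        total = (total + ord(ch)) % 10000
    return total


# Same value as the quoted Perl output for this path.
print("%s: %04d" % ("/index.html", hashit("/index.html")))  # /index.html: 1066

# Two different paths, same characters, same integer.
print(hashit("/ab.html") == hashit("/ba.html"))  # True
```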

I tried to help you when you were using the md5 algorithm.  By using 
enough digits/characters, you can make the likelihood of a collision 
quite small.  But 4 digits?  Don't be ridiculous.
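
As a sketch of what "enough digits" could look like in Python 3, the 
following keeps the first 16 hex characters of the md5 digest.  The 
function name page_id and the choice of 16 are assumptions for 
illustration, not something settled earlier in the thread.

```python
import hashlib


def page_id(path, digits=16):
    """Return the first `digits` hex characters of the md5 of `path`."""
    # md5 works on bytes, so encode the text path first.
    return hashlib.md5(path.encode("utf-8")).hexdigest()[:digits]


for path in ("/index.html", "/about/time.html"):
    print(path, page_id(path))
```

Sixteen hex characters give 2**64 possible values instead of 10**4, so 
an accidental collision among a few thousand pages becomes vanishingly 
unlikely, though never strictly impossible.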


-- 
DaveA


