Using filepath method to identify an .html page

Chris Angelico rosuav at gmail.com
Tue Jan 22 09:33:03 EST 2013


On Wed, Jan 23, 2013 at 12:57 AM, Ferrous Cranus <nikos.gr33k at gmail.com> wrote:
> On Tuesday, 22 January 2013 3:04:41 PM UTC+2, Steven D'Aprano wrote:
>
>> What do you expect int("my-web-page.html") to return? Should it return 23
>> or 794 or 109432985462940911485 or 42?
>
> I expected a unique number to be produced from the given string, so I could have a (number <=> string) relation. What does int(somestring) actually return? I don't have IDLE to test.

Just run python without any args, and you'll get interactive mode. You
can try things out there.
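
For reference, here is a small sketch of what int() does with a string like that. The helper name try_int is made up for illustration; the point is only that int() parses digits and raises ValueError on anything else, it does not derive a number from arbitrary text.

```python
def try_int(text):
    """Return int(text), or a description of the error if parsing fails.

    try_int is a hypothetical helper for illustration only.
    """
    try:
        return int(text)
    except ValueError as exc:
        # int() only accepts strings that look like integers.
        return f"ValueError: {exc}"


print(try_int("42"))                # a numeric string parses fine
print(try_int("my-web-page.html"))  # a file name does not
```

Typing int("my-web-page.html") directly at the interactive prompt shows the same ValueError as a traceback.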

> This counter.py will run in a shared hosting environment, so absolute paths are BIG, and are expected to look like this:
>
> /home/nikos/public_html/varsa.gr/articles/html/files/index.html

That's not big. Trust me, modern databases work just fine with unique
indexes like that. The most common way to organize the index is with a
tree (typically a B-tree), so the database only has to look through
about log(N) entries rather than all N.
That's like figuring out if the two numbers 142857 and 857142 are the
same; you don't need to look through 1,000,000 possibilities, you just
need to look through the six digits each number has.
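
To make the log(N) point concrete, here is a rough sketch using a plain binary search over a sorted Python list. This is only an analogy for a database index, not how any particular database implements one; the paths and the function name lookup are invented for the example.

```python
def lookup(sorted_keys, target):
    """Binary-search sorted_keys for target.

    Returns (found, probes), where probes counts how many keys
    were actually compared against the target.
    """
    lo, hi, probes = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_keys) and sorted_keys[lo] == target
    return found, probes


# Hypothetical data: 100,000 long-ish paths, already in sorted order
# because the page numbers are zero-padded.
keys = [f"/home/user/public_html/articles/page{i:06d}.html"
        for i in range(100_000)]

found, probes = lookup(keys, keys[73_214])
print(found, probes)  # probes stays in the teens, not in the thousands
```

The length of each key adds a little to the cost of each comparison, but the number of comparisons grows only with the logarithm of the row count.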

> 'pin' has to be a number because if I used the column 'page' instead, just imagine the database capacity needed to hold detailed information for each and every .html page requested by visitors!!!

Not that bad actually. I've happily used keys easily that long, and
expected the database to ensure uniqueness without costing
performance.

> So I really, really need to associate a (4-digit integer <=> HTML page's absolute path)

Is there any chance that you'll have more than 10,000 pages? If so, a
four-digit number is *guaranteed* to have duplicates. And if you
research the Birthday Paradox, you'll find that any sort of hashing
algorithm is likely to produce collisions a lot sooner than that.
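
A quick way to see the Birthday Paradox effect is to compute the odds directly. The sketch below assumes an idealized hash that spreads pages uniformly over the 10,000 possible four-digit values; real hashes will differ, and the function name is made up for the example.

```python
def collision_probability(pages, slots=10_000):
    """Chance that at least two of `pages` items share a slot.

    Assumes each item lands in one of `slots` values uniformly at
    random, which is an idealization of a real hash function.
    """
    if pages > slots:
        return 1.0  # pigeonhole: duplicates are guaranteed
    p_all_unique = 1.0
    for i in range(pages):
        p_all_unique *= (slots - i) / slots
    return 1.0 - p_all_unique


for n in (50, 118, 300, 1_000, 10_001):
    print(n, round(collision_probability(n), 3))
```

Under that assumption the odds of at least one collision reach roughly even at a little under 120 pages, far short of 10,000.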

> Maybe it can be done by creating a MySQL association between the two columns, but I don't know how such a thing can be done (if it can).
>
> So, that's why I need to get a "unique" number out of a string. Please help.

Ultimately, that unique number would end up being a foreign key into a
table of URLs and IDs. So just skip that table and use the URLs
directly - much easier. In this instance, there's no value in
normalizing.
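
As a rough sketch of keying the counter on the path itself, here is the idea using sqlite3 from the standard library as a stand-in for MySQL. The table name counters, the columns page and hits, and the function record_hit are all assumptions made for the example; the real schema may well look different.

```python
import sqlite3

# In-memory database, purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE counters ("
    "  page TEXT PRIMARY KEY,"            # the full path is the key
    "  hits INTEGER NOT NULL DEFAULT 0"
    ")"
)


def record_hit(conn, page):
    """Count one visit to `page` and return the new total."""
    # Create the row on first visit; do nothing if it already exists.
    conn.execute("INSERT OR IGNORE INTO counters (page) VALUES (?)", (page,))
    conn.execute("UPDATE counters SET hits = hits + 1 WHERE page = ?", (page,))
    row = conn.execute(
        "SELECT hits FROM counters WHERE page = ?", (page,)
    ).fetchone()
    return row[0]


path = "/home/nikos/public_html/varsa.gr/articles/html/files/index.html"
print(record_hit(conn, path))
print(record_hit(conn, path))
```

The primary key on page is what enforces uniqueness, so no separate numeric ID or hash is needed. MySQL's syntax for the insert-if-missing step differs from SQLite's, so treat the SQL here as illustrative only.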

ChrisA


