website catcher
Diez B. Roggisch
deets at web.de
Sun Jul 3 09:47:42 EDT 2005
jwaixs wrote:
> If I put the parsed websites in, for example, a hash table, it will
> be at least 5 times faster than just putting them in a file that needs to
> be stored on a slow hard drive. Memory is a lot faster than hard disk
> space. And if a lot of people are asking for a page, all of
> them have to open that file. If that is 10 requests in 5 minutes,
> there's no real worry. If there are more than 10 requests per second, you
> really have a big problem and the framework would probably crash or
> run uber slow. That's why I want to open the file only once
> and keep it in the memory of the server, where it doesn't need to be
> opened each time someone is asking for it.
I don't think that's correct. Apache serves static pages at high
speed - and a "slow hard drive" means about 32MByte/s nowadays, which
equals 256MBit/s. Is your machine connected to a GBit network? And
if it's for internet usage, do you have a GBit connection? If so, I
envy you...
And if your speed has to be that high, I wonder if python can be used
at all. BTW, 10 requests per second of maybe 100KB pages is next to
nothing - just about 10MBit/s. It's not really fast. And images and the
like are also usually served from HD.
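
To make the arithmetic above easy to check, here is a small sketch. It assumes decimal units (1 MByte = 10**6 bytes, 1 KB = 1000 bytes); the helper name mbit_per_s is made up for illustration. The raw payload comes to 8 MBit/s, which is in line with the rough 10MBit figure once protocol overhead is counted.

```python
def mbit_per_s(bytes_per_s):
    """Convert bytes per second to megabits per second (1 MBit = 10**6 bits)."""
    return bytes_per_s * 8 / 1_000_000


# Disk: 32 MByte/s, the "slow hard drive" figure quoted above.
disk = mbit_per_s(32 * 1_000_000)

# Traffic: 10 requests per second at 100 KB per page (payload only).
traffic = mbit_per_s(10 * 100 * 1_000)

print(disk)     # 256.0
print(traffic)  # 8.0
```

So the disk can deliver roughly 30 times what that request rate needs.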
You are of course right that memory is faster than hard drives. But HDs
are (usually) faster than network IO - so that's your limiting factor,
if at all. And starting CGI subprocesses also introduces lots of
overhead - better use FastCGI then.
I think that we're talking about two things here:
- premature optimization on your side. Worry about speed later, if it
_is_ an issue. Not now.
- what you seem to want is a convenient way of having data served to
you in a pythonesque way. I personally don't see anything wrong with
storing and retrieving pages from HD - after all, that's where they end
up anyway eventually. So if you write yourself an HTMLRetrieval class
that abstracts that for you and
1) takes a piece of HTML and stores that, maybe associated with some
metadata
2) can retrieve these chunks based on some key
you are pretty much done. If you want, you can back it with an RDBMS,
hoping that it will do the in-memory-caching for you. But remember that
there will be no connection pooling using CGIs, so that introduces overhead.
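
A minimal sketch of what such an HTMLRetrieval class could look like when it simply stores to HD. Everything except the class name is an assumption made for illustration: the method names store and retrieve, the JSON file layout, hashing the key to get a file name, and the use of current Python 3.

```python
import hashlib
import json
import os
import tempfile


class HTMLRetrieval:
    """Store HTML chunks on disk, each under a caller-supplied key."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        # Hash the key so that any string becomes a safe file name.
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest + ".json")

    def store(self, key, html, metadata=None):
        # 1) take a piece of HTML and store it together with its metadata.
        record = {"html": html, "metadata": metadata or {}}
        with open(self._path(key), "w", encoding="utf-8") as f:
            json.dump(record, f)

    def retrieve(self, key):
        # 2) retrieve a chunk by key; unknown keys raise KeyError.
        try:
            with open(self._path(key), encoding="utf-8") as f:
                record = json.load(f)
        except FileNotFoundError:
            raise KeyError(key) from None
        return record["html"], record["metadata"]


# Usage: a temporary directory stands in for the real page store.
pages = HTMLRetrieval(tempfile.mkdtemp())
pages.store("index", "<h1>Hello</h1>", {"source": "example"})
html, meta = pages.retrieve("index")
```

Because callers only see store and retrieve, the file backend could later be swapped for a database or an in-memory dict without touching them.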
Or you go for your own standalone process that serves the pages
through some RPC mechanism.
Or you ditch CGIs altogether and use some web framework that serves from a
permanently running python process with several worker threads - then you
can use in-process memory, via global variables, to store that data. For
that, I recommend twisted.
Diez