Dealing with marketing types...

Andrew Dalke dalke at dalkescientific.com
Sun Jun 12 17:52:48 EDT 2005


Paul Rubin wrote:
> Andrew Dalke <dalke at dalkescientific.com> writes:
  ...
>> I found more details at
>> http://jeremy.zawodny.com/blog/archives/001866.html
>> 
>> It's a bunch of things - Perl, C, MySQL-InnoDB, MyISAM, Akamai,
>> memcached.  The linked slides say "lots of MySQL usage." 60 servers.
> 
> LM uses MySQL extensively but what I don't know is whether it serves
> up individual pages by the obvious bunch of queries like a smaller BBS
> might.  I have the impression that it's more carefully tuned than that.

The page I linked to points to a PDF describing the architecture.
The careful tuning comes in part from a high-performance caching
system, memcached.
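
For those who haven't seen it, the caching idea is pretty simple.
Here's a rough sketch of how a page handler might use memcached from
Python.  I'm using the python-memcache client module; the key scheme
and the build_journal_page() function are made up for illustration,
and real code would also worry about invalidation:

  import memcache

  # Talk to one or more memcached daemons; LJ runs a farm of them.
  mc = memcache.Client(['127.0.0.1:11211'])

  def build_journal_page(user):
      # Placeholder for the expensive part: the MySQL queries and
      # template rendering that produce the final HTML.
      return '<html>... journal for %s ...</html>' % user

  def get_journal_page(user):
      key = 'journal_page:' + user        # hypothetical key scheme
      page = mc.get(key)
      if page is None:
          # Cache miss - build the page and keep it for 60 seconds.
          page = build_journal_page(user)
          mc.set(key, page, time=60)
      return page

Most requests become a single cache hit instead of a pile of SQL
queries, which is where much of the careful tuning comes from.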

>> I don't see that example as validating your statement that
>> LAMP doesn't scale for mega-numbers of hits any better than
>> whatever you might call "printing press" systems.
> 
> What example?  Slashdot?

LiveJournal.  You gave it as a counterexample to the LAMP
architecture used by /.

]  It seems to me that by using implementation methods that
] map more directly onto the hardware, a site with Slashdot's
] traffic levels could run on a single modest PC (maybe a laptop).
] I believe LiveJournal (which has something more like a million
] users) uses methods like that, as does ezboard. 

Since LJ uses a (highly hand-tuned) LAMP architecture, it isn't
an effective counterexample.

>  It uses way more hardware than it needs to,
> at least ten servers and I think a lot more.  If LJ is using 6x as
> many servers and taking 20x (?) as much traffic as Slashdot, then LJ
> is doing something more efficiently than Slashdot.  

I don't know where the 20x comes from.  Registered users?  I
read /. but haven't logged into it in 5+ years.  I know I
hit /. a lot more often than I do LJ (there's only one diary
I follow there).  The usage patterns differ as well: on /.
everyone hits the same story / comments pages, and the comments
are ranked by reader-assigned moderation.  LJ has no single
journal that gets anywhere near as many hits, and there is no
ranking scheme.

>> I'd say that few sites have >100k users, much less
>> daily users with personalized information. As a totally made-up
>> number, only a few dozen sites (maybe a couple hundred?) would
>> need to worry about those issues.
> 
> Yes, but for those of us interested in how big sites are put together,
> those are the types of sites we have to think about ;-).

My apologies since I know this sounds snide, but then why didn't
you (re)read the LJ architecture overview I linked to above?
It sounds like something you would have been interested in
reading, and it directly provides information that counters
what you said in your followup.

The "ibm-poop-heads" article by Ryan Tomayko gives pointers to 
several other large-scale LAMP-based web sites.  You didn't
like the Google one.  I checked a couple of the others:

  IMDB -
  http://www.findarticles.com/p/articles/mi_zdpcm/is_200408/ai_ziff130634
  As you might expect, the site is now co-located with other Amazon.com
  sites, served up from machines running Linux and Apache, but ironically,
  most of the IMDb does not use a traditional database back end. Its
  message boards are built on PostgreSQL, and certain parts of IMDb
  Pro - including its advanced search - use MySQL, but most of the site is
  built with good old Perl script.

  del.icio.us
  Took some digging but I found
  http://lists.del.icio.us/pipermail/discuss/2004-November/001421.html
  "The database gets corrupted because the machine gets power-cycled,
  not through any fault of MySQL's."

The point is that LAMP systems do scale, both down and up.  The
article is a polemic against "architecture astronauts" who believe
the only way to handle large sites (and /., LJ, IMDB, and
del.icio.us are larger than all but a few sites) is with some
spiffy "enterprise" architecture framework.

> I'd say
> there's more than a few hundred of them, but it's not like there's
> millions.  And some of them really can't afford to waste so much
> hardware--look at the constant Wikipedia fundraising pitches for more
> server iron because the Wikimedia software (PHP/MySQL, natch) can't
> handle the load.

Could they have, for example, bought EnterpriseWeb-O-Rama and done
any better or cheaper?  Could they even have started the project
had they gone that route?

> Yes, of course there is [experience in large-scale web apps]. 
> Look at the mainframe transaction systems of the 60's-70's-80's, for
> example. Look at Google.

For the mainframe apps you'll have to toss anything processed
in batch mode, like payrolls.  What had per-user customization
and scale comparable to today's 100K+ user sites?  ATMs?  Stock
trading?

Google is a one-off system.  At present there's no other system
I know of - especially one with that many users - where a single
user request can trigger searches from hundreds of machines.
That's all custom software.  Or should most servers implement
what is in essence a new distributed operating system just to
run a web site?
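
To give a feel for what that custom software does, here's a toy
scatter/gather sketch in Python: fan a query out to many index
servers in parallel and merge whatever comes back.  The shard host
names and query_shard() are invented stand-ins; the real thing also
has to handle timeouts, dead machines, ranking, snippets, and so on:

  import threading

  # Pretend these are hundreds of index servers, each holding one
  # shard of the document index.
  SHARDS = ['shard%03d.example.com' % i for i in range(200)]

  def query_shard(host, query, results, lock):
      # Stand-in for an RPC to one index server.  A real system
      # needs sockets/RPC plus timeouts and retries.
      hits = [(1.0, '%s: fake hit for %r' % (host, query))]
      lock.acquire()
      try:
          results.extend(hits)
      finally:
          lock.release()

  def search(query):
      results = []
      lock = threading.Lock()
      threads = [threading.Thread(target=query_shard,
                                  args=(host, query, results, lock))
                 for host in SHARDS]
      for t in threads:
          t.start()
      for t in threads:
          t.join()
      # Merge step: sort partial results by score, best first.
      results.sort()
      results.reverse()
      return results[:10]

Even this toy shows why it's custom work - the interesting parts
(partitioning the index, surviving failures, merging and ranking)
aren't things a stock LAMP install gives you.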

>  Then there's the tons of experience we all have with LAMP systems.  By
> putting some effort into seeing where the resources in those things go,
> I believe we can do a much better job.  In particular, those sites like
> Slashdot are really not update intensive in the normal database sense.
> They can be handled almost entirely with some serial log files plus some
> ram caching.  At that point almost all the SQL overhead and a lot of the
> context switching can go away.
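
If I follow you, that's roughly something like this toy sketch,
where updates go to an append-only log and all reads come out of a
RAM cache.  The file name and record format are made up; a real
site would also need to replay the log at startup and handle
moderation, rollover, and so on:

  import time

  COMMENTS = {}    # story id -> list of (user, text); the RAM cache
  LOGFILE = open('comments.log', 'a')    # append-only serial log

  def add_comment(story_id, user, text):
      # Append one record to the log, then update the cache.
      LOGFILE.write('%r\t%r\t%r\t%r\n' % (time.time(), story_id,
                                          user, text))
      LOGFILE.flush()
      COMMENTS.setdefault(story_id, []).append((user, text))

  def get_comments(story_id):
      # Reads never touch the disk, much less a database.
      return COMMENTS.get(story_id, [])

No SQL, and no context switches into a database server.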

Is /. an appropriate comparison?  My impression is that it hasn't
changed much in the last, say, 5 years, and the user base hasn't
grown much either.  What you propose requires programming effort.
If the system doesn't need work, if money in > money out (even with
expensive hardware), and if the extra work wouldn't yield much
benefit, then is it worthwhile for them to rearchitect the system?

Perhaps in a couple of years it'll run on two machines (one as the
backup), with no change to the code, simply because the hardware
is good enough and cheap enough.

				Andrew
				dalke at dalkescientific.com
