OT: MoinMoin and Mediawiki?

Paul Rubin <http://phr.cx@NOSPAM.invalid>
Wed Jan 12 14:53:30 EST 2005


Paul Rubin <http://phr.cx@NOSPAM.invalid> writes:
> > > How does it do that?  It has to scan every page in the entire wiki?!
> > > That's totally impractical for a large wiki.
> > 
> > So you want to say that c2 is not a large wiki? :-)
> 
> I don't know how big c2 is.  My idea of a large wiki is Wikipedia.
> My guess is that c2 is smaller than that.

I just looked at c2; it has about 30k pages (I'd call that
medium-sized) and finds incoming links pretty fast.  Is it using MoinMoin?
It doesn't look like other MoinMoin wikis that I know of.  I'd like to
think it's not finding those incoming links by scanning 30k separate
files in the file system.
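
One obvious way to avoid that scan is to keep a reverse index of
links and update it on every save.  A rough Python sketch, assuming
a [[PageName]]-style link syntax (the regex, names, and in-memory
storage here are just made up for illustration):

import re
from collections import defaultdict

# Assumed link syntax: [[PageName]] or [[PageName|label]].
WIKI_LINK = re.compile(r'\[\[([^]|]+)')

# target page -> set of pages that link to it
backlinks = defaultdict(set)

def update_links(page_name, old_text, new_text):
    # Called on every save: drop the page's old outgoing links,
    # then record its new ones.
    for target in WIKI_LINK.findall(old_text or ''):
        backlinks[target].discard(page_name)
    for target in WIKI_LINK.findall(new_text):
        backlinks[target].add(page_name)

def incoming_links(page_name):
    # Answering "what links here" is then a dict lookup, not a scan
    # of every page file.
    return sorted(backlinks[page_name])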

Sometimes I think a wiki could get by with just a few large files.
Have one file containing all the wiki pages.  When someone adds or
updates a page, append the page contents to the end of the big file;
that's also a good time to pre-render the page and append the
rendered version as well.  Take note of the byte position in the big
file (e.g. with ftell()) where the page starts, and remember that
location in an in-memory structure (a Python dict) indexed on the
page name.  Then append the same info to a second file, find the
location of that entry, and store it in the in-memory structure as
well.  If there was already a dict entry for that page, record a
link to the old offset in the 2nd file, so the previous revisions of
a page can be found by following the links backwards through the 2nd
file.  Finally, on restart, scan the 2nd file to rebuild the
in-memory structure.
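
Here's a rough Python sketch of that scheme, just to make the
bookkeeping concrete.  The file names, the pickle-based record
format for the 2nd file, and the class name are all made up for
illustration; a real implementation would also worry about locking,
fsync, and crash recovery:

import os, pickle

class PageStore:
    def __init__(self, datafile='pages.dat', journalfile='journal.dat'):
        # pages.dat is the one big file of page contents; journal.dat
        # is the 2nd file of (name, offset, length, prev) records.
        self.data = open(datafile, 'a+b')
        self.journal = open(journalfile, 'a+b')
        self.index = {}   # page name -> offset of its latest journal entry
        self._rebuild()

    def _rebuild(self):
        # On restart, scan the 2nd file to rebuild the in-memory dict.
        self.journal.seek(0)
        while True:
            pos = self.journal.tell()
            try:
                name, _, _, _ = pickle.load(self.journal)
            except EOFError:
                break
            self.index[name] = pos

    def save(self, name, text):
        # Append the page contents to the end of the big file and note
        # the byte position where it starts.
        raw = text.encode('utf-8')
        self.data.seek(0, os.SEEK_END)
        data_off = self.data.tell()
        self.data.write(raw)
        self.data.flush()
        # Append a record to the 2nd file that also points back at the
        # previous record for this page, so old revisions stay reachable.
        prev = self.index.get(name)
        self.journal.seek(0, os.SEEK_END)
        journal_off = self.journal.tell()
        pickle.dump((name, data_off, len(raw), prev), self.journal)
        self.journal.flush()
        self.index[name] = journal_off

    def _entry(self, offset):
        self.journal.seek(offset)
        return pickle.load(self.journal)

    def load(self, name):
        # Current revision: dict -> journal entry -> page contents.
        _, data_off, length, _ = self._entry(self.index[name])
        self.data.seek(data_off)
        return self.data.read(length).decode('utf-8')

    def history(self, name):
        # Walk the back-links through the 2nd file, newest first.
        revs, off = [], self.index.get(name)
        while off is not None:
            _, data_off, length, prev = self._entry(off)
            revs.append((data_off, length))
            off = prev
        return revs

One detail the description above glosses over: each record has to
carry the length of the page text too, or a reader can't tell where
one page ends and the next begins in the big file.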

With a Wikipedia-sized wiki, the in-memory structure will be a few
hundred MB and the 2nd file might be a few GB.  On current 64-bit
PCs, neither of these is a big deal.  The 1st file might be several
TB, which might not be so great; a better strategy is needed, left as
an exercise (various straightforward approaches suggest themselves).
Also, the active pages should be cached in RAM.  For a small wiki (up
to 1-2 GB) that's no big deal: just let the OS kernel do it, or use
some LRU scheme in the application.  For a large wiki, the cache and
possibly the page store might be spread across multiple servers using
some pseudo-RDMA scheme.
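
The application-level LRU cache doesn't need to be fancy; something
along these lines would do for rendered pages (the capacity figure
is arbitrary):

from collections import OrderedDict

class PageCache:
    # Minimal LRU cache of rendered pages, keyed on page name.
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.cache = OrderedDict()   # oldest entries first

    def get(self, name):
        if name not in self.cache:
            return None
        self.cache.move_to_end(name)        # mark as most recently used
        return self.cache[name]

    def put(self, name, rendered):
        self.cache[name] = rendered
        self.cache.move_to_end(name)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used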

I think the MediaWiki software is only barely able to support
Wikipedia right now; it's pushing its scaling limits.  Within a
year or two, if the limits can be removed, Wikipedia is likely to
reach at least 10 times its present size and 1000 times its traffic
volume.  So the question of how to implement big, high-traffic wikis
has been on my mind lately.  Specifically, I ask myself how Google
would do it.  I think it's quite feasible to write wiki software that
can handle this amount of load, but none of the current stuff can
really do it.


