Special logging module needed

Laszlo Nagy gandalf at shopzeus.com
Wed Mar 23 10:37:04 EDT 2011


     Hi All,

We have a special problem that needs to be addressed. We need to have a 
database of log messages. These messages are not simple strings - they 
are data structures. The log messages are assigned to various objects. 
Objects can be identified with simple object identifiers (strings). We 
have about 10 million objects. Most of them only have one or two log 
messages assigned. Some of them (about 1M objects) will have 10 log 
messages appended each day. Fewer objects (100K) will have log messages 
appended even more frequently.

Requirements:

    * The database is mostly write-only. Most of the time, we will be
      adding messages continuously, and we won't read them back.
    * Appending new messages should be fast. At least 100 messages / second.
    * High availability - clients must be able to add new log messages
      and read back old messages anytime. The service must not stop.
    * Database is huge. We want to preserve data for one or two years
      and possibly it will grow to 1TB or something.
    * We should be able to drop older data (e.g. log messages appended
      more than 2 years ago) without trouble.
    * We need to implement incremental backups. For example, we could
      have a separate database file for every day in the current month,
      then database files for previous months etc. The main point is
      that we should be able to make backup copies incrementally, and
      the logging service should be available at the same time. (It is
      not crucial to have an up-to-date backup from the current day, but
      we want to back up everything that happened more than a day ago).
    * Reading back log messages will happen rarely. The average would be
      about 5 reads per minute, usually only for a given date range
      (for example, all messages from May 2010). But then it should be
      relatively fast. Even with a 1TB database, we should be able to
      fulfill such a request within one second.
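
As a rough sanity check on the numbers above, here is a small
back-of-envelope calculation. It counts only the 1M busy objects at 10
messages per day, assumes the load is spread evenly over the day, takes
2 years as 730 days and 1TB as 10**12 bytes. All variable names are just
for illustration, and the more frequently updated 100K objects are left
out because I have no exact rate for them.

```python
# Back-of-envelope estimate; every figure is an assumption taken from
# the description above, not a measurement.
SECONDS_PER_DAY = 24 * 60 * 60

busy_objects = 1_000_000            # objects that get about 10 messages a day
messages_per_object_per_day = 10

messages_per_day = busy_objects * messages_per_object_per_day
average_rate = messages_per_day / SECONDS_PER_DAY   # messages per second

retention_days = 2 * 365            # keep data for two years
total_messages = messages_per_day * retention_days

# Average space available per message if the whole store is 1TB.
bytes_per_message = 10**12 / total_messages

print("messages per day:   ", messages_per_day)
print("average msgs/second:", round(average_rate, 1))
print("messages in 2 years:", total_messages)
print("bytes per message:  ", round(bytes_per_message))
```

So the "at least 100 messages / second" figure is about the average
load from the busy objects alone; peaks will be higher.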

What we were doing until now is that we used a directory
structure. Every object had a separate CSV file assigned, and new
messages were appended to those CSV files. We thought that appending
new messages would be fast (fopen + fwrite should be fast) and reading
them back should also be fast. But now we are suffering from these problems:

    * We have millions of files on disk. It is very hard to synchronize
      them or make incremental backup copies. It is just too slow.
    * Sometimes the OS cannot handle requests fast enough. There are too
      many (at least 5 million) files on the file system and they are
      very fragmented. In some cases it takes 10 seconds or more to open
      a log file for appending. CPU usage apparently goes up to 100% when
      the OS tries to open the log file. (dirhash mem already increased
      to 1GB but it's still not good enough.)
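
To make the current scheme concrete, it is roughly equivalent to the
sketch below. This is a simplified illustration, not our real code: the
function names, the flat directory layout and the two-column message
format are hypothetical.

```python
import csv
import os
import tempfile


def append_message(root, object_id, fields):
    """Append one message (a list of fields) to the object's own CSV file."""
    path = os.path.join(root, object_id + ".csv")
    # One open and one write per message. With millions of files in the
    # tree, the open is the step that becomes slow.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(fields)
    return path


def read_messages(root, object_id):
    """Read back all messages of one object as lists of strings."""
    path = os.path.join(root, object_id + ".csv")
    with open(path, newline="") as f:
        return list(csv.reader(f))


# Example usage in a throwaway directory.
root = tempfile.mkdtemp()
append_message(root, "obj-1", ["2011-03-22", "created"])
append_message(root, "obj-1", ["2011-03-23", "updated"])
print(read_messages(root, "obj-1"))
```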

I was thinking about using a regular SQL database, but it has problems:

    * Not easy to do incremental backups
    * Not that easy to remove old data (e.g. cannot delete 100GB data
      with SQL without big slowdown)
    * With a btree index and some 100 million rows already in a table,
      appending messages may not be fast enough (because of heavy
      reindexing). I may be wrong about this.

I was also thinking about storing data in a gdbm database. One file for 
each month storing at most 100 log messages for every key value. Then 
one file for each day in the current month, storing one message for each 
key value. Incremental backup would be easy, and reading back old 
messages would be fast enough (just need to do a few hash lookups). 
However, implementing a high availability service around this is not 
that easy.
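
A minimal sketch of the per-day file idea, under several assumptions:
it uses the generic `dbm` module from the Python 3 standard library as a
stand-in for gdbm, stores a JSON list of messages under each object
identifier, and names each file after its date. The helper names
(`day_path`, `append_message`, `read_range`) are made up for this
example. It is single-process, has no locking, and does nothing about
high availability.

```python
import datetime
import dbm
import json
import os
import tempfile


def day_path(root, day):
    """Database file name for one calendar day (hypothetical layout)."""
    return os.path.join(root, day.strftime("%Y-%m-%d"))


def append_message(root, day, object_id, message):
    """Append one message to the list stored under object_id for that day."""
    key = object_id.encode("utf-8")
    with dbm.open(day_path(root, day), "c") as db:   # "c": create if missing
        messages = json.loads(db[key]) if key in db else []
        messages.append(message)
        db[key] = json.dumps(messages).encode("utf-8")


def read_range(root, object_id, first_day, last_day):
    """Collect the messages of one object for an inclusive date range."""
    key = object_id.encode("utf-8")
    result = []
    day = first_day
    while day <= last_day:
        try:
            with dbm.open(day_path(root, day), "r") as db:
                if key in db:
                    result.extend(json.loads(db[key]))
        except dbm.error:
            # Most likely no file for that day; a real service would
            # have to tell this apart from other dbm failures.
            pass
        day += datetime.timedelta(days=1)
    return result


# Example usage in a throwaway directory.
root = tempfile.mkdtemp()
d1 = datetime.date(2011, 3, 21)
d3 = datetime.date(2011, 3, 23)
append_message(root, d1, "obj-1", {"event": "created"})
append_message(root, d3, "obj-1", {"event": "updated"})
print(read_range(root, "obj-1", d1, d3))
```

With a layout like this, a file for a finished day never changes again,
so it can be copied for backup or deleted when it is too old without
touching the files that are still being written.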

I'm just wondering if there is an open source solution for my problem 
that I don't know of.

Thanks,

    Laszlo


