Most space-efficient way to store log entries

Martin A. Brown martin at linux-ip.net
Wed Oct 28 23:28:31 EDT 2015


Hello Marc,

I think you have gotten quite a few answers already, but I'll add my 
voice.

> I'm writing an application that saves historical state in a log 
> file.

If I were in your shoes, I'd probably use the logging module rather 
than saving state in my own log file.  That allows the application 
to send all historical state to the system log.  Then, it could be 
captured, recorded, analyzed and purged (or neglected) along with 
all of the other logging.
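
Here is a minimal sketch of that idea.  For demonstration it writes 
to an in-memory buffer via StreamHandler; in production you would 
attach logging.handlers.SysLogHandler instead, so records flow into 
the system log.  The logger name 'myapp' is just a placeholder.

```python
import io
import logging

# Demonstration target: an in-memory buffer.  Swap in
# logging.handlers.SysLogHandler(address='/dev/log') to reach the
# real system log on Linux.
buf = io.StringIO()

logger = logging.getLogger('myapp')   # placeholder application name
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter('%(levelname)s %(message)s'))
logger.addHandler(handler)

# Historical state goes through the logging machinery, not a
# hand-rolled file format.
logger.info('state checkpoint: counter=%d', 42)
print(buf.getvalue().strip())
```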

But, this may not be appropriate for your setup.  See also my final 
two questions at the bottom.

> I want to be really efficient in terms of used bytes.

It is good to want to be efficient.  But don't cost your (future) 
self, or some other poor schlub, working or computational efficiency 
down the road!

Somebody may one day want to extract utility out of the 
application's log data.  So, don't make that data too hard to read.

> What I'm doing now is:
> 
> 1) First use zlib.compress

... assuming you are going to write your own files, then, certainly.

If you also want better compression (quantified in a table below) at 
a higher CPU cost, try bz2 or lzma (Python3).  Note that there is 
not a symmetric CPU cost for compression and decompression.  
Usually, decompression is much cheaper.

  import zlib            # alternatives: import bz2, lzma

  # compress = bz2.compress
  # compress = lzma.compress
  compress = zlib.compress

To read the logging data, the programmer, application analyst or 
sysadmin will need to spend CPU to decompress it.  If that's rare, 
it's probably a good tradeoff.

Here's my small comparison matrix of the time it takes to transform 
a sample log file that was roughly 33MB (in memory, no I/O costs 
included in timing data).  The chart also shows the size of the 
compressed data, in bytes and percentage (to demonstrate compression 
efficiency).

   format                    bytes     pct  walltime
   raw                    34311602 100.00%  0.00000s
   base64-encode          46350762 135.09%  0.43066s
   zlib-compress           3585508  10.45%  0.54773s
   bz2-compress            2704835   7.88%  4.15996s
   lzma-compress           2243172   6.54% 15.89323s
   base64-decode          34311602 100.00%  0.18933s
   bz2-decompress         34311602 100.00%  0.62733s
   lzma-decompress        34311602 100.00%  0.22761s
   zlib-decompress        34311602 100.00%  0.07396s

The point of a sample matrix like this is to examine the tradeoff 
between storage and time (for both compression and decompression), 
and to think about how often you, your application or your users 
will decompress the historical data.  Also consider exactly how 
sensitive you are to bytes on disk.  (N.B. Data from a single run of 
the code.)
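
A matrix like that can be produced with a few lines of code.  Below 
is a sketch; the payload here is small, synthetic, repetitive text, 
so the ratios and timings will differ from the 33MB log file behind 
the table above.

```python
import bz2
import lzma
import time
import zlib

# Synthetic stand-in for a real log file; highly repetitive, so it
# compresses much better than typical log data would.
raw = b'2015-10-28 23:28:31 INFO something happened\n' * 50000

for name, fn in [('zlib', zlib.compress),
                 ('bz2', bz2.compress),
                 ('lzma', lzma.compress)]:
    start = time.perf_counter()
    packed = fn(raw)
    elapsed = time.perf_counter() - start
    print('%-6s %9d bytes %6.2f%% %8.5fs'
          % (name, len(packed), 100.0 * len(packed) / len(raw), elapsed))
```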

Finally, simply make a choice for one of the compression algorithms.

> 2) And then remove all new lines using binascii.b2a_base64, so I 
> have a log entry per line.

I'd also suggest that you resist the base64 temptation.  As others 
have pointed out, there's a benefit to keeping the logs compressed 
using one of the standard compression tools (zgrep, zcat, bzgrep, 
lzmagrep, xzgrep, etc.)

Also, see the statistics above for proof--base64 encoding is not 
compression.  Rather, it usually expands input data to the tune of 
one third (see above, the base64 encoded string is 135% of the raw 
input).

That's not compression.  So, don't do it.  In this case, it's 
expansion and obfuscation.  If you don't need it, don't choose it.

In short, base64 is actively preventing you from shrinking your 
storage requirement.
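
The expansion is easy to demonstrate: base64 emits 4 output bytes 
for every 3 input bytes, which is where the "roughly one third" 
comes from.  A small sketch:

```python
import base64
import zlib

# base64 is an encoding, not compression: 4 output bytes for every
# 3 input bytes, i.e. ~133% of the original size.
data = b'an example log line, repeated\n' * 1000
encoded = base64.b64encode(data)
compressed = zlib.compress(data)

# encoded is about a third larger than the input, while compressed
# is a small fraction of it.
print(len(data), len(encoded), len(compressed))
```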

> but b2a_base64 is far from ideal: adds lots of bytes to the 
> compressed log entry. So, I wonder if perhaps there is a better 
> way to remove new lines from the zlib output? or maybe a different 
> approach?

Suggestion:  Don't worry about the single-byte newline terminator.  
Look at a whole logfile and choose your best option.

Lastly, I have one other pair of questions for you to consider.

Question one:  Will your application later read or use the logging 
data?  If no, and it is intended only as a record for posterity, 
then I'd suggest sending that data to the system logs (see the 
'logging' module and talk to your operational people).

If yes, then question two is:  What about resilience?  Suppose your 
application crashes in the middle of writing a (compressed) logfile.  
What does it do?  Does it open the same file?  (My personal answer 
is always 'no.')  Does it open a new file?  When reading the older 
logfiles, how does it know where to resume?  Perhaps you can see my 
line of thinking.
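
My personal answer ('open a new file per run') can be sketched as 
follows.  Each start of the application creates a fresh gzip logfile 
named by timestamp, so a crash mid-write can never corrupt an 
earlier, already-closed archive.  tempfile.mkdtemp() stands in for a 
real log directory here.

```python
import gzip
import os
import tempfile
import time

# Placeholder log directory; a real application would use a
# configured path instead.
logdir = tempfile.mkdtemp()
path = os.path.join(logdir, 'app-%d.log.gz' % int(time.time()))

# A fresh, timestamped file for this run; never reopen an old one.
with gzip.open(path, 'wt', encoding='utf-8') as fp:
    fp.write('state checkpoint 1\n')
    fp.write('state checkpoint 2\n')

# gzip.open (or zcat/zgrep on the command line) reads it back:
with gzip.open(path, 'rt', encoding='utf-8') as fp:
    print(fp.read(), end='')
```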

Anyway, best of luck,

-Martin

P.S. The exact compression ratio is dependent on the input.  I have 
  rarely seen zlib at 10% or bz2 at 8%.  I conclude that my sample 
  log data must have been more homogeneous than the data on which I 
  derived my mental bookmarks for textual compression efficiencies 
  of around 15% for zlib and 12% for bz2.  I have no mental bookmark 
  for lzma yet, but 7% is an outrageously good compression ratio.

-- 
Martin A. Brown
http://linux-ip.net/


