Most space-efficient way to store log entries
Martin A. Brown
martin at linux-ip.net
Wed Oct 28 23:28:31 EDT 2015
Hello Marc,
I think you have gotten quite a few answers already, but I'll add my
voice.
> I'm writing an application that saves historical state in a log
> file.
If I were in your shoes, I'd probably use the logging module rather
than saving state in my own log file. That allows the application
to send all historical state to the system log. Then, it could be
captured, recorded, analyzed and purged (or neglected) along with
all of the other logging.
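For example, here is a minimal sketch of that approach (the logger
name 'myapp' and the /dev/log socket path are my assumptions, not
anything from your setup; it falls back to stderr where no local
syslog socket exists):

```python
import logging
import logging.handlers

logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)
try:
    # Typical local syslog socket on Linux; adjust for your platform.
    handler = logging.handlers.SysLogHandler(address="/dev/log")
except OSError:
    # No local syslog daemon available; fall back to stderr.
    handler = logging.StreamHandler()
logger.addHandler(handler)

logger.info("state: job=%s status=%s", "backfill", "complete")
```

Then rotation, retention and purging become your syslog daemon's
problem instead of your application's.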
But, this may not be appropriate for your setup. See also my final
two questions at the bottom.
> I want to be really efficient in terms of used bytes.
It is good to want to be efficient. Just don't cost your (future)
self, or some other poor schlub, working or computational
efficiency down the road!
Somebody may one day want to extract utility out of the
application's log data. So, don't make that data too hard to read.
> What I'm doing now is:
>
> 1) First use zlib.compress
... assuming you are going to write your own files, then, certainly.
If you also want better compression (quantified in a table below) at
a higher CPU cost, try bz2 or lzma (Python3). Note that there is
not a symmetric CPU cost for compression and decompression.
Usually, decompression is much cheaper.
import zlib
# import bz2    # better ratio, more CPU
# import lzma   # best ratio here, most CPU (Python 3)

# Pick exactly one of these:
# compress = bz2.compress
# compress = lzma.compress
compress = zlib.compress
To read the logging data, the programmer, application analyst or
sysadmin will need to spend CPU to uncompress it. If that is rare,
it's probably a good tradeoff.
Here's my small comparison matrix of the time it takes to transform
a sample log file that was roughly 33MB (in memory, no I/O costs
included in timing data). The chart also shows the size of the
compressed data, in bytes and percentage (to demonstrate compression
efficiency).
  format            bytes      pct     walltime
  raw            34311602  100.00%     0.00000s
  base64-encode  46350762  135.09%     0.43066s
  zlib-compress   3585508   10.45%     0.54773s
  bz2-compress    2704835    7.88%     4.15996s
  lzma-compress   2243172    6.54%    15.89323s
  base64-decode  34311602  100.00%     0.18933s
  bz2-decompress 34311602  100.00%     0.62733s
  lzma-decompress 34311602 100.00%     0.22761s
  zlib-decompress 34311602 100.00%     0.07396s
The point of a sample matrix like this is to examine the tradeoff
between compression ratio and time (for both compression and
decompression), and to think about how often you, your application
or your users will decompress the historical data. Also consider
exactly how sensitive you are to bytes on disk. (N.B. Data are
from a single run of the code.)
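If you want to produce a matrix like this for your own data, the
sketch below shows the basic method (timings in memory, no I/O; the
sample data here is generated, not my 33MB log file):

```python
import bz2
import lzma
import time
import zlib

# Repetitive, log-like sample data generated in memory.
raw = b"2015-10-28 23:28:31 INFO worker=3 state=running queue=17\n" * 20000

results = {}
for name, compress in [("zlib", zlib.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)]:
    start = time.perf_counter()
    blob = compress(raw)
    elapsed = time.perf_counter() - start
    results[name] = (len(blob), len(blob) / len(raw), elapsed)

for name, (nbytes, ratio, secs) in sorted(results.items()):
    print("%-5s %9d bytes %7.2f%% %9.5fs" % (name, nbytes, ratio * 100, secs))
```

Run it against a slice of your real log data before you decide;
ratios on generated data like this will be flatteringly good.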
Finally, simply make a choice for one of the compression algorithms.
> 2) And then remove all new lines using binascii.b2a_base64, so I
> have a log entry per line.
I'd also suggest that you resist the base64 temptation. As others
have pointed out, there's a benefit to keeping the logs compressed
using one of the standard compression tools (zgrep, zcat, bzgrep,
lzmagrep, xzgrep, etc.).
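For instance, writing through the gzip module produces files that
zcat and zgrep can read directly (the filename and log line below
are invented for illustration):

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "events.log.gz")

# Write log lines through gzip; zcat/zgrep can read the result directly.
with gzip.open(path, "wt", encoding="utf-8") as fh:
    fh.write("2015-10-28 23:28:31 job=backfill status=complete\n")

# Reading back from Python is just as easy.
with gzip.open(path, "rt", encoding="utf-8") as fh:
    line = fh.readline()
```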
Also, see the statistics above for proof--base64 encoding is not
compression. Rather, it usually expands input data by about one
third (see above: the base64-encoded output is 135% of the size of
the raw input).
That's not compression. So, don't do it. In this case, it's
expansion and obfuscation. If you don't need it, don't choose it.
In short, base64 is actively preventing you from shrinking your
storage requirement.
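You can verify the expansion yourself with a quick check (sample
data invented for illustration):

```python
import base64
import zlib

raw = "\n".join("line %d of the log" % i for i in range(1000)).encode()
compressed = zlib.compress(raw)
encoded = base64.b64encode(compressed)

# base64 maps every 3 input bytes to 4 ASCII characters: ~33% expansion.
expansion = len(encoded) / len(compressed)
print("compressed=%d encoded=%d expansion=%.2f"
      % (len(compressed), len(encoded), expansion))
```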
> but b2a_base64 is far from ideal: adds lots of bytes to the
> compressed log entry. So, I wonder if perhaps there is a better
> way to remove new lines from the zlib output? or maybe a different
> approach?
Suggestion: Don't worry about the single-byte newline terminator.
Look at a whole logfile and choose your best option.
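If you really do need one compressed record at a time (rather than
one compressed file), a length prefix avoids base64 entirely. This
is only a sketch of one possible framing of my own invention, not a
standard format:

```python
import io
import struct
import zlib

def write_record(fh, payload):
    """Append one zlib-compressed record with a 4-byte big-endian length prefix."""
    blob = zlib.compress(payload)
    fh.write(struct.pack(">I", len(blob)))
    fh.write(blob)

def read_records(fh):
    """Yield decompressed record payloads until end of file."""
    while True:
        prefix = fh.read(4)
        if len(prefix) < 4:
            return
        (length,) = struct.unpack(">I", prefix)
        yield zlib.decompress(fh.read(length))

buf = io.BytesIO()
write_record(buf, b"first entry")
write_record(buf, b"second entry")
buf.seek(0)
entries = list(read_records(buf))
```

Note, though, that you lose the ability to grep the file with the
standard command-line tools, which is the main argument above.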
Lastly, I have one other pair of questions for you to consider.
Question one: Will your application later read or use the logging
data? If no, and it is intended only as a record for posterity,
then, I'd suggest sending that data to the system logs (see the
'logging' module and talk to your operational people).
If yes, then question two is: What about resilience? Suppose your
application crashes in the middle of writing a (compressed) logfile.
What does it do? Does it open the same file? (My personal answer
is always 'no.') Does it open a new file? When reading the older
logfiles, how does it know where to resume? Perhaps you can see my
line of thinking.
Anyway, best of luck,
-Martin
P.S. The exact compression ratio is dependent on the input. I have
rarely seen zlib at 10% or bz2 at 8%. I conclude that my sample
log data must have been more homogeneous than the data on which I
derived my mental bookmarks for textual compression efficiencies
of around 15% for zlib and 12% for bz2. I have no mental bookmark
for lzma yet, but 7% is an outrageously good compression ratio.
--
Martin A. Brown
http://linux-ip.net/