[pypy-dev] vmprof compression

Leonardo Santagada santagada at gmail.com
Sat Mar 28 02:15:42 CET 2015


snappy and lz4 are good algos to try too.

On Fri, Mar 27, 2015 at 8:55 AM, Maciej Fijalkowski <fijall at gmail.com>
wrote:

> yeah I think putting gzip or something in the loop is A LOT easier :-)
>
> On Thu, Mar 26, 2015 at 6:29 PM, John Camara <john.m.camara at gmail.com>
> wrote:
> > Hi Fijal,
> >
> > To recap and continue the discussion from irc.
> >
> > We already discussed that the stack ids are based on a counter, which
> > is good, but I also want to confirm that the ids have locality with
> > respect to the code.  That is, similar areas of the code should have
> > similar ids.  Just make sure they are not random with respect to the
> > code, otherwise compression will not be helpful.  If the ids are
> > random, that would need to be corrected first.
> >
> > Right now the stack traces are written to the file repeating the
> > following sequence:
> >
> > MARKER_STACKTRACE
> > count
> > depth
> > stack
> > stack
> > ...
> > stack
> >
> > In order to get a high compression ratio it would be better to
> > combine multiple stack traces and rearrange the data as follows:
> >
> > MARKER_COMPRESSED_STACKTRACES
> > counts_compressed_length
> > counts_compressed
> > depths_compressed_length
> > depths_compressed
> > stacks_compressed_length
> > stacks_compressed
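> >
> > As a rough sketch of writing one such block (the marker value and the
> > helper name are made up for illustration, not vmprof's actual format):
> >
> >     import struct
> >     import zlib
> >
> >     MARKER_COMPRESSED_STACKTRACES = b'\x07'  # placeholder marker byte
> >
> >     def write_compressed_block(f, counts_raw, depths_raw, stacks_raw):
> >         # each argument is the raw bytes of one stream; compressing
> >         # the three streams separately keeps like data together
> >         f.write(MARKER_COMPRESSED_STACKTRACES)
> >         for raw in (counts_raw, depths_raw, stacks_raw):
> >             payload = zlib.compress(raw)
> >             f.write(struct.pack('<I', len(payload)))  # *_compressed_length
> >             f.write(payload)                          # *_compressed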
> >
> > In order to build the compressed data you will want 3 pairs of
> > buffers: a pair each for counts, depths, and stacks.  Your profiler
> > would be writing to one set of buffers while another thread is
> > responsible for compressing buffers that are full and writing them to
> > the file.  Once a set of buffers is full the profiler would start
> > filling up the other set.
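> >
> > A minimal sketch of that hand-off (the queues and sentinel are just
> > one way to do it; write_compressed_block is the helper from above,
> > and 'prof.dat' is a made-up filename):
> >
> >     import queue
> >     import threading
> >
> >     empty = queue.Queue()  # buffer sets the profiler may fill
> >     full = queue.Queue()   # buffer sets ready to compress
> >     for _ in range(2):     # one set filling while one is compressed
> >         empty.put((bytearray(), bytearray(), bytearray()))
> >
> >     def compressor(f):
> >         while True:
> >             bufs = full.get()
> >             if bufs is None:       # shutdown sentinel
> >                 return
> >             write_compressed_block(f, *bufs)
> >             for b in bufs:
> >                 del b[:]           # reset for reuse
> >             empty.put(bufs)
> >
> >     threading.Thread(target=compressor,
> >                      args=(open('prof.dat', 'wb'),)).start()
> >
> >     # profiler side, when the current set fills up:
> >     #     full.put(current); current = empty.get()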
> >
> > For each set of buffers you need a variable to hold the previous
> > count, depth, and stack id.  They will be initialized to 0 before any
> > data is written to an empty buffer.  Instead of writing the actual
> > count value into the counts buffer you will write the difference
> > between the current count and the previous count.  The reason for
> > doing this is that the delta values will mostly be around 0, which
> > will significantly improve the compression ratio without adding much
> > overhead.  Of course you would do the same for depths and stack ids.
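> >
> > In Python the delta step is just (a sketch, not vmprof code):
> >
> >     def deltas(values):
> >         # replace each value with its difference from the previous
> >         # one; with good locality most deltas cluster around zero
> >         prev = 0
> >         out = []
> >         for v in values:
> >             out.append(v - prev)
> >             prev = v
> >         return out
> >
> >     # e.g. stack ids 1000, 1001, 1001, 1003 -> 1000, 1, 0, 2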
> >
> > When you compress the data you compress each buffer individually to
> > make sure like data is being compressed together.  Like data
> > compresses better than unlike data, and by saving deltas very few
> > bits will be required to represent the data, so you are likely to
> > have long strings of 0s and 1s.
> >
> > I'm sure now you can see why I don't want the stack ids to be random.
> > If they are random then the deltas will be all over the place, so you
> > won't end up with long strings of 0s and 1s, and random data itself
> > does not compress.
> >
> > To test this out I wouldn't bother modifying the C code but instead
> > try it out in Python, to first make sure the compression is providing
> > huge gains and to figure out how to tune the algorithm, without
> > having to mess with the signal handlers, write the code for the
> > separate thread, and deal with issues such as making sure you don't
> > start writing to a buffer before the thread has finished writing its
> > data to the file, etc.  I would just read an existing profile file
> > and rewrite it to a different file, rearranging the data and
> > compressing the deltas as I described.  You can get away with one set
> > of buffers since you wouldn't be profiling at the same time.
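> >
> > Something along these lines (parse_stacktraces is a hypothetical
> > reader for the existing format; deltas and write_compressed_block are
> > the sketches from above):
> >
> >     from array import array
> >
> >     def pack(values):
> >         return array('i', values).tobytes()  # 'q' for 64-bit ids
> >
> >     def recompress(src_path, dst_path, block_size=8000):
> >         counts, depths, stacks = [], [], []
> >         with open(dst_path, 'wb') as out:
> >             for count, depth, stack in parse_stacktraces(src_path):
> >                 counts.append(count)
> >                 depths.append(depth)
> >                 stacks.extend(stack)
> >                 if len(counts) >= block_size:
> >                     write_compressed_block(out, pack(deltas(counts)),
> >                                            pack(deltas(depths)),
> >                                            pack(deltas(stacks)))
> >                     counts, depths, stacks = [], [], []
> >             if counts:  # flush the final partial block
> >                 write_compressed_block(out, pack(deltas(counts)),
> >                                        pack(deltas(depths)),
> >                                        pack(deltas(stacks)))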
> >
> > To tune this process you will need to determine the appropriate
> > number of stack traces that is small enough to keep memory down but
> > large enough that the overhead associated with compression is small.
> > Maybe start off with about 8000 stack traces.  I would try gzip, bz2,
> > and lzma and look at their compression ratios and times.  Gzip is
> > generally faster than bz2, and lzma is the slowest.  On the other
> > hand lzma provides the best compression and gzip the worst.  Since
> > you will be compressing deltas you can most likely get away with
> > using the fastest compression options under each compressor without
> > affecting the compression ratio.  But I would test to verify this, as
> > whether it holds depends on the data being compressed.  Also, one
> > option that is available in lzma is the ability to set the width of
> > the data to look at when looking for patterns.  Since you are saving
> > 32 or 64 bit ints depending on the platform, you can set the option
> > to either 4 or 8 bytes based on the platform.  I don't believe gzip
> > or bz2 have this option.  By setting this option in lzma you will
> > likely improve the compression ratio.
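> >
> > With Python's standard modules the comparison could look like this
> > (zlib stands in for gzip here, and the toy data is just for
> > illustration):
> >
> >     import bz2, lzma, time, zlib
> >     from array import array
> >
> >     data = array('i', [1000, 1, 0, 2] * 2000).tobytes()  # toy deltas
> >
> >     for name, compress in [
> >             ('gzip', lambda d: zlib.compress(d, 1)),
> >             ('bz2',  lambda d: bz2.compress(d, 1)),
> >             ('lzma', lambda d: lzma.compress(d, preset=0))]:
> >         t0 = time.time()
> >         out = compress(data)
> >         print(name, len(data) // len(out), time.time() - t0)
> >
> > The width option I mean is (I believe) xz's delta filter, which in
> > Python's lzma module is expressed as a filter chain:
> >
> >     filters = [
> >         {'id': lzma.FILTER_DELTA, 'dist': 4},    # 4 or 8 byte ints
> >         {'id': lzma.FILTER_LZMA2, 'preset': 0},  # chain ends in LZMA2
> >     ]
> >     out = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)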
> >
> > You may find that counts and depths give similar compression across
> > the 3 compression types, in which case just use the fastest, which
> > will likely be gzip.  On the other hand maybe the stack ids will be
> > better off using lzma.  This is another reason to separate out like
> > data, as it gives you the option to use the fastest compressor for
> > some data types while using others that provide better compression.
> >
> > I would not be surprised if this approach achieves a compression
> > ratio better than 100x, but that will be heavily dependent on how
> > local the stack ids are.  Also don't forget about simple things like
> > not using 64 bit ints when you can get away with smaller ones.
> >
> > Also, a slight variation on the above: if you find most of your
> > deltas are < 127 you could write them out as 1 byte, and when greater
> > than 127 write them out as a 4 byte int with the high bit set.  If
> > you do this then don't set the lzma option to 4 or 8 byte boundaries,
> > as now your data is a mixture of 1 and 4 byte values.  This sometimes
> > provides huge reductions in compression time without much effect on
> > the overall compression ratio.
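> >
> > A sketch of that encoding (negative deltas would need a zigzag step
> > or similar first, which I'm glossing over here):
> >
> >     import struct
> >
> >     def encode_value(v, out):
> >         # out is a bytearray; values below 128 take one byte, anything
> >         # larger takes 4 bytes with the high bit of the first byte set
> >         if 0 <= v < 0x80:
> >             out.append(v)
> >         else:
> >             out += struct.pack('>I', (v & 0x7fffffff) | 0x80000000)
> >
> >     def decode_values(buf):
> >         out, i = [], 0
> >         while i < len(buf):
> >             if buf[i] & 0x80:   # high bit set -> 4 byte value
> >                 out.append(struct.unpack('>I', bytes(buf[i:i+4]))[0]
> >                            & 0x7fffffff)
> >                 i += 4
> >             else:               # high bit clear -> 1 byte value
> >                 out.append(buf[i])
> >                 i += 1
> >         return out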
> >
> > Hope you find this helpful.
> >
> > John
> >



-- 

Leonardo Santagada