Comparing caching strategies

avi.e.gross at gmail.com avi.e.gross at gmail.com
Fri Feb 17 00:07:12 EST 2023


I am less interested in the choice of names than the pro and con of when these Roaring bitmaps are worth using and when they are not.

It is a bit like discussing whether various compression techniques are worth using as the storage or memory costs can be weighed against the CPU or transient memory costs of compressing and uncompressing.  The value tends to depend on many factors and there may even be times you want to store the data in multiple data structures with each optimized to store that kind or amount of data.

Roaring bitmaps claim to be an improvement not only over uncompressed structures but some other compressed versions but my reading shows it may be limited to some uses. Bitsets in general seem to be useful only for a largely contiguous set of integers where each sequential bit represents whether the nth integer above the lowest is in the set or not. Of course, properly set up, this makes Unions and Intersections and some other operations fairly efficient. But sets are not the same as dictionaries and often you are storing other data types than smaller integers.

Many normal compression techniques can require lots of time to uncompress to find anything. My impression is that Roaring Bitmaps is a tad like the pkzip software that tries various compression techniques on each file and chooses whatever seems to work better on each one. That takes extra time when zipping, but when unzipping a file, it goes directly to the method used to compress it as the type is in a header and just expands it one way.

My impression is that Roaring bitmaps breaks up the range of integers into smaller chunks and depending on what is being stored in that chunk, may leave it as an uncompressed bitmap, or a list of the sparse contents, or other storage methods and can search each version fairly quickly. 

So, I have no doubt it works great for some applications such as treating social security numbers as integers. It likely would be overkill to store something like the components of an IP address between 0 and 255 inclusive.

But having said that, there may well be non-integer data that can be mapped into and out of integers. As an example, airports or radio stations have names like LAX or WPIX. If you limit yourself to ASCII letters then every one of them can be stored as a 32-bit integer, perhaps with some padding. Of course for such fairly simple data, some might choose to place the data in a balanced tree structure and get reasonable search speed.

I am curious about the size of some of these structures but obviously it depends. Are they stored on disk in this form too?


-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On Behalf Of MRAB
Sent: Thursday, February 16, 2023 11:24 PM
To: python-list at python.org
Subject: Re: Comparing caching strategies

On 2023-02-14 22:20, Rob Cliffe via Python-list wrote:
> On 11/02/2023 00:39, Dino wrote:
>> First off, a big shout out to Peter J. Holzer, who mentioned roaring 
>> bitmaps a few days ago and led me to quite a discovery.
>>
> I was intrigued to hear about roaring bitmaps and discover they really 
> were a thing (not a typo as I suspected at first).
> What next, I wonder?
>       argumentative arrays
>       chattering classes (on second thoughts, we have those already)
>       dancing dictionaries
>       garrulous generators
>       laughing lists
>       piping pipelines
>       singing strings
>       speaking sets
>       stuttering sorts
>       talking tuples
>       whistling walruses?

babbling bytestrings?

> The future awaits [pun not intended] ...

--
https://mail.python.org/mailman/listinfo/python-list



More information about the Python-list mailing list