Comparing caching strategies

Peter J. Holzer hjp-python at hjp.at
Fri Feb 17 13:47:02 EST 2023


On 2023-02-17 00:07:12 -0500, avi.e.gross at gmail.com wrote:
> Roaring bitmaps claim to be an improvement not only over uncompressed
> structures but some other compressed versions but my reading shows it
> may be limited to some uses. Bitsets in general seem to be useful only
> for a largely contiguous set of integers where each sequential bit
> represents whether the nth integer above the lowest is in the set or
> not.

They don't really have to be that contiguous. As long as your integers
fit into 32 bits you're fine.

> Of course, properly set up, this makes Unions and Intersections
> and some other operations fairly efficient. But sets are not the same
> as dictionaries and often you are storing other data types than
> smaller integers.

Of course. Different data types are useful for different applications.

> Many normal compression techniques can require lots of time to
> uncompress to find anything. My impression is that Roaring Bitmaps is
> a tad like the pkzip software that tries various compression
> techniques on each file and chooses whatever seems to work better on
> each one. That takes extra time when zipping, but when unzipping a
> file, it goes directly to the method used to compress it as the type
> is in a header and just expands it one way.

While not completely wrong, I don't think this comparison is very
helpful.


> My impression is that Roaring bitmaps breaks up the range of integers
> into smaller chunks and depending on what is being stored in that
> chunk, may leave it as an uncompressed bitmap, or a list of the sparse
> contents, or other storage methods and can search each version fairly
> quickly. 

It's been a few years since I looked at the implementation, but that's
the gist of it.


> So, I have no doubt it works great for some applications such as
> treating social security numbers as integers.

Not sure what you mean here. You mean storing sets of social security
numbers as roaring bitmaps? You might be able to do that (Last time I
looked RB's were limited to 32 bits, which may not be enough to
represent SSNs unmodified), but are set operations on SSNs something you
routinely do?

> It likely would be overkill to store something like the components of
> an IP address between 0 and 255 inclusive.
> 
> But having said that, there may well be non-integer data that can be
> mapped into and out of integers. As an example, airports or radio
> stations have names like LAX or WPIX. If you limit yourself to ASCII
> letters then every one of them can be stored as a 32-bit integer,
> perhaps with some padding.

Again: What would be the application?

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <https://mail.python.org/pipermail/python-list/attachments/20230217/c0b46d76/attachment.sig>


More information about the Python-list mailing list