Comparing caching strategies

avi.e.gross at gmail.com avi.e.gross at gmail.com
Fri Feb 17 18:08:09 EST 2023


Peter,

Analogies I am sharing are mainly for me to wrap my head around an idea by
seeing if it matches any existing ideas or templates and is not meant to be
exact. Fair enough?

But in this case, from my reading, the analogy is rather reasonable. The
implementation of Roaring Bitmaps seems to logically break up the space of
all possible values it looks up into multiple "zones" that are somewhat
analogous to individual files, albeit probably in memory as the program
runs. In the pkzip analogy, each file is processed and stored independently
alongside a header that provides enough detail to uniquely open up the
contents when needed. The collection of such compressed files is then one
bigger file that is saved that uses up less space. Roaring bitmaps seems to
determine how best to store each zone and only processes that zone when
requested, hence some of the speedup as each zone is stored in a manner that
generally allows fast access.

I did not raise the issue and thus have no interest in promoting this
technology nor knocking it down. Just wondering what it was under the hood
and whether I might even have a need for it. I am not saying Social Security
numbers are a fit, simply that some form of ID number might fit. If a
company has a unique ID number for each client and uses it consistently,
then an implementation that holds a set stored this way of people using
product A, such as house insurance, and those using product B, such as car
insurance, and perhaps product C is an umbrella policy, might easily handle
some queries such as who uses two or all three (intersections of sets) or
who might merit a letter telling them how much they can save if they
subscribed to two or all three as a way to get more business. Again, just  a
made-up example I can think about. A company which has a million customers
over the years will have fairly large sets as described. 

What is helpful to me in thinking about something will naturally often not
be helpful to you or others but nothing you wrote makes me feel my first
take was in any serious way wrong. It still makes sense to me.

And FYI, the largest integer in signed 32 bits is 2_147_483_647 which is 10
digits. A Social Security number look like xxx-xx-xxxx at this time which is
only 9 digits.  Not that it matters, but it seems it still qualifies to be
used as I describe, as long as Roaring bitmaps allows it, minus any spaces
or hyphens and converted to an integer.

As for my other EXAMPLE, I fail to see why I need to provide a specific need
for an application. I don't care what they need it for. The thought was
about whether something that does not start as an integer can be uniquely
mapped into and out of integers of size 32 bits. So I considered a few
examples of short textual items such as three letter airport abbreviations.
But if you cannot imagine an application, consider one similar enough to the
above.  I think there are currently over 47,000 such airports  in the world
and apparently about 20,000 in the US. That seems to exceed the possible
combinations of 26 letters (cubed) so it seems there are 4-letter codes too
such as ZSPD. It should still fit into 4 bytes, for now.

So assume you have a variety of facts such as which airports have
handicapped accessible bathrooms, or have an extra long/strong runway that
can handle huge planes or anything else that is worth knowing. You might
have bitmaps (as is being discussed) that may be sparse for some such info
and fairly dense for other info like having jet fuel available. As above,
finding an airport that has various mixtures may be doable with these sets
and perhaps faster than say queries on a relational database storing the
same info.

I will end by saying this is a hypothetical discussion for me. I am not
doing any projects now where I expect to use Roaring bitmaps but am now
aware of them should any need or opportunity arise. My mind is very full
with such trivia and very little is needed albeit I never know what may come
in as useful. 

Respectfully,

Avi

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On
Behalf Of Peter J. Holzer
Sent: Friday, February 17, 2023 1:47 PM
To: python-list at python.org
Subject: Re: Comparing caching strategies

On 2023-02-17 00:07:12 -0500, avi.e.gross at gmail.com wrote:
> Roaring bitmaps claim to be an improvement not only over uncompressed 
> structures but some other compressed versions but my reading shows it 
> may be limited to some uses. Bitsets in general seem to be useful only 
> for a largely contiguous set of integers where each sequential bit 
> represents whether the nth integer above the lowest is in the set or 
> not.

They don't really have to be that contiguous. As long as your integers fit
into 32 bits you're fine.

> Of course, properly set up, this makes Unions and Intersections and 
> some other operations fairly efficient. But sets are not the same as 
> dictionaries and often you are storing other data types than smaller 
> integers.

Of course. Different data types are useful for different applications.

> Many normal compression techniques can require lots of time to 
> uncompress to find anything. My impression is that Roaring Bitmaps is 
> a tad like the pkzip software that tries various compression 
> techniques on each file and chooses whatever seems to work better on 
> each one. That takes extra time when zipping, but when unzipping a 
> file, it goes directly to the method used to compress it as the type 
> is in a header and just expands it one way.



> My impression is that Roaring bitmaps breaks up the range of integers 
> into smaller chunks and depending on what is being stored in that 
> chunk, may leave it as an uncompressed bitmap, or a list of the sparse 
> contents, or other storage methods and can search each version fairly 
> quickly.

It's been a few years since I looked at the implementation, but that's the
gist of it.


> So, I have no doubt it works great for some applications such as 
> treating social security numbers as integers.

Not sure what you mean here. You mean storing sets of social security
numbers as roaring bitmaps? You might be able to do that (Last time I looked
RB's were limited to 32 bits, which may not be enough to represent SSNs
unmodified), but are set operations on SSNs something you routinely do?

> It likely would be overkill to store something like the components of 
> an IP address between 0 and 255 inclusive.
> 
> But having said that, there may well be non-integer data that can be 
> mapped into and out of integers. As an example, airports or radio 
> stations have names like LAX or WPIX. If you limit yourself to ASCII 
> letters then every one of them can be stored as a 32-bit integer, 
> perhaps with some padding.

Again: What would be the application?

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"



More information about the Python-list mailing list