Comparing caching strategies

Sat Feb 18 14:59:37 EST 2023

David,

This conversation strikes me as getting antagonistic and as such, I will not
continue it here after this message.

I can nitpick at least as well as you but have no interest. It is also
wandering away from the original point.

The analogy I gave remains VALID no matter if you do not accept it as being
precise. Analogies are not equalities. I did not say this does at all what
programs like pkzip do in entirety or even anything similarly. I simply said
that pkzip (as it no doubt evolves) has a series of compression methods to
choose from and each has a corresponding method to undo and get back the
original. As it happens, you can write any number of algorithms that
determine which method to use and get back the original. It depend on many
relative costs including not just the size of the compressed object but how
often it will be accessed and how much effort each takes. In some cases you
will compress everything once and extract it many times and in many places,
so it may be worth the effort to try various compression techniques and
measure size and even the effort to extract from that and decide which to
use.

I do not know the internals of any Roaring Bitmap implementation so all I
did gather was that once the problem is broken into accessing individual
things I chose to call zones for want of a more specific name, then each
zone is stored in one of an unknown number of ways depending on some logic.
You say an implementation chose two ways and that is fine. But in theory, it
may make sense to choose other ways as well and the basic outline of the
algorithm remains similar. I can imagine if a region/zone is all the even
numbers, then a function that checks if you are searching for an odd number
may be cheaper. That is an example, not something I expect to see or that is
useful enough. But the concept of storing a function as the determiner for a
region is general enough that it can support many more realistic ideas.

>From what you wrote, the algorithm chosen is fairly simple BUT I have to ask
if these bitmaps are static or can be changed at run time? I mean if you
have a region that is sparse and then you keep adding, does the software
pause and rewrite that region as a bitmap if it is a list of offsets? Or, if
a bitmap loses enough, ...

On to your other points. Again, understand I am talking way more abstractly
than you and thus it really does not matter what the length of a particular
ID in some country is for the main discussion. The assumption is that if you
are using something with limits, like a Roaring Bitmap, that you do things
within the limits. When I lived in Austria, I did not bother getting an
Austrian Sozialversicherungsnummer so I have no idea it is ten digits long.
In any case, many things grow over time such as the length of telephone
numbers. 

The same applies to things like airport codes. They can get longer for many
reasons and may well exceed 4 characters, and certainly UNICODE or other
such representations may exceed four bytes now if you allow non-ASCII
characters that map into multiple bytes. My point was to think about how
useful a Roaring bitmap is if it takes only 32 bit integers and one trivial
mapping was to use any four bytes to represent a unique integer. But clearly
you could map longer strings easily enough if you restrict yourself to 26
upper case letters and perhaps a few other symbols that can be encoded in 5
bits. I am not saying it is worth the effort, but that means 6 characters
can fit in 32 bits.

I do wonder if the basic idea has to be limited to 32 bits or if it can
expand to say 64 or use additional but fast methods of storing the data
beyond the two mentioned.

There are variants of other ideas I can think of like sparse arrays or
matrices such as you find in the scipy module in Python. If they hold a
Boolean value, they sound like they are a similar idea where you simply keep
track of the ones marked True, or if it makes sense, the ones considered
False.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On
Behalf Of Peter J. Holzer
Sent: Saturday, February 18, 2023 5:40 AM
To: python-list at python.org
Subject: Re: Comparing caching strategies

On 2023-02-17 18:08:09 -0500, avi.e.gross at gmail.com wrote:
> Analogies I am sharing are mainly for me to wrap my head around an 
> idea by seeing if it matches any existing ideas or templates and is 
> not meant to be exact. Fair enough?

Yeah. But if you are venting your musings into a public space you shouldn't
be surprised if people react to them. And we can only react to what you
write, not what you think.

> But in this case, from my reading, the analogy is rather reasonable.

Although that confused me a bit. You are clearly responding to something I
thought about but which you didn't quote below. Did I just think about it
and not write it, but you responded anyway because you're a mind reader?
Nope, it just turns out you (accidentally) deleted that sentence.

> The implementation of Roaring Bitmaps seems to logically break up the 
> space of all possible values it looks up into multiple "zones" that 
> are somewhat analogous to individual files,

That part is correct. But you presented that in the form of a
performance/space tradeoff, writing about "trying multiple methods" to find
the smallest, and that that makes compression slower. That may be the case
for pkzip, but it's not what RB does: Instead it uses a very simple
heuristic: If there are less than 25% of the bits set in a zone, it uses a
list of offsets, otherwise a plain bitmap. (according to their 2016 paper
which I just skimmed through again - maybe the implementation is a bit more
complex now). So I think your description would lead the reader to
anticipate problems which aren't there and probably miss ones which are
there. So I'll stay with my "not completely wrong but not very helpful"
assessment.

> I did not raise the issue and thus have no interest in promoting this 
> technology nor knocking it down. Just wondering what it was under the 
> hood and whether I might even have a need for it. I am not saying 
> Social Security numbers are a fit, simply that some form of ID number
might fit.

Yeah, that's the point: Any form of ID which is a small-ish integer number
fits.

And maybe it's just because I work with databases a lot, but representing
things with numeric ids (in relational databases we call them "surrogate
keys") is such a basic operation that it doesn't warrant more than a
sentence or two.

> If a company has a unique ID number for each client and uses it 
> consistently, then an implementation that holds a set stored this way 
> of people using product A, such as house insurance, and those using 
> product B, such as car insurance, and perhaps product C is an umbrella 
> policy, might easily handle some queries such as who uses two or all 
> three (intersections of sets) or who might merit a letter telling them 
> how much they can save if they subscribed to two or all three as a way 
> to get more business. Again, just  a made-up example I can think 
> about. A company which has a million customers over the years will 
> have fairly large sets as described.

A much better example. This is indeed how you would use roaring bitmaps.

> What is helpful to me in thinking about something will naturally often 
> not be helpful to you or others but nothing you wrote makes me feel my 
> first take was in any serious way wrong. It still makes sense to me.
> 
> And FYI, the largest integer in signed 32 bits is 2_147_483_647

I know. It's been some time since I could do hexadecimal arithmetic in my
head but the the limits of common data types are still burned into my brain
;-).

> which is 10 digits. A Social Security number look like xxx-xx-xxxx at 
> this time which is only 9 digits.

These are US social security numbers. Other countries have other schemes.
For example, Austrian SSNs have 10 digits, so you would need 34 bits to
represent them exactly. However, they (obviously) contain some redundancy
(one of the digits is a checksum, and there aren't 999999 days in a century)
so it could algorithmically be compressed down to 26 bits. But you probably
wouldn't do that because in almost any real application you wouldn't use the
SSN as a primary key (some people don't have one, and there have been mixups
resulting in two people getting the same SSN).

> As for my other EXAMPLE, I fail to see why I need to provide a 
> specific need for an application. I don't care what they need it for. 
> The thought was about whether something that does not start as an 
> integer can be uniquely mapped into and out of integers of size 32 bits.

That's what confused me. You seemed to concentrate on the "map things to
integers" part which has been solved for decades and is absolutely
incidental to roaring bitmaps and completely ignored what you would be using
them for.

So I thought I was missing something, but it seems I wasn't.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"