Comparing caching strategies

Peter J. Holzer hjp-python at hjp.at
Sat Feb 18 05:40:04 EST 2023


On 2023-02-17 18:08:09 -0500, avi.e.gross at gmail.com wrote:
> Analogies I am sharing are mainly for me to wrap my head around an idea by
> seeing if it matches any existing ideas or templates and is not meant to be
> exact. Fair enough?

Yeah. But if you are venting your musings into a public space you
shouldn't be surprised if people react to them. And we can only react to
what you write, not what you think.

> But in this case, from my reading, the analogy is rather reasonable.

Although that confused me a bit. You are clearly responding to something
I thought about but which you didn't quote below. Did I just think about
it and not write it, but you responded anyway because you're a mind
reader? Nope, it just turns out you (accidentally) deleted that sentence.


> The implementation of Roaring Bitmaps seems to logically break up the
> space of all possible values it looks up into multiple "zones" that
> are somewhat analogous to individual files,

That part is correct. But you presented that in the form of a
performance/space tradeoff, writing about "trying multiple methods" to
find the smallest, and that that makes compression slower. That may be
the case for pkzip, but it's not what RB does: Instead it uses a very
simple heuristic: If there are less than 25% of the bits set in a zone,
it uses a list of offsets, otherwise a plain bitmap. (according to
their 2016 paper which I just skimmed through again - maybe the
implementation is a bit more complex now). So I think your description
would lead the reader to anticipate problems which aren't there and
probably miss ones which are there. So I'll stay with my "not completely
wrong but not very helpful" assessment.


> I did not raise the issue and thus have no interest in promoting this
> technology nor knocking it down. Just wondering what it was under the hood
> and whether I might even have a need for it. I am not saying Social Security
> numbers are a fit, simply that some form of ID number might fit.

Yeah, that's the point: Any form of ID which is a small-ish integer
number fits.

And maybe it's just because I work with databases a lot, but
representing things with numeric ids (in relational databases we call
them "surrogate keys") is such a basic operation that it doesn't warrant
more than a sentence or two.


> If a company has a unique ID number for each client and uses it
> consistently, then an implementation that holds a set stored this way
> of people using product A, such as house insurance, and those using
> product B, such as car insurance, and perhaps product C is an umbrella
> policy, might easily handle some queries such as who uses two or all
> three (intersections of sets) or who might merit a letter telling them
> how much they can save if they subscribed to two or all three as a way
> to get more business. Again, just  a made-up example I can think
> about. A company which has a million customers over the years will
> have fairly large sets as described. 

A much better example. This is indeed how you would use roaring bitmaps.


> What is helpful to me in thinking about something will naturally often not
> be helpful to you or others but nothing you wrote makes me feel my first
> take was in any serious way wrong. It still makes sense to me.
> 
> And FYI, the largest integer in signed 32 bits is 2_147_483_647

I know. It's been some time since I could do hexadecimal arithmetic
in my head but the the limits of common data types are still burned into
my brain ;-).

> which is 10 digits. A Social Security number look like xxx-xx-xxxx at
> this time which is only 9 digits.

These are US social security numbers. Other countries have other
schemes. For example, Austrian SSNs have 10 digits, so you would need 34
bits to represent them exactly. However, they (obviously) contain some
redundancy (one of the digits is a checksum, and there aren't 999999
days in a century) so it could algorithmically be compressed down to 26
bits. But you probably wouldn't do that because in almost any real
application you wouldn't use the SSN as a primary key (some people don't
have one, and there have been mixups resulting in two people getting the
same SSN).


> As for my other EXAMPLE, I fail to see why I need to provide a specific need
> for an application. I don't care what they need it for. The thought was
> about whether something that does not start as an integer can be uniquely
> mapped into and out of integers of size 32 bits.

That's what confused me. You seemed to concentrate on the "map things to
integers" part which has been solved for decades and is absolutely
incidental to roaring bitmaps and completely ignored what you would be
using them for.

So I thought I was missing something, but it seems I wasn't.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <https://mail.python.org/pipermail/python-list/attachments/20230218/f69c50d3/attachment.sig>


More information about the Python-list mailing list