Comparing caching strategies

MRAB python at mrabarnett.plus.com
Sat Feb 18 15:34:41 EST 2023


On 2023-02-18 19:59, avi.e.gross at gmail.com wrote:
[snip]
> I do not know the internals of any Roaring Bitmap implementation so all I
> did gather was that once the problem is broken into accessing individual
> things I chose to call zones for want of a more specific name, then each
> zone is stored in one of an unknown number of ways depending on some logic.
> You say an implementation chose two ways and that is fine. But in theory, it
> may make sense to choose other ways as well and the basic outline of the
> algorithm remains similar. I can imagine if a region/zone is all the even
> numbers, then a function that checks if you are searching for an odd number
> may be cheaper. That is an example, not something I expect to see or that is
> useful enough. But the concept of storing a function as the determiner for a
> region is general enough that it can support many more realistic ideas.
> 
>   From what you wrote, the algorithm chosen is fairly simple BUT I have to ask
> if these bitmaps are static or can be changed at run time? I mean if you
> have a region that is sparse and then you keep adding, does the software
> pause and rewrite that region as a bitmap if it is a list of offsets? Or, if
> a bitmap loses enough, ...
> 
A region is either "sparse" (<= 4096 entries) or "dense" (> 4096 
entries). It'll be converted to the other type as and when necessary.

> On to your other points. Again, understand I am talking way more abstractly
> than you and thus it really does not matter what the length of a particular
> ID in some country is for the main discussion. The assumption is that if you
> are using something with limits, like a Roaring Bitmap, that you do things
> within the limits. When I lived in Austria, I did not bother getting an
> Austrian Sozialversicherungsnummer so I have no idea it is ten digits long.
> In any case, many things grow over time such as the length of telephone
> numbers.
> 
> The same applies to things like airport codes. They can get longer for many
> reasons and may well exceed 4 characters, and certainly UNICODE or other
> such representations may exceed four bytes now if you allow non-ASCII
> characters that map into multiple bytes. My point was to think about how
> useful a Roaring bitmap is if it takes only 32 bit integers and one trivial
> mapping was to use any four bytes to represent a unique integer. But clearly
> you could map longer strings easily enough if you restrict yourself to 26
> upper case letters and perhaps a few other symbols that can be encoded in 5
> bits. I am not saying it is worth the effort, but that means 6 characters
> can fit in 32 bits.
> 
The Unicode Consortium have said that Unicode will be limited to 
U+0000..U+10FFFF (21 bits).

> I do wonder if the basic idea has to be limited to 32 bits or if it can
> expand to say 64 or use additional but fast methods of storing the data
> beyond the two mentioned.
> 
> There are variants of other ideas I can think of like sparse arrays or
> matrices such as you find in the scipy module in Python. If they hold a
> Boolean value, they sound like they are a similar idea where you simply keep
> track of the ones marked True, or if it makes sense, the ones considered
> False.
> 
[snip]



More information about the Python-list mailing list