Comparing caching strategies

avi.e.gross at gmail.com avi.e.gross at gmail.com
Sat Feb 18 18:04:25 EST 2023


It is not an unusual pattern, Thomas, to handle different parts of an object selectively rather than treat every part just one way.

The history of computing has often been one where you had to deal with scarcity of expensive resources.

Consider the Python "list" as a rather wasteful container that is best used for diverse contents. As soon as all the contents are of the same basic type, you start wondering if you are better off with a restricted type that holds only one kind of item, often something like a numpy array. Or you start wondering if a better way to store the contents is a dictionary instead, as searching it is often far faster.
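A minimal sketch of that trade-off, assuming numpy is available (the sample data here is made up):

    import numpy as np

    # A heterogeneous list is fine as a general-purpose container.
    mixed = [1, "two", 3.0, None]

    # Once everything is the same basic type, a typed container can pay off.
    measurements = [0.5, 1.25, 2.75, 3.5]
    arr = np.array(measurements, dtype=np.float64)   # compact, vectorized math

    # If membership tests dominate, a dict (or set) gives hashed lookups
    # instead of the list's linear scan.
    index = {value: pos for pos, value in enumerate(measurements)}
    print(2.75 in index)   # True, found without scanning the whole list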

But fundamentally, there is nothing WRONG with using a Python list, as it handles almost anything, and for small sample sizes it is often not worth spending time redoing it. Still, if you implement a 3-D matrix as a list of lists of lists, and use slicing and other techniques to do things like matrix multiplication, it gets unwieldy enough that it is arguably the wrong tool.
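To make that concrete, here is a small illustrative sketch (not from the original post) of a hand-rolled matrix product over nested lists next to the numpy equivalent:

    import numpy as np

    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]

    # Matrix product over nested lists: tolerable at 2x2, but increasingly
    # unwieldy and slow as the dimensions and the nesting grow.
    product = [[sum(a[i][k] * b[k][j] for k in range(len(b)))
                for j in range(len(b[0]))]
               for i in range(len(a))]

    # The same operation with numpy arrays, which are built for this.
    print(np.array(a) @ np.array(b))   # same numbers, far less code
    print(product)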

If you look at the history of Python, they deliberately left out many of the containers other programming languages used and stressed lists and tuples; vectors and arrays were not the focus. Later things had to be added as people noted that generality has costs. If you look at a language like JavaScript, it is beyond weird how they decided to use the attributes of an object to make an array. An object might have attributes like "name" and "length" alongside ones like "1" and "2" and "666", and when you want to treat the object like an array, it looks at the attributes, ignores the non-numerical ones, and behaves like a sparsely populated array/vector of items indexed from 1 to 666. You can do all kinds of arithmetic operations, add missing indices, or remove some, and it still works like a typical array, but with weird costs: inserting a new item as the 1st or 3rd element may mean renumbering every higher index the hard way. It works, but strikes me as a kludge.

If you look within Python at numpy and pandas and similar libraries, they are well aware of the idea of abstracting out concepts with different implementations. They use lots of Python tools that access different storage methods as needed, often through objects that implement various "protocols", to the point where manipulating them all in similar ways seems straightforward. In principle, you could create something like a pandas DataFrame and then call a function that examines the columns, and perhaps rows, and returns a modified (or new) DataFrame where the contents have been adjusted so the storage is smaller or can be accessed more efficiently, based on any other preferences and hints you supply, such as whether it can be immutable. A column which is all 1's or all True can obviously be stored other ways, and one whose values are all under 256 can, again, use less storage.
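A hedged sketch of what such an adjustment might look like done by hand with real pandas calls (the automatic examine-and-refactor function described above is hypothetical, so the shrinking is spelled out explicitly here):

    import pandas as pd

    df = pd.DataFrame({
        "flag": [True] * 5,                  # a column that is all True
        "small": [1, 2, 3, 200, 255],        # every value fits in one unsigned byte
        "label": ["a", "b", "a", "a", "b"],  # only a few distinct strings
    })

    compact = pd.DataFrame({
        "flag": df["flag"],                                        # already 1 byte per row
        "small": pd.to_numeric(df["small"], downcast="unsigned"),  # int64 -> uint8
        "label": df["label"].astype("category"),                   # codes instead of strings
    })

    print(df.memory_usage(deep=True).sum(),
          compact.memory_usage(deep=True).sum())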

Of course this would need to be done very carefully, so that changes which violate the assumptions either trigger a refactoring to a version that handles them or result in an error. A long-running system could be set to keep track of how an object is used and perhaps make adjustments. As one example, after some number of accesses you might change a function "live" to begin caching, or have an object reconfigure itself to be faster to search even if it occupies more room.
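A rough sketch of that "start caching once it seems worthwhile" idea; the decorator, threshold, and function names below are made up for illustration:

    import functools

    def cache_after(threshold):
        # Leave the function uncached until it has been called `threshold`
        # times, then route further calls through a memoizing wrapper.
        def decorator(func):
            cached = functools.lru_cache(maxsize=None)(func)
            calls = 0

            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                nonlocal calls
                calls += 1
                if calls > threshold:
                    return cached(*args, **kwargs)   # hot enough: use the cache
                return func(*args, **kwargs)         # still cheap: just call it
            return wrapper
        return decorator

    @cache_after(threshold=100)
    def expensive_lookup(key):
        ...  # pretend this does slow work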

Back to Python lists, a typical change is simply to convert one to a tuple, which makes it more compact to store and marks the contents as fixed. And, now that dictionaries maintain insertion order, if the items can serve as unique keys you may sometimes want to convert the list to a dict.
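For example (a trivial sketch with made-up data):

    names = ["alice", "bob", "carol"]

    frozen = tuple(names)      # an immutable, slightly more compact snapshot

    # If each item can serve as a unique key, a dict keeps insertion order
    # (guaranteed since Python 3.7) and gives hashed membership tests.
    positions = {name: i for i, name in enumerate(names)}
    print("bob" in positions)  # True, without scanning the whole list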

But I hasten to note that some "improvements" may not really improve. In a language like R, many operations such as changing one column (vector) in a data.frame are done lazily, and the new copy still shares most of its storage with the old one. Making some changes can have negative effects. A common problem people hit is that storing some objects in an existing vector can appear to work, except that afterward the entire vector has been coerced into one that can carry any of the contents. A vector of integers may become a vector of doubles, or even a vector of characters whose entries now look like the string "777". The flexibility invites lots of programming errors!

Note how this can cause problems with the original idea here of caching strategies. Imagine a function that checks the environment to see what encoding, human language, and so on it should produce text in. If you cache its results in something like a dictionary keyed by the query, and the user later changes the environment while the program keeps running, the cache may now contain invalid results. You might need to keep track of the environment and empty the cache when things change, or start a new cache and switch to it (a small sketch of one way to fold the environment into the cache key appears after the language examples below). An example is the way I use Google Translate. I am sometimes studying or using a language and want to look up a word, phrase, or entire sentence. If Google Translate keeps track, it may notice repeated requests like "Valentine's Day" and cache the result for re-use. But I often click to switch languages and see if the other one uses a similar or different way to describe what it means, or something similar but spelled another way. German does the latter, with Valentinstag being a fairly literal translation, as do Dutch (Valentijnsdag) and Hungarian (Valentin nap).

But Hebrew calls it the holiday of love, sort of (חג האהבה). Portuguese is similar but includes the day as well as the love (Dia dos Namorados).

Esperanto tosses in more about sainthood (Sankta Valentín), and in a sense Spanish does both, with day and saint (Día de San Valentín).

You could keep playing with such phrases quite a bit, since Translate translates N languages into N languages. So even caching from English to anything is problematic, and the same spelling can be found in many languages, while the translation of that word into English also varies.
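Returning to the earlier point about the environment shifting under a running program, here is a minimal, hypothetical sketch of folding the current locale into the cache key, so that a change simply misses the old entries instead of returning stale text (the render/_render names are made up):

    import functools
    import locale

    @functools.lru_cache(maxsize=1024)
    def _render(message_id, language):
        ...  # placeholder for the expensive per-language lookup/formatting

    def render(message_id):
        # Whatever "environment" means for the program -- the current locale
        # in this sketch -- becomes part of the key the cache sees.
        language = locale.getlocale()[0] or "C"
        return _render(message_id, language)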

So if you had a function like ask_translate(from, to, phrase), then something like a bare @lru_cache() decorator might not suffice on its own, as caching could require either combining the three arguments into one canonical key, or multiple caches accessed through something like an NxN data structure that selects which cache to use.
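A hedged sketch of the second approach: a separate cache per language pair, selected by the (from, to) combination. The ask_translate function and its backend are hypothetical, and the parameters are renamed src/dest here because "from" is a Python keyword:

    from collections import defaultdict

    _caches = defaultdict(dict)   # one small cache per (source, target) pair

    def ask_translate(src, dest, phrase):
        cache = _caches[(src, dest)]   # pick the cache for this language pair
        if phrase not in cache:
            cache[phrase] = _call_translate_service(src, dest, phrase)
        return cache[phrase]

    def _call_translate_service(src, dest, phrase):
        ...  # placeholder for the real (slow, remote) translation request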

And there are other considerations, such as not using raw queries as keys, but first converting them to some canonical format: removing leading and trailing whitespace, collapsing runs of whitespace between words, shifting everything to lower case, removing some punctuation, and perhaps replacing variant special characters with a single form. Otherwise the cache may grow too large and often be less effective.
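A rough sketch of such a canonicalization step; the exact rules chosen here are illustrative, not prescriptive:

    import re
    import unicodedata

    def canonical(phrase):
        # Normalize a query before using it as a cache key.
        phrase = unicodedata.normalize("NFC", phrase)   # one composed form per character
        phrase = phrase.strip().lower()                 # trim the ends, fold case
        phrase = re.sub(r"\s+", " ", phrase)            # collapse inner whitespace
        phrase = re.sub(r"[\"'.,!?]+$", "", phrase)     # drop trailing punctuation
        return phrase

    # "  Valentines   Day! " and "valentines day" now share one cache entry.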

As stated earlier, any analogies or comparisons I use are for the purpose of discussion only and opinions are mine.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On Behalf Of Thomas Passin
Sent: Saturday, February 18, 2023 4:00 PM
To: python-list at python.org
Subject: Re: Comparing caching strategies

On 2/18/2023 2:59 PM, avi.e.gross at gmail.com wrote:
> I do not know the internals of any Roaring Bitmap implementation so 
> all I did gather was that once the problem is broken into accessing 
> individual things I chose to call zones for want of a more specific 
> name, then each zone is stored in one of an unknown number of ways depending on some logic.

Somewhat tangential, but back quite a while ago there was a C library called IIRC "julia list". It implemented lists in several different ways, some quite sophisticated, depending on the size and usage pattern. I remembered it as soon as I took a look at Roaring Bitmap and saw that the latter can use different representations depending on size and access patterns.
