Understanding search queries, semantics, and "Meaning" ...aren't we all looking for meaning?

Tue Dec 30 16:50:51 EST 2008

On Dec 30, 12:35 pm, 5lvqbw... at sneakemail.com wrote:
> I have Section 4.4.1 of SICP rattling around in my head (database
> queries), and I'm trying to come up with a simple dictionary-based
> database in Python to represent circuit diagrams.  My main confusion
> isn't one of implementation, but a matter of "big thinking",
> fundamentally, about the problem. Please don't suggest using a SQL
> library, as I'm looking to learn to fish, so to speak, and to learn a
> bit about the biology of fish.
>

I'm going to break rule #1 of your requirements but in an unexpected
way. Rather than studying PostgreSQL, MySQL, or Oracle, why don't you
crack open the topic of relational database theory and relational
algebra? That should help with the "big thinking" bit in the same way
understanding 1+1 helps you understand how to add any two numbers
together. No, SQL is not the be-all-end-all of relational theory. Yes,
it does a pretty dang good job at it, though, which is why it is still
around.

> I've subclassed dict to hdict ("hashable dict") and rewritten the
> __hash__ function so I can include a dict into a set. Thanks to a
> previous poster here for providing that suggestion.
>
> A circuit component looks like this for example:
>
> comp1 = hdict(value=1e-6, footprint='0402', vendor='digikey')
> comp2 = hdict(value=10e3, footprint='0603', vendor='mouser')
> etc, etc.
>
> The database holds the values like this:
> db = dict() # normal dict
> db['value'] = ([1e-6, 10e3], [comp1, comp2]) #ordered set for fast
> lookup/insertion
> db['footprint'] = {'0402':set[comp1], '0603':comp2} # unordered uses
> normal dict for fast lookup
> db['vendor'] = {'digikey':[comp1], 'mouser':[comp2]}
>
> So basically the keys are the component parameters, and the values is
> the list of components with that value.  Stuff that is comparable is
> ordered; stuff that is discrete is not ordered, using either 2-tuples
> or dicts, respectively.
>
> This allows extremely fast lookup of components based on their
> properties, with O(1) performance for non-ordered stuff, and O(log n)
> performance for ordered stuff (using bisect methods).  The set
> operations are extremely fast, so I can do AND and OR operations on
> compound queries of this nature without worry.
>
> I need this speed not so much for selecting components when the user
> types in a query, but for when the mouse is hovering over the
> schematic and I need things to light up underneath, if the GUI is
> generating hundreds of mouse messages per second, and for inspector
> windows to appear quickly when you click, etc.  If you have ever used
> Altium, you can see how effective this is in terms of creating a good
> interactive user experience.
>

OK, turn it around in your head. Consider the indexes you built above
as yet another table. Consider what a table or a database really is.
When you see how they are all really the same thing, either you've
mastered relational algebra of you've seen the big picture. Kind of
like the cons cell being enough to describe any data structure in the
universe, a table is enough to describe everything in the relational
world.

> My question is what happens when I choose to search for NOT
> footprint='0402'.
>
> Should this return a blank list? This happens by default., and is in
> fact true: a blank list satisfies "not anything" actually.
> Should this return everything that is NOT footprint 0402 (ie returns
> 0603 components)?  This *strongly implies* a pre-selection of *all*
> objects before performing the negating function, or of looking at the
> ordering of other queries in the overall search term, and then
> applying NOT to the results of another query.
>
> But I'm suspicious of a brute force preselection of all objects
> whenever I see a NOT, and anyway it messes up the clean query/combiner
> method of search I'm doing, and it requires an implied sequence of
> search, where I'm pretty sure it should not rely on sequencing.  Even
> though this is single threaded, etc., the semantics of the query
> should not rely on ordering of the search term:
>
> footprint='0402' and NOT vendor='mouser' should return the same as
> NOT vendor='mouser' and footprint='0402'.
>
> So this is my philosophical quandary.  I'm not sure what the correct
> thing is.  In SICP they are using nondeterministic stuff which I don't
> quite get, so it's hard to follow.  Also they are not using
> dictionaries and hashes, so I'm not sure if their generate-and-test
> method would work here anyway.  Generate-and-test seems extremely
> inefficient.
>
> Can a wise guru please enlighten me?
>

(I don't think I qualify as a guru, but I think I see how I can help.)

In a typical SQL database, when you type in "SELECT foo FROM bar WHERE
baz='bo'", you are not writing a program, at least not in the sense of
Python or C or Java or Perl where you give instructions on HOW to run
the program. You are writing a program in the sense of Lisp or Scheme
or Haskell in that you are giving instructions on WHAT the program is.

The database doesn't, at least shouldn't, read in all the rows in a
table and then start going through the process of elimination removing
rows it doesn't want to think about. That's highly inefficient, of
course. Instead, it transforms the query (the WHAT) into a set of
procedures that describe HOW to get the result.

That step---taking the definition of the program and turning it into
actual instructions on HOW to run the program---is the compilation and
optimization phase. It's a very hard step to do, but no harder than
the process Python goes through taking files of bytes and converting
them into Python bytecode.

Oh, by the way, this step is nondeterministic. Why? Well, no one can
really say what the BEST way to run any sufficiently complicated
program is. We can point out good ways and bad ways, but not the best
way. It's like the travelling salesman problem in a way.

Now, that "hard stuff" you are glossing over in SICP is the IMPORTANT
stuff. It's the WHOLE POINT of that section. Stop glossing over it.
Keep working at it bit by bit until you understand it. You, as a
programmer, will be infinitely better for it. When you understand it,
you will really unlock the full potential of Python and Scheme and
whatever other language is out there because you will see how to go
from HOW languages to WHAT languages.

Programmers who can write compilers are GOOD programmers. Programmers
who can understand someone else's compilers are even better.

On Python-specific stuff: For a moment, forget about Python and hashes
and dicts and arrays and lists and such. Focus only on the cons cell
and Scheme. When you have mastered those concepts, you will see how to
efficiently use the Python data structures and how to simplify the
Scheme code into Python. Right now you are doing a bit of premature
optimization.

That's my semi-guru advice.