What is a type error?

Sat Jul 15 22:36:16 EDT 2006

Joachim Durchholz wrote:
> Marshall schrieb:
> > Joachim Durchholz wrote:
> >> As I said elsewhere, the record has an identity even though it isn't
> >> explicit in SQL.
> >
> > Hmmmm. What can this mean?
> >
> > In general, I feel that "records" are not the right conceptual
> > level to think about.
>
> They are, when it comes to aliasing of mutable data. I think it's
> justified by the fact that aliased mutable data has a galling tendency
> to break abstraction barriers. (More on this on request.)

I can see how pointers, or other kinds of aliasing
(by my more restricted definition) break abstraction barries;
it is hard to abstract something that can change out from
under you uncontrollably.

> > In any event, I am not sure what you mean by non-explicit
> > identity.
>
> The identity isn't visible from inside SQL. (Unless there's an OID
> facility available, which *is* an explicit identity.)

Agreed about OIDs.

>  > I would say, records in SQL have value, and their
> > identity is exactly their value.
>
> Definitely not. You can have two equal records and update just one of
> them, yielding non-equal records; by my definition (and by intuition),
> this means that the records were equal but not identical.

This sort of thing comes up when one has made the mistake of
not defining any keys on one's table and needs to rectify the
situation. It is in fact considered quite a challenge to do.
My preferred technique for such a situation is to create
a new table with the same columns and SELECT DISTINCT ...
INTO ... then recreate the original table. So that way doesn't
fit your model.

How else might you do it? I suppose there are nonstandard
extensions, such as UPDATE ... LIMIT. And in fact there
might be some yucky thing that made it in to the standard
that will work.

But what you descrbe is certainly *not* possible in the
relational algebra; alas that SQL doesn't hew closer
to it. Would you agree? Would you also agree that
if a language *did* adhere to relation semantics,
then relation elements would *not* have identity?
(Under such a language, one could not have two
equal records, for example.)

[The above paragraph is the part of this post that I
really care about; feel free to ignore the rest if it's naive
or annoying or whatever, but please do respond to this.]

>  > I do not see that they have
> > any identity outside of their value. We can uniquely identify
> > any particular record via a key, but a table may have more
> > than one key, and an update may change the values of one
> > key but not another. So it is not possible in general to
> > definitely and uniquely assign a mapping from each record
> > of a table after an update to each record of the table before
> > the update, and if you can't do that, then where
> > is the record identity?
>
> Such a mapping is indeed possible. Simply extend the table with a new
> column, number the columns consecutively, and identify the records via
> that column.

That doesn't work for me. It is one thing to say that for all
tables T, for all elements e in T, e has identity. It is a different
thing to say that for all tables T, there exists a table T' such
that for all elements e in T', e has identity.

> But even if you don't do that, there's still identity. It is irrelevant
> whether the programs can directly read the value of the identity field;
> the adverse effects happen because updates are in-place. (If every
> update generated a new record, then we'd indeed have no identity.)
>
> > Okay. At this point, though, the term aliasing has become extremely
> > general. I believe "i+1+1" is an alias for "i+2" under this definition.
>
> No, "i+1+1" isn't an alias in itself. It's an expression - to be an
> alias, it would have to be a reference to something.
>
> However, a[i+1+1] is an alias to a[i+2]. Not that this is particularly
> important - 1+1 is replacable by 2 in every context, so this is
> essentially the same as saying "a[i+2] is an alias of a[i+2]", which is
> vacuously true.

To me, the SQL with the where clause is like i+2, not like a[2].
A query is simply an expression that makes mention of a variable.
However I would have to admit that if SQL table elements have
identity, it would be more like the array example.

(The model of SQL I use in my head is one with set semantics,
and I am careful to avoid running in to bag semantics by, for
example, always defining at least one key, and using SELECT
DISTINCT where necessary. But it is true that the strict set
semantics in my head are not the wobbly bag semantics in
the standard.)

> There's another aspect here. If two expressions are always aliases to
> the same mutable, that's usually easy to determine; this kind of
> aliasing is usually not much of a problem.
> What's more of a problem are those cases where there's occasional
> aliasing. I.e. a[i] and a[j] may or may not be aliases of each other,
> depending on the current value of i and j, and *that* is a problem
> because the number of code paths to be tested doubles. It's even more of
> a problem because testing with random data will usually not uncover the
> case where the aliasing actually happens; you have to go around and
> construct test cases specifically for the code paths that have aliasing.
> Given that references may cross abstraction barriers (actually that's
> often the purpose of constructing a reference in the first place), this
> means you have to look for your test cases at multiple levels of
> software abstraction, and *that* is really, really bad.
>
> > That is so general that I am concerned it has lost its ability to
> > identify problems specific to pointers.
>
> If the reference to "pointers" above means "references", then I don't
> know about any pointer problems that cannot be reproduced, in one form
> or the other, in any of the other aliasing mechanisms.

It seems we agree that, for example, raw C pointers can easily
cause aliasing problems. To me, Java references, which are much
more tightly controlled, are still subject to aliasing problems,
although
in practice the situation is significantly less severe. (It is possible
to have multiple references to a single Java mutable object and
not know it.) I would assume the same is posible with, say, SML
refs.

However the situation in SQL strikes me as being different.
If one is speaking of base tables, and one has two queries,
one has everything one needs to know simply looking at
the queries. If views are involved, one must further examine
the definition of the view.

For example, we might ask in C, if we update a[i], will
the value of b[j] change? And we can't say, because
we don't know if there is aliasing. But in Fortran, we
can say that it won't, because they are definitely
different variables. But even so in Fortran, if we ask
if we update a[i], will a[j] change, and we know we
can't tell unless we know the values of i and j.

In SQL, if we update table T, will a SELECT on table
S change? We know it won't; they are different variables.
But if we update T, will a select on T change? It
certainly might, but I wouldn't call it aliasing, because
the two variables involved are the same variable, T.

> > Again, by generalizing the term this far, I am concerned with a
> > loss of precision. If "joe" in the prolog is a references, then
> > "reference" is just a term for "data" that is being used in a
> > certain way. The conection with a specfic address space
> > has been lost in favor of the full domain of the datatype.
>
> Aliasing is indeed a more general idea that goes beyond address spaces.
>
> However, identity and aliasing can be defined in fully abstract terms,
> so I welcome this opportunity to get rid of a too-concrete model.
>
> >> The records still have identities. It's possible to have two WHERE
> >> clauses that refer to the same record, and if you update the record
> >> using one WHERE clause, the record returned by the other WHERE clause
> >> will have changed, too.
> >
> > Is this any different from saying that an expression that includes
> > a variable will produce a different value if the variable changes?
>
> Yes.
> Note that the WHERE clause properly includes array indexing (just set up
> a table that has continuous numeric primary key, and a single other column).

I am still not convinced. It is not sufficient to show how sometimes
you can establish identity, or how you could establish identity based
on some related table; it must be shown how the identity can be
established in general, and I don't think it can. If it can't, then
I claim we are back to expressions that include a variable, and
not aliasing, even by the wider definition.

> I.e. I'm not talking about how a[i] is an alias of a[i+1] after updating
> i, I'm talking about how a[i] may be an alias of a[j].
>
> > It seems odd to me to suggest that "i+1" has identity.
>
> It doesn't (unless it's passed around as a closure, but that's
> irrelevant to this discussion).
> "i" does have identity. "a[i]" does have identity. "a[i+1]" does have
> identity.
> Let me say that for purposes of this discussion, if it can be assigned
> to (or otherwise mutated), it has identity. (We *can* assign identity to
> immutable things, but it's equivalent to equality and not interesting
> for this discussion.)
>
>  > I can see
> > that i has identity, but I would say that "i+1" has only value.
>
> Agreed.

Cool.

> > But perhaps the ultimate upshoot of this thread is that my use
> > of terminology is nonstandard.
>
> It's somewhat restricted, but not really nonstandard.

Whew!

> >> Possibly. There are so many isolation levels that I have to look them up
> >> whenever I want to get the terminology 100% correct.
> >
> > Hmmm. Is it that there are so many, or that they are simply not
> > part of our daily concern?
>
> I guess it's the latter. IIRC there are four or five isolation levels.

Yes, four.

>  > It seems to me there are more different
> > styles of parameter passing than there are isolation levels, but
> > I don't usually see (competent) people (such as yourself) getting
> > call-by-value confused with call-by-reference.
>
> Indeed.
> Though the number of parameter passing mechanisms isn't that large
> anyway. Off the top of my head, I could recite just three (by value, by
> reference, by name aka lazy), the rest are just combinations with other
> concepts (in/out/through, most notably) or a mapping to implementation
> details (by reference vs. "pointer by value" in C++, for example).

Fair enough. I could only remember three, but I was sure someone
else could name more. :-)

Marshall