What is a type error?

Sun Jul 16 05:00:58 EDT 2006

Marshall schrieb:
> Joachim Durchholz wrote:
>> Marshall schrieb:
>>> I would say, records in SQL have value, and their
>>> identity is exactly their value.
 >>
>> Definitely not. You can have two equal records and update just one of
>> them, yielding non-equal records; by my definition (and by intuition),
>> this means that the records were equal but not identical.
> 
> This sort of thing comes up when one has made the mistake of
> not defining any keys on one's table and needs to rectify the
> situation. It is in fact considered quite a challenge to do.
> My preferred technique for such a situation is to create
> a new table with the same columns and SELECT DISTINCT ...
> INTO ... then recreate the original table. So that way doesn't
> fit your model.

I was mentioning multiple equal records just as an example where the 
records have an identity that's independent of the record values.

> How else might you do it? I suppose there are nonstandard
> extensions, such as UPDATE ... LIMIT. And in fact there
> might be some yucky thing that made it in to the standard
> that will work.

Right, it might be difficult to update multiple records that are exactly 
the same. It's not what SQL was intended for, and difficult to do anyway 
- I was thinking of LIMIT (I wasn't aware that it's nonstandard), and I 
agree that there may be other ways to do it.

However, I wouldn't overvalue that case. The Jane/John Doe example 
posted elsewhere in this thread is strictly within the confines of what 
SQL was built for, yet there is aliasing.

> But what you descrbe is certainly *not* possible in the
> relational algebra; alas that SQL doesn't hew closer
> to it. Would you agree?

Yup, SQL (particularly its update semantics) aren't relational semantics.
Still, it's SQL we've been talking about.

 > Would you also agree that
> if a language *did* adhere to relation semantics,
> then relation elements would *not* have identity?
> (Under such a language, one could not have two
> equal records, for example.)

Any identity that one could define within relational semantics would be 
just equality, so no, it wouldn't be very helpful (unless as part of a 
greater framework).
However, that's not really a surprise. Aliasing becomes relevant (even 
observable) only when there's updates, and relational algebra doesn't 
have updates.

>>  > I do not see that they have
>>> any identity outside of their value. We can uniquely identify
>>> any particular record via a key, but a table may have more
>>> than one key, and an update may change the values of one
>>> key but not another. So it is not possible in general to
>>> definitely and uniquely assign a mapping from each record
>>> of a table after an update to each record of the table before
>>> the update, and if you can't do that, then where
>>> is the record identity?
>> Such a mapping is indeed possible. Simply extend the table with a new
>> column, number the columns consecutively, and identify the records via
>> that column.
> 
> That doesn't work for me. It is one thing to say that for all
> tables T, for all elements e in T, e has identity. It is a different
> thing to say that for all tables T, there exists a table T' such
> that for all elements e in T', e has identity.

Let me continue that argument:
Records in T' have identity.
 From an SQL point-of-view, there's no difference between the records in 
T' and records in other tables, so they must share all properties, in 
particular that of having an identity.

(I agree that's not as convincing as seeing aliasing in action. See below.)

>> But even if you don't do that, there's still identity. It is irrelevant
>> whether the programs can directly read the value of the identity field;
>> the adverse effects happen because updates are in-place. (If every
>> update generated a new record, then we'd indeed have no identity.)
>>
>>> Okay. At this point, though, the term aliasing has become extremely
>>> general. I believe "i+1+1" is an alias for "i+2" under this definition.
>> No, "i+1+1" isn't an alias in itself. It's an expression - to be an
>> alias, it would have to be a reference to something.
>>
>> However, a[i+1+1] is an alias to a[i+2]. Not that this is particularly
>> important - 1+1 is replacable by 2 in every context, so this is
>> essentially the same as saying "a[i+2] is an alias of a[i+2]", which is
>> vacuously true.
> 
> To me, the SQL with the where clause is like i+2, not like a[2].

I'd say WHERE is like [i+2]: neither is valid on its own, it's the 
"selector" part of a reference into a table.

> A query is simply an expression that makes mention of a variable.
> However I would have to admit that if SQL table elements have
> identity, it would be more like the array example.

OK.

> (The model of SQL I use in my head is one with set semantics,
> and I am careful to avoid running in to bag semantics by, for
> example, always defining at least one key, and using SELECT
> DISTINCT where necessary. But it is true that the strict set
> semantics in my head are not the wobbly bag semantics in
> the standard.)

Set-vs.-bag doesn't affect SQL's aliasing in general, it was a specific 
property of the example that I chose. So you don't have to worry wether 
that might be the cause of the misunderstanding.

>>> That is so general that I am concerned it has lost its ability to
>>> identify problems specific to pointers.
 >>
>> If the reference to "pointers" above means "references", then I don't
>> know about any pointer problems that cannot be reproduced, in one form
>> or the other, in any of the other aliasing mechanisms.
> 
> It seems we agree that, for example, raw C pointers can easily
> cause aliasing problems. To me, Java references, which are much
> more tightly controlled, are still subject to aliasing problems,
> although
> in practice the situation is significantly less severe. (It is possible
> to have multiple references to a single Java mutable object and
> not know it.)

Exactly.

 > I would assume the same is posible with, say, SML refs.

So do I.

> However the situation in SQL strikes me as being different.
> If one is speaking of base tables, and one has two queries,
> one has everything one needs to know simply looking at
> the queries. If views are involved, one must further examine
> the definition of the view.

Sure. Usually, SQL itself is enough abstract that no additional 
abstraction happens; so even if you have aliasing, you usually know 
about it.
I.e. when compared to what you were saying about Java: "It is possible
to have multiple references to a single Java mutable object and
not know it.", the "and not know it" part is usually missing.

The picture changes as soon as the SQL statements get hidden behind 
layers of abstraction, say result sets and prepared query handles.

... Um... let me also restate the preconditions for aliasing become a 
problem. You need three elements for that:
1) Identities that can be stored in multiple places.
2) Updates.
3) Abstraction (of updatable entities).

Without identities, there cannot be aliasing.
Without updates, the operation that makes aliasing problematic is excluded.
Without abstraction, you may have aliasing but can clearly identify it, 
and don't have a problem.

I think the roadblock you're hitting is that you keep looking for 
nonlocal trouble to identify spots where SQL might have aliasing; since 
SQL isn't used with abstraction often, you come up empty and conclude 
there's no aliasing.
My position is that not all aliasing is evil, and that there's a 
technical definition: two program entities (l-expressions in C parlance) 
are aliases if changing the referenced data through one also changes the 
other.

> For example, we might ask in C, if we update a[i], will
> the value of b[j] change? And we can't say, because
> we don't know if there is aliasing. But in Fortran, we
> can say that it won't, because they are definitely
> different variables.

Even that may be the case.
There's an EQUIVALENCE statement that can explicitly alias array sections.
Also, I suspect that newer versions of Fortran have reference parameters.

 > But even so in Fortran, if we ask
> if we update a[i], will a[j] change, and we know we
> can't tell unless we know the values of i and j.

In fact, that's the usual form of aliasing in Fortran.

> In SQL, if we update table T, will a SELECT on table
> S change? We know it won't; they are different variables.
> But if we update T, will a select on T change? It
> certainly might, but I wouldn't call it aliasing, because
> the two variables involved are the same variable, T.

I fail to see how this is different from the aliasing in a[i] and a[j].

It's just the same kind of aliasing that I have with the statements
   SELECT * FROM T WHERE key = $i
and
   SELECT * FROM T WHERE key = $j
(where $i and $j are placeholders for variables from program variables).

After running
   UPDATE T SET f = 5 WHERE key = $i
not only the first SELECT may give different results, the second may, 
too (in those cases where $i and $j happen to have the same value).

I can also construct pointer-like aliasing in SQL. If I have
   SELECT * FROM employees WHERE country = "US"
and
   SELECT * FROM employees WHERE name = "Doe"
then somebody with a grudge against all US employees running
   UPDATE employees SET salary = 0 WHERE country = "US"
will (on purpose or not) also modify the results of the second query.

Not a problem while all three queries are in plain view, but imagine 
them wrapped in some opaque object or behind a module interface.

> I am still not convinced. It is not sufficient to show how sometimes
> you can establish identity, or how you could establish identity based
> on some related table; it must be shown how the identity can be
> established in general, and I don't think it can. If it can't, then
> I claim we are back to expressions that include a variable, and
> not aliasing, even by the wider definition.

See above.

I admit that identity cannot be reliably established in SQL. The 
identity of a record isn't available (unless via OID extensions).
However, it is there and it can have effect.

Translated to real life, "my best friend" and "my employer" may be 
referring to the same identity, even if these descriptions may change in 
the future.

>>  > It seems to me there are more different
>>> styles of parameter passing than there are isolation levels, but
>>> I don't usually see (competent) people (such as yourself) getting
>>> call-by-value confused with call-by-reference.
>> Indeed.
>> Though the number of parameter passing mechanisms isn't that large
>> anyway. Off the top of my head, I could recite just three (by value, by
>> reference, by name aka lazy), the rest are just combinations with other
>> concepts (in/out/through, most notably) or a mapping to implementation
>> details (by reference vs. "pointer by value" in C++, for example).
> 
> Fair enough. I could only remember three, but I was sure someone
> else could name more. :-)

I got a private mail naming more parameter passing mechanisms. It's a 
bit difficult define what's a parameter passing mechanism and what's 
just an attribute for such a mechanism. If you count all combinations as 
an extra mechanism, you can easily get dozens.

Regards,
Jo