Python un-plugging the Interpreter

John Nagle nagle at animats.com
Fri Apr 20 15:19:18 EDT 2007


Steve Holden wrote:
> Marc 'BlackJack' Rintsch wrote:
> 
>> In <xRRVh.11903$Kd3.4637 at newssvr27.news.prodigy.net>, John Nagle wrote:
>>
>>>     Many cases are easy.  If a smart compiler sees
>>>
>>>     for i in range(n) :
>>>        ... # something
>>>
>>> and there are no other assignments to "i", then it's clear that
>>> "i" can be represented as an integer, without "boxing" into a
>>> general object.
>>
>>
>> How is it clear that `i` is restricted to integers?  That works only if
>> you assume `range` refers to the built-in `range()` function.  So the
>> smart compiler has to check all possible control flows up to this point
>> and be sure `range` was not bound to something different.
>>
> That is, of course, exactly what a smart compiler would do. Only 
> nowadays it would quite possibly do it just-in-time (so I suppose you 
> might call it the run-time, though the boundary is continually getting 
> more blurred) rather than as the result of static analysis, so it would 
> *know* that the built-in generator had been called.

    That's always a question in compiler implementation.  99.99% of the time,
"range" isn't redefined, and the optimization is a big win, so it's worth
doing.  With enough heavy machinery, you can make the hard cases work.
Or you can disallow some things that aren't really that useful.
Generally, this sort of thing is necessary if there's a legacy code
base which uses that feature, and otherwise not.  Scanning big databases
of code for uses of weird features is useful when making that decision.
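    As a toy demonstration of the hazard (the function names here are my
own, not from the thread): Python resolves the name "range" at run time,
so a module-level rebinding silently changes what an already-defined loop
iterates over, which is exactly what a specializing compiler must guard
against.

```python
import builtins

def loop_sum(n):
    # A specializing compiler would like to unbox "i" as a machine
    # integer here -- but only if "range" still names the built-in.
    total = 0
    for i in range(n):
        total += i
    return total

print(loop_sum(5))  # built-in range: prints 10

# Rebinding "range" at module level is picked up on the very next call,
# because the name is looked up at run time, not at compile time.
def range(n):  # shadows builtins.range within this module
    return [str(k) for k in builtins.range(n)]

try:
    loop_sum(5)  # now iterates over strings: int + str fails
except TypeError:
    print("specialization assumption violated")
```

A JIT handles this by compiling the fast integer path and attaching a
guard that falls back to the generic path if the binding ever changes.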

    A good example at the CPU level is the ability to store into the code of a
program that's running.  This was a popular programming technique
among some DOS programmers in the 1980s, and there's still some legacy
code that uses it.  Every IA-32 CPU supports this.  And it wasn't easy to
make that work.

    What happens when you store into code just about to be executed in
a modern superscalar CPU is something like this.

    Execution is going along, with the instruction pipeline filled perhaps
ten instructions ahead, with the next ten or so instructions decoded and ready
to go.  Several operations are in progress, perhaps a floating point
operation or two and some integer operations, a few memory fetches are underway,
and some memory stores are queued waiting for confirmation that everything is OK
for them to be committed.  "Commitment" works like a database commit; only when
everything is clear for the results of an instruction to be accepted, and
no bad cases like an exception or a page fault have occurred, do the
actual writes to memory occur.  If some exception occurs, the state of
the CPU has to be rolled back to just before the exception.  So there's
a back end on the CPU called the "retirement unit" that handles this
finishing and cleanup process.

    So that's the basic architecture, incredibly oversimplified.  So what
happens when code stores into itself?

    The code stores into an instruction about to be executed, one that's
already in the pipeline.  The "retirement unit" detects a conflict
between the store and the addresses in the instruction pipeline.
There's hardware to handle this, but since it's a very rare event,
it's not an optimized path in the CPU.

    So the retirement unit orders most of the CPU to do a stop and flush.
Instruction fetching stops.  The CPU stalls while all floating point
operations and loads run to completion.  The pipeline is run out, but
none of its results are committed.

    Then the pipeline and all intermediate results are flushed, losing
any pending work.  Pending stores after the store into the code are
flushed.  Predecoded instructions are flushed (a big cost on an
AMD CPU, which decodes a whole cache line at once).  The whole CPU
is backed up to just before the store into the code occurs.  Only
then can the store into the code be allowed to occur.  That store is
committed, done, and pushed out to cache.

    Next, the CPU execution engine restarts, as if from an interrupt.
Instructions are reloaded from cache and decoded.  The pipeline
is refilled.  The superscalar units crank up again, and finally,
normal execution resumes.  The whole thing has cost about as much
time as handling an interrupt.
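    The detect-flush-refetch sequence above can be sketched as a toy
model.  This is entirely my own construction for illustration; real
hardware is vastly more complex, and none of the names or numbers come
from any actual CPU.

```python
# Toy model of a CPU that detects a store into an instruction address
# already in its pipeline window, then flushes and refetches.

class ToyCPU:
    def __init__(self, code):
        self.code = list(code)      # "memory" holding instructions
        self.pc = 0
        self.pipeline = []          # (addr, instr) pairs fetched ahead
        self.flushes = 0

    def fetch(self, depth=10):
        # Keep the pipeline filled ~10 instructions ahead, as above.
        while (len(self.pipeline) < depth
               and self.pc + len(self.pipeline) < len(self.code)):
            addr = self.pc + len(self.pipeline)
            self.pipeline.append((addr, self.code[addr]))

    def store(self, addr, instr):
        self.code[addr] = instr
        # Retirement-unit check: does the store hit a fetched address?
        if any(a == addr for a, _ in self.pipeline):
            self.pipeline.clear()   # flush all predecoded work
            self.flushes += 1       # costs roughly an interrupt's time
            self.fetch()            # refetch from the updated memory

    def step(self):
        self.fetch()
        addr, instr = self.pipeline.pop(0)
        self.pc = addr + 1
        return instr

cpu = ToyCPU(["nop"] * 20)
cpu.fetch()
cpu.store(5, "patched")     # store hits code already in the pipeline
print(cpu.flushes)          # prints 1: flushed and refilled
```

A store outside the fetched window costs nothing in this model, which
mirrors why the slow path is acceptable: it is a very rare event.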

    Now, that's the x86 approach: allow all this to support legacy DOS
code.  On PowerPC, IBM simply disallowed it.  You can't store into code;
it's under memory protection.  To change code, the OS must remove
the executable mode from the page, load the memory with new code,
reset the executable mode, and explicitly tell the CPU to
invalidate the instruction cache.  This simplifies the CPU,
and since there was no legacy code base, it wasn't a problem.
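    The four OS steps can be sketched as a toy model of the protocol.
On a real system these would be page-protection and cache-maintenance
operations (e.g. mprotect plus an instruction-cache invalidate); the
classes and names below are my own illustration, not any real OS API.

```python
# Toy model: pages are never writable and executable at once, and the
# instruction cache must be explicitly invalidated after patching.

class Page:
    def __init__(self, code):
        self.mem = list(code)
        self.writable = False
        self.executable = True

class ToyMachine:
    def __init__(self, page):
        self.page = page
        self.icache = list(page.mem)   # stale copy until invalidated

    def store(self, addr, instr):
        if not self.page.writable:
            raise PermissionError("store into execute-only page")
        self.page.mem[addr] = instr

    def invalidate_icache(self):
        self.icache = list(self.page.mem)

    def execute(self, addr):
        assert self.page.executable
        return self.icache[addr]       # the CPU runs from the icache

def os_patch(machine, addr, instr):
    # The four OS steps from the text, in order:
    machine.page.executable = False    # 1. remove executable mode
    machine.page.writable = True
    machine.store(addr, instr)         # 2. load the new code
    machine.page.writable = False
    machine.page.executable = True     # 3. reset executable mode
    machine.invalidate_icache()        # 4. invalidate the i-cache

m = ToyMachine(Page(["nop"] * 8))
try:
    m.store(3, "patched")              # a direct store faults
except PermissionError as e:
    print(e)
os_patch(m, 3, "patched")
print(m.execute(3))                    # prints "patched"
```

Skipping the final invalidate would leave the CPU executing the stale
icache copy, which is why the explicit step is part of the contract.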

    You have to look at the legacy code base to decide if features
like that are worth keeping.

				John Nagle



