[icfp-discuss] The "Eigenratio" of Self-Interpreters
Clive Gifford
clive at xtra.co.nz
Sat Aug 12 22:38:27 EDT 2006
Greetings
I'd like to share a small "discovery" that I have made in the last few days.
I'm not sure if I am the first to "discover" this particular result or not.
Perhaps it is already well known in some circles, or "obvious" to others (in
which case I apologise in advance), but it is new for me, and I haven't been
able to find anything else on the Net that appears similar. Therefore any
feedback you may be willing to provide on its originality or otherwise, and
any references to descriptions of similar material elsewhere would be
appreciated!
My "discovery" concerns what I will refer to (for reasons that will become
clear later) as the "eigenratio" of a self-interpreter (or "meta-circular
interpreter" if you prefer that terminology). It came about from some
experiments I did after the contest ended, triggered initially by the
organisers' releasing their reference implementation of an interpreter for
the Universal Machine, in the form of a binary file (available from
http://icfpcontest.org/um.um). This interpreter is also a
"self-interpreter", able to run itself as well as other UM programs.
In some earlier postings (rather grandiosely entitled "Clive's Conjecture")
I mentioned some interesting (to me) things that I had noticed when timing
how long it took for a stack of multiple copies of um.um to interpret some
other top level programme, with some other interpreter at the very bottom of
this "stack".
To recap briefly, you can construct a stack of N um.um interpreters with
some other program on top of that and then run the whole thing with some
other valid interpreter (your own C implementation for example). If we then
look at the ratio between the time taken to run the stack with N+1
interpreters to the time taken to run the stack with just N interpreters, it
appears that the ratio converges to a value of around 24.5 as N increases.
I.e. As N increases, N+1 levels of the interpreter will take about 24.5
times as long as N levels to run the same top level program. It was also
apparent that the limit value of this ratio didn't change when a different
top level program was used (on the very top of the stack).
Later I added a few lines of code to my C interpreter so it would report
exactly how many times each instruction in the UM instruction set was being
executed at the lowest level (at the level where things were "really"
happening!). I collected those numbers with different numbers of um.um in
the stack, then put those values into a spreadsheet where I could experiment
with assign more or less arbitrary, but still fixed positive "costs" to each
instruction. Once again, it seemed that no matter what costs I assigned to
individual instructions, in the end the ratio of the total "cost" for N+1 to
just N levels of um.um running some other top level program still converged
to the same value of around 24.5!
I decided to implement my own simple virtual machine and instruction set,
and then to write a self-interpreter for it (in the same style as um.um) but
with the specific goal of trying to achieve a lower ratio than the 24.5 (in
the limiting case) that um.um produced. I did this with a very simple
machine with (initially) only 6 instructions and finally got my own
self-interpreter running. It showed the same kind of behaviour as um.um when
run as a stack of multiple copies of itself, but with a ratio of about 15.2.
After examining the code for this I added another specialised instruction to
replace a commonly used sequence of three others, and the updated VM and
self-interpreter achieved a ratio of about 11.5. (Still a long way to go to
get anywhere near the "Golden Ratio" that I plucked out of thin air for my
earlier "conjecture" but at least some progress was being made in the right
direction!) Another subsequent smaller tweak to the instruction set reduced
the ratio to about 11.2.
At this stage I decided to try a different approach, and after spending some
time with pen and paper found a tidy way to describe and directly calculate
the number of times each instruction would be executed in order to run a
stack of N self-interpreters plus some other program on top. The solution
was to use matrices.
The structure of my own various self-interpreters was simple. A very short
section of code executed once at start-up, and then a loop was entered that
on each iteration decoded the op code for the next instruction from the
program being interpreted (next level up) before branching to some code to
emulate that instruction (adjusting memory addresses etc., along the way).
Therefore for any instruction being interpreted, I could follow the
execution path through the interpreter and determine exactly which
instructions it executed and also how many times. From this I built a matrix
M with m rows and m columns (each row and column corresponding to one of the
m instructions in the instruction set). Each column of the matrix
corresponds to the interpretation of one instruction from the instruction
set, and the values in that column (reading the rows from top to bottom) are
exactly the number of times each instruction is executed in the current
level of the interpreter in order to handle that task. I.e. the value in row
i and column j is the number of times instruction i has to be executed
during the main loop in order to handle one instance of instruction j in the
code being interpreted.
Next, we need to construct an m by 1 vector (I'll call it P) where each
value is a count of times each instruction is executed during a full run of
the top level program. (The order of the values in this vector must match
the order used for the instructions in the rows and columns in M also. In my
case the op codes were integers from 0 to m-1 and the same ordering was used
in rows/columns of the matrices.)
Thirdly, construct one more vector (S) with the same dimensions as P where
the values are again the numbers of times the various instructions are
executed, but this time covering any code in the start-up section of the
interpreter (before it enters its main loop to handle instructions from the
program being interpreted).
Now (using juxtaposition to represent matrix multiplication), we can say:
P is the number of times instruction is executed to run the top level
program.
MP + S is the number of times each instruction is executed by a
self-interpreter running the top level program.
M(MP + S) + S = MMP + MS + S is the number of times each instruction is
executed by a stack of two self-interpreters running the top level program.
And so on for higher levels...
So far, so good. Now I was able to use the Numpy package for Python (which
provides support for various matrix operations) and after fixing a couple of
errors in my matrices was able to reproduce exactly the same figures for
counts of instructions executed as I had previously achieved but only by
having my own C interpreter track them.
I was pretty happy at that point, having found a nice tidy way to correctly
describe with matrices how to predict the number of times each instruction
would be executed for a given stack of N copies of some self-interpreter
plus some other program on top of that. But now I found myself wondering
what characteristic of the matrix M (ignoring S because it is very small in
comparison and therefore has very little effect on the big picture) produced
the apparently fixed ratios observed as described earlier.
Now it's been a long time since I've had any need to use matrix determinants
and so on but I at least I did vaguely remember the names of a few things of
that nature, if not the actual uses of them! So more or less on a whim I
googled "eigenvalues", found an online calculator, copied and pasted the
contents of my own 'M' matrix, and clicked "calculate". Imagine my surprise
when the first listed eigenvalue matched the (11.20...) ratio I had
observed - to some number of decimal digits!!
To cut a long story slightly shorter, I have now refreshed my memory on
eigenvalues and eigenvectors and have also found the description of the
"Power iterator" algorithm (http://en.wikipedia.org/wiki/Power_iteration)
which seems to provide at least a basis for understanding why repeated
multiplication by the matrix M eventually generates results that are
essentially identical to multiplying by the magnitude of the dominant
eigenvalue of M instead.
In summary, as we let N increase towards infinity, the ratio of total
instructions executed to run N+1 interpreters (plus a some other fixed
program as the very top level), to the total instructions executed when
running just N interpreters (plus the same top-level program), is very close
to the magnitude of the dominant eigenvector for the matrix M that
describes the internals of the interpreter (as described earlier).
The eigenvalue equation (see:
http://en.wikipedia.org/wiki/Eigenvalues#Eigenvalue_equation) tells us that:
M.(eigenvector k) = (eigenvalue k).(eigenvector k)
So (disregarding the effect of S) we can see that as we pile on more and
more interpreters onto the stack, the vector holding the current number of
times each instruction has been executed converges towards the "dominant
eigenvector" for the matrix M (which describes the heart of the interpreter)
and the ratio converges to the magnitude of the corresponding "dominant
eigenvalue".
So far I've just been talking about counts of instructions executed. We can
however easily include different costs for each different instruction in
another 1 by m vector. The dot product of this and the final counts vector
gives the total "cost". But when you calculate the ratio using a formula
that includes the costs it turns out to be easy to see that the terms cancel
and the final ratio is still the same as what got earlier using just the
number of instructions executed.
Where to from here? At one point I was hoping that when I understood the
relationship between a particular matrix M and the corresponding ratio, that
it might help me find the smallest possible ratio - and therefore (in some
sense) describe the "most efficient" self-interpreter possible. However,
clearly M could (at least in a purely hypothetical sense) be the identity
matrix and in that case the ratio is 1. Adding new levels of the interpreter
wouldn't make anything run more slowly because if M equals the identity
matrix (with ones down the leading diagonal and zeroes elsewhere) then we
are also saying each instruction from level N+1 of our stack of programs can
be emulated by a single instruction at level N. That doesn't make any sense
to me without appealing to some kind of magic or perhaps some not yet
discovered type of computing device that can do something for nothing. So
the question of what the lower limit is on the "eigenratio" for any
self-interpreter (and any "computing machine") is still an open question as
far as I can see.
Sorry for my verbosity! And thank you for your attention if you made it this
far. :-)
Clive Gifford
More information about the icfpcontest-discuss
mailing list