[icfp-discuss] "Clive's Conjecture"
Clive Gifford
clive at xtra.co.nz
Tue Aug 1 19:39:52 EDT 2006
Earlier (see end of the message copied below) I posed some questions about
the performance hit for adding another level of a self hosting interpreter
to what was being run.
I've always wanted to have a conjecture with my name on it so here goes:
"For any language capable of 'self-hosting' itself (i.e. the ability to
express an interpreter for itself), then the lowest ratio between the total
cost of using N+1 levels of the self-hosting interpreter and the total cost
of using N levels to run any other program in that language is exactly the
Golden Ratio (1+sqrt(5))/2)!
N is a positive integer and the bottom level 'interpreter' of this stack can
be in any language or device that has a non-zero cost for each fundamental
operation that it can express."
I have a truly marvellous proof of this proposition which this email is too
narrow to contain. :-)
Clive
----- Original Message -----
From: "Clive Gifford" <clive at xtra.co.nz>
To: "Discussion of the 2006 ICFP Programming Contest"
<icfpcontest-discuss at lists.andrew.cmu.edu>
Sent: Wednesday, August 02, 2006 9:48 AM
Subject: Re: [icfp-discuss] A faster UM.
On Tue, 1 Aug 2006 05:40:00 -0400, Jed Davis wrote:
>
> At some point on the Friday of the contest, I thought about trying to
> do dynamic generation of '386 machine code; and then I decided that it
> wouldn't be useful enough and I shouldn't get sidetracked on it then.
>
> So, after the contest ended, I started working on such a device in my
> spare time. My knowledge of the state of the art in JIT compilation
> being approximately nil, I kind of made stuff up as I went along.
>
> That said, on the 3GHz Pentium 4 I wrote of earlier:
>
> ./um86 sandmark.umz 11.72s user 0.03s system 99% cpu 11.773 total
Impressive result Jed! I'd certainly be interested in trying it out if you
were to release it.
I have also have been working on performance improvements to my code,
although at a "higher level" than you! :-)
I ended up with three UM implementations. One in "pure Python" (short and
sweet but very slow), another where most of the grunt work was done in Pyrex
(which compiles to C), and one in hand-coded C. The C version was the one we
eventually used during the contest but only after (stupidly) spending a lot
of time trying to get the others running fast enough. I haven't really done
anything else to improve that C version since. However I did go back and
clean up my Pyrex version (mainly to directly use C style memory allocation)
and it now runs only about 3 or 4 times slower than the hand coded C
version. But the pure Python implementation was a couple of hundred times
slower than the C version, and I've been spent a ridiculous amount of time
since the contest trying to make it a little faster - for no good reason
other than to try to prove it can be done! I've tried a number of different
approaches but mostly they have ended up even slower. So far my best Python
(plus psyco) version is still over 100 times slower than the C version.
My most recent and ambitious Python attempt dynamically generates and
compiles customised code blocks to handle each combination of op code and
values of A, B and C, (and then does some further "bytecode hacks" on the
generated bytecode) in an attempt to remove a level of indirection plus
speed up dispatch of each opcode. Unfortunately this fabulously complicated
version is not yet working!
Still, I've learned a lot about Python's internals along the way. Even
though about using a Python bytecode assembler directly but haven't got that
brave yet.
Anyway, another thing I noticed when playing with the organisers reference
implementation (um.um) was that each new layer on the onion seems to make
the run time about 24 times slower. I.e. running (um.um + um.um +
something.um) is about 24 times slower than running (um.um + something.um).
The factor of 24 seems to be constant regardless of which VM implementation
I use to run the whole thing - although I probably haven't really spent
enough time working through this methodically and carefully..
Which made me wonder....
1. How much effect does the design of the underlying VM interpreter have on
that factor of 24?
2. How much effect does the content of "something.um" have on the slowdown?
3. What is the lowest slowdown per onion layer that is theoretically
possible?
4. How about with different instruction sets?
I've tried to find some answers to these kinds of questions on the net but
without much success. There are probably some easy answers to some at least,
but the layers of recursion make my head hurt! I was hoping someone out
there might be able to shed some light for me.
Cheers
Clive Gifford
More information about the icfpcontest-discuss
mailing list