From icfp at jdev.users.panix.com Tue Aug 1 05:40:00 2006 From: icfp at jdev.users.panix.com (Jed Davis) Date: Tue Aug 1 05:40:05 2006 Subject: [icfp-discuss] A faster UM. Message-ID: <20060801094000.GA19452@panix.com> At some point on the Friday of the contest, I thought about trying to do dynamic generation of '386 machine code; and then I decided that it wouldn't be useful enough and I shouldn't get sidetracked on it then. So, after the contest ended, I started working on such a device in my spare time. My knowledge of the state of the art in JIT compilation being approximately nil, I kind of made stuff up as I went along. That said, on the 3GHz Pentium 4 I wrote of earlier: ./um86 sandmark.umz 11.72s user 0.03s system 99% cpu 11.773 total Important maneuvers here were: mapping array 0's data at virtual address 0, so "array index" could be inlined; doing exciting things in a SIGSEGV handler, so amendment could be inlined (except for pages that modify themselves); and of course writing and tuning a custom allocator (especially on Mac OS X, whose malloc seemed substantially worse than, say, NetBSD's). What proved less effective, to my dismay, were my attempts at optimizing the generated code; there was only so much that could be done within the confines of translating each UM insn into exactly 12 bytes of '386 code, and allowing branches to arbitrary locations of same. Without those, the time rises to all of 12.4s on that P4; other flavors of CPU seem to appreciate these efforts a little more, but it's still not that hot. I suppose I ought to think about releasing this thing, but it'd need a little work on portability, and perhaps a small README, first. -- (let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map ((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l)))))) (lambda (f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l)) (C k))))))) '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline))))) From volezheng at gmail.com Tue Aug 1 09:22:36 2006 From: volezheng at gmail.com (zheng hao) Date: Tue Aug 1 14:56:24 2006 Subject: [icfp-discuss] "howie" adventure problem Message-ID: Firstly, the contest is really fantastic! I'm still working and have fun with it. But a problem stucked me. I can't proceed when I get into the Science Museum after I use the uploader. Can anyboby tell me how to get into the History of Sci&Tech Exhibit? -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.andrew.cmu.edu/pipermail/icfpcontest-discuss/attachments/20060801/29f1fba9/attachment.html From bluelive at xs4all.nl Tue Aug 1 17:07:39 2006 From: bluelive at xs4all.nl (Bart van der Werf) Date: Tue Aug 1 17:07:45 2006 Subject: [icfp-discuss] A faster UM. In-Reply-To: <20060801094000.GA19452@panix.com> Message-ID: <000001c6b5ae$85cbf0f0$4101a8c0@andromeda> Great project :) How did you handle the issue with self modifying um code ? -----Original Message----- From: icfpcontest-discuss-bounces@lists.andrew.cmu.edu [mailto:icfpcontest-discuss-bounces@lists.andrew.cmu.edu] On Behalf Of Jed Davis Sent: Tuesday, August 01, 2006 11:40 AM To: icfpcontest-discuss@lists.andrew.cmu.edu Subject: [icfp-discuss] A faster UM. At some point on the Friday of the contest, I thought about trying to do dynamic generation of '386 machine code; and then I decided that it wouldn't be useful enough and I shouldn't get sidetracked on it then. So, after the contest ended, I started working on such a device in my spare time. My knowledge of the state of the art in JIT compilation being approximately nil, I kind of made stuff up as I went along. That said, on the 3GHz Pentium 4 I wrote of earlier: ./um86 sandmark.umz 11.72s user 0.03s system 99% cpu 11.773 total Important maneuvers here were: mapping array 0's data at virtual address 0, so "array index" could be inlined; doing exciting things in a SIGSEGV handler, so amendment could be inlined (except for pages that modify themselves); and of course writing and tuning a custom allocator (especially on Mac OS X, whose malloc seemed substantially worse than, say, NetBSD's). What proved less effective, to my dismay, were my attempts at optimizing the generated code; there was only so much that could be done within the confines of translating each UM insn into exactly 12 bytes of '386 code, and allowing branches to arbitrary locations of same. Without those, the time rises to all of 12.4s on that P4; other flavors of CPU seem to appreciate these efforts a little more, but it's still not that hot. I suppose I ought to think about releasing this thing, but it'd need a little work on portability, and perhaps a small README, first. -- (let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map ((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l)))))) (lambda (f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l)) (C k))))))) '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline))))) _______________________________________________ icfpcontest-discuss mailing list icfpcontest-discuss@lists.andrew.cmu.edu https://lists.andrew.cmu.edu/mailman/listinfo/icfpcontest-discuss From clive at xtra.co.nz Tue Aug 1 17:48:43 2006 From: clive at xtra.co.nz (Clive Gifford) Date: Tue Aug 1 17:49:16 2006 Subject: [icfp-discuss] A faster UM. References: <20060801094000.GA19452@panix.com> Message-ID: <004401c6b5b4$47099970$0301a8c0@Owhango1> On Tue, 1 Aug 2006 05:40:00 -0400, Jed Davis wrote: > > At some point on the Friday of the contest, I thought about trying to > do dynamic generation of '386 machine code; and then I decided that it > wouldn't be useful enough and I shouldn't get sidetracked on it then. > > So, after the contest ended, I started working on such a device in my > spare time. My knowledge of the state of the art in JIT compilation > being approximately nil, I kind of made stuff up as I went along. > > That said, on the 3GHz Pentium 4 I wrote of earlier: > > ./um86 sandmark.umz 11.72s user 0.03s system 99% cpu 11.773 total Impressive result Jed! I'd certainly be interested in trying it out if you were to release it. I have also have been working on performance improvements to my code, although at a "higher level" than you! :-) I ended up with three UM implementations. One in "pure Python" (short and sweet but very slow), another where most of the grunt work was done in Pyrex (which compiles to C), and one in hand-coded C. The C version was the one we eventually used during the contest but only after (stupidly) spending a lot of time trying to get the others running fast enough. I haven't really done anything else to improve that C version since. However I did go back and clean up my Pyrex version (mainly to directly use C style memory allocation) and it now runs only about 3 or 4 times slower than the hand coded C version. But the pure Python implementation was a couple of hundred times slower than the C version, and I've been spent a ridiculous amount of time since the contest trying to make it a little faster - for no good reason other than to try to prove it can be done! I've tried a number of different approaches but mostly they have ended up even slower. So far my best Python (plus psyco) version is still over 100 times slower than the C version. My most recent and ambitious Python attempt dynamically generates and compiles customised code blocks to handle each combination of op code and values of A, B and C, (and then does some further "bytecode hacks" on the generated bytecode) in an attempt to remove a level of indirection plus speed up dispatch of each opcode. Unfortunately this fabulously complicated version is not yet working! Still, I've learned a lot about Python's internals along the way. Even though about using a Python bytecode assembler directly but haven't got that brave yet. Anyway, another thing I noticed when playing with the organisers reference implementation (um.um) was that each new layer on the onion seems to make the run time about 24 times slower. I.e. running (um.um + um.um + something.um) is about 24 times slower than running (um.um + something.um). The factor of 24 seems to be constant regardless of which VM implementation I use to run the whole thing - although I probably haven't really spent enough time working through this methodically and carefully.. Which made me wonder.... 1. How much effect does the design of the underlying VM interpreter have on that factor of 24? 2. How much effect does the content of "something.um" have on the slowdown? 3. What is the lowest slowdown per onion layer that is theoretically possible? 4. How about with different instruction sets? I've tried to find some answers to these kinds of questions on the net but without much success. There are probably some easy answers to some at least, but the layers of recursion make my head hurt! I was hoping someone out there might be able to shed some light for me. Cheers Clive Gifford From clive at xtra.co.nz Tue Aug 1 19:39:52 2006 From: clive at xtra.co.nz (Clive Gifford) Date: Tue Aug 1 19:40:23 2006 Subject: [icfp-discuss] "Clive's Conjecture" Message-ID: <00ac01c6b5c3$cec36580$0301a8c0@Owhango1> Earlier (see end of the message copied below) I posed some questions about the performance hit for adding another level of a self hosting interpreter to what was being run. I've always wanted to have a conjecture with my name on it so here goes: "For any language capable of 'self-hosting' itself (i.e. the ability to express an interpreter for itself), then the lowest ratio between the total cost of using N+1 levels of the self-hosting interpreter and the total cost of using N levels to run any other program in that language is exactly the Golden Ratio (1+sqrt(5))/2)! N is a positive integer and the bottom level 'interpreter' of this stack can be in any language or device that has a non-zero cost for each fundamental operation that it can express." I have a truly marvellous proof of this proposition which this email is too narrow to contain. :-) Clive ----- Original Message ----- From: "Clive Gifford" To: "Discussion of the 2006 ICFP Programming Contest" Sent: Wednesday, August 02, 2006 9:48 AM Subject: Re: [icfp-discuss] A faster UM. On Tue, 1 Aug 2006 05:40:00 -0400, Jed Davis wrote: > > At some point on the Friday of the contest, I thought about trying to > do dynamic generation of '386 machine code; and then I decided that it > wouldn't be useful enough and I shouldn't get sidetracked on it then. > > So, after the contest ended, I started working on such a device in my > spare time. My knowledge of the state of the art in JIT compilation > being approximately nil, I kind of made stuff up as I went along. > > That said, on the 3GHz Pentium 4 I wrote of earlier: > > ./um86 sandmark.umz 11.72s user 0.03s system 99% cpu 11.773 total Impressive result Jed! I'd certainly be interested in trying it out if you were to release it. I have also have been working on performance improvements to my code, although at a "higher level" than you! :-) I ended up with three UM implementations. One in "pure Python" (short and sweet but very slow), another where most of the grunt work was done in Pyrex (which compiles to C), and one in hand-coded C. The C version was the one we eventually used during the contest but only after (stupidly) spending a lot of time trying to get the others running fast enough. I haven't really done anything else to improve that C version since. However I did go back and clean up my Pyrex version (mainly to directly use C style memory allocation) and it now runs only about 3 or 4 times slower than the hand coded C version. But the pure Python implementation was a couple of hundred times slower than the C version, and I've been spent a ridiculous amount of time since the contest trying to make it a little faster - for no good reason other than to try to prove it can be done! I've tried a number of different approaches but mostly they have ended up even slower. So far my best Python (plus psyco) version is still over 100 times slower than the C version. My most recent and ambitious Python attempt dynamically generates and compiles customised code blocks to handle each combination of op code and values of A, B and C, (and then does some further "bytecode hacks" on the generated bytecode) in an attempt to remove a level of indirection plus speed up dispatch of each opcode. Unfortunately this fabulously complicated version is not yet working! Still, I've learned a lot about Python's internals along the way. Even though about using a Python bytecode assembler directly but haven't got that brave yet. Anyway, another thing I noticed when playing with the organisers reference implementation (um.um) was that each new layer on the onion seems to make the run time about 24 times slower. I.e. running (um.um + um.um + something.um) is about 24 times slower than running (um.um + something.um). The factor of 24 seems to be constant regardless of which VM implementation I use to run the whole thing - although I probably haven't really spent enough time working through this methodically and carefully.. Which made me wonder.... 1. How much effect does the design of the underlying VM interpreter have on that factor of 24? 2. How much effect does the content of "something.um" have on the slowdown? 3. What is the lowest slowdown per onion layer that is theoretically possible? 4. How about with different instruction sets? I've tried to find some answers to these kinds of questions on the net but without much success. There are probably some easy answers to some at least, but the layers of recursion make my head hurt! I was hoping someone out there might be able to shed some light for me. Cheers Clive Gifford From jonathan.roewen at gmail.com Tue Aug 1 19:50:39 2006 From: jonathan.roewen at gmail.com (Jonathan Roewen) Date: Tue Aug 1 19:50:45 2006 Subject: [icfp-discuss] A faster UM. In-Reply-To: <000001c6b5ae$85cbf0f0$4101a8c0@andromeda> References: <20060801094000.GA19452@panix.com> <000001c6b5ae$85cbf0f0$4101a8c0@andromeda> Message-ID: > How did you handle the issue with self modifying um code ? Testing codex.umz, umix, and sandmark.umz, neither of these use self-modifying code. At most, they load a new sub-program. Jonathan From clive at xtra.co.nz Tue Aug 1 19:57:35 2006 From: clive at xtra.co.nz (Clive Gifford) Date: Tue Aug 1 19:58:02 2006 Subject: [icfp-discuss] "Clive's Conjecture" References: <00ac01c6b5c3$cec36580$0301a8c0@Owhango1> Message-ID: <00b301c6b5c6$45971880$0301a8c0@Owhango1> Slightly updated version to make it clearer... "For any language capable of 'self-hosting' itself (i.e. having the ability to express an interpreter for itself), then the lowest ratio possible between the total cost of using N+1 levels of any self-hosting interpreter in that language and the total cost of using N levels of the same interpreter, to run any other program in that language, cannot be less than the Golden Ratio (1+sqrt(5))/2)! N is a positive integer and the bottom level 'interpreter' of this stack can be in any language or device that has a non-zero cost for each fundamental operation that it can express." From ganesh at earth.li Tue Aug 1 20:01:04 2006 From: ganesh at earth.li (Ganesh Sittampalam) Date: Tue Aug 1 20:01:08 2006 Subject: [icfp-discuss] A faster UM. In-Reply-To: <20060801094000.GA19452@panix.com> References: <20060801094000.GA19452@panix.com> Message-ID: On Tue, 1 Aug 2006, Jed Davis wrote: > there was only so much that could be done within the confines of > translating each UM insn into exactly 12 bytes of '386 code, and > allowing branches to arbitrary locations of same I guess the "exactly 12 bytes" restriction is so you can compute branches quickly? How about having two versions of the code. One that obeys this restriction, and another that allows both variable length and cross-instruction optimisations. Every so often in the optimised version, you can have a "resync point" across which optimisations are disabled. Then replace the corresponding instruction in the unoptimised version with a branch to this point. Branches in the code being executed would always jump to the unoptimised version, and execution would continue in that until a resync point was reached, when it would switch to the optimised version. As a further optimisation, you could scan the code for branches to absolute locations, use those locations to guide your choice of resync points, and then make the branches go straight back to the optimised code. Cheers, Ganesh From chris at chr-breitkopf.de Wed Aug 2 04:18:15 2006 From: chris at chr-breitkopf.de (Christoph Breitkopf) Date: Wed Aug 2 04:18:25 2006 Subject: [icfp-discuss] A faster UM. In-Reply-To: <20060801094000.GA19452@panix.com> References: <20060801094000.GA19452@panix.com> Message-ID: Jed Davis writes: > That said, on the 3GHz Pentium 4 I wrote of earlier: > > ./um86 sandmark.umz 11.72s user 0.03s system 99% cpu 11.773 total Great results. > What proved less effective, to my dismay, were my attempts at optimizing > the generated code; there was only so much that could be done within the > confines of translating each UM insn into exactly 12 bytes of '386 code, > and allowing branches to arbitrary locations of same. Without those, > the time rises to all of 12.4s on that P4; other flavors of CPU seem to > appreciate these efforts a little more, but it's still not that hot. How about targeting x86_64 instead, and using r8 - r15 directly for the UM registers? This makes for some very nice instruction mappings (add translates directly to LEA, for example). If you want room for self-modifying code, you'll get lots of nops, though. Regards, Chris From windenntw at gmail.com Wed Aug 2 13:40:54 2006 From: windenntw at gmail.com (Antonio Vargas) Date: Wed Aug 2 13:40:57 2006 Subject: [icfp-discuss] A faster UM. In-Reply-To: References: <20060801094000.GA19452@panix.com> Message-ID: <69304d110608021040x5bf2cf57x6bfa06350f08d2ca@mail.gmail.com> On 02 Aug 2006 10:18:15 +0200, Christoph Breitkopf wrote: > Jed Davis writes: > > That said, on the 3GHz Pentium 4 I wrote of earlier: > > > > ./um86 sandmark.umz 11.72s user 0.03s system 99% cpu 11.773 total > > Great results. > > > What proved less effective, to my dismay, were my attempts at optimizing > > the generated code; there was only so much that could be done within the > > confines of translating each UM insn into exactly 12 bytes of '386 code, > > and allowing branches to arbitrary locations of same. Without those, > > the time rises to all of 12.4s on that P4; other flavors of CPU seem to > > appreciate these efforts a little more, but it's still not that hot. > > How about targeting x86_64 instead, and using r8 - r15 directly for > the UM registers? This makes for some very nice instruction mappings (add > translates directly to LEA, for example). If you want room for self-modifying > code, you'll get lots of nops, though. > I've just also started to code a UM-to-PPC JIT... for solving the problem of code addressing for jump, I'm leaning towards just encode simple instructions into their corresponding ppc one (thus get directly 1:1 size ratio), and then for complex UM instructions such as conditional move, just make a subroutine call to a external function which is different for each register combo, so that inline code is always 4bytes-per-um-instruction but can execute anything else needed such as allocs and the like. For better speed, I'm using physical ppc registers for um ones, so that there is no need to constantly spill to memory while computing. -- Greetz, Antonio Vargas aka winden of network http://network.amigascne.org/ windNOenSPAMntw@gmail.com thesameasabove@amigascne.org Every day, every year you have to work you have to study you have to scene. From popiel at wolfskeep.com Wed Aug 2 15:32:11 2006 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Aug 2 15:32:15 2006 Subject: [icfp-discuss] A faster UM. In-Reply-To: References: <20060801094000.GA19452@panix.com> <000001c6b5ae$85cbf0f0$4101a8c0@andromeda> Message-ID: <20060802193211.8F9EE2DDD4@cashew.wolfskeep.com> In message: "Jonathan Roewen" writes: >> How did you handle the issue with self modifying um code ? > >Testing codex.umz, umix, and sandmark.umz, neither of these use >self-modifying code. > >At most, they load a new sub-program. This feels like an unsatisfactory answer... and it seems that supporting self-modifying code wouldn't be all that expensive. All you need to do is for amendments to the '0' array, replace the compiled code corresponding to that platter with a call to recompile the '0' array and resume at that position (that is, replace the compiled code with the equivalent of a Load Program for a non-compiled array). This probably doubles the cost of array amendment to arrays other than the '0' array, and maybe quadruples the cost of amendment to '0'... but it then fully supports the UM language. Further, array amendments that can be proved to not affect the '0' array (through dataflow analysis, etc) can skip the extra checks and work... - Alex From popiel at wolfskeep.com Wed Aug 2 15:39:22 2006 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Aug 2 15:39:24 2006 Subject: [icfp-discuss] "Clive's Conjecture" In-Reply-To: <00b301c6b5c6$45971880$0301a8c0@Owhango1> References: <00ac01c6b5c3$cec36580$0301a8c0@Owhango1> <00b301c6b5c6$45971880$0301a8c0@Owhango1> Message-ID: <20060802193922.3C6552DDAB@cashew.wolfskeep.com> In message: <00b301c6b5c6$45971880$0301a8c0@Owhango1> "Clive Gifford" writes: >Slightly updated version to make it clearer... > >"For any language capable of 'self-hosting' itself (i.e. having the ability >to express an interpreter for itself), then the lowest ratio possible >between the total cost of using N+1 levels of any self-hosting interpreter >in that language and the total cost of using N levels of the same >interpreter, to run any other program in that language, cannot be less than >the Golden Ratio (1+sqrt(5))/2)! > >N is a positive integer and the bottom level 'interpreter' of this stack can >be in any language or device that has a non-zero cost for each fundamental >operation that it can express." I think that JITs falsify this conjecture; the purpose of a JIT is to make the hosting cost proportional to code size, not execution time. A JIT hosted by itself would get compiled, and then would run at (about) the same speed as the original... effectively unwinding the N+1 back down to N. Of course, I suppose you could argue that a JIT is a compiler, not an interpreter, and thus the conjecture doesn't apply to it... - Alex From crary at cs.cmu.edu Wed Aug 2 16:23:59 2006 From: crary at cs.cmu.edu (Karl Crary) Date: Wed Aug 2 16:24:02 2006 Subject: [icfp-discuss] "howie" adventure problem In-Reply-To: References: Message-ID: <44D109DF.3010403@cs.cmu.edu> You will need to decipher the blueprint before you can get in there. Karl Crary zheng hao wrote: > Firstly, the contest is really fantastic! I'm still working and have > fun with it. > But a problem stucked me. I can't proceed when I get into the Science > Museum after I use the uploader. Can anyboby tell me how to get into > the History of Sci&Tech Exhibit? > ------------------------------------------------------------------------ > > _______________________________________________ > icfpcontest-discuss mailing list > icfpcontest-discuss@lists.andrew.cmu.edu > https://lists.andrew.cmu.edu/mailman/listinfo/icfpcontest-discuss > From clive at xtra.co.nz Wed Aug 2 17:33:57 2006 From: clive at xtra.co.nz (Clive Gifford) Date: Wed Aug 2 17:34:35 2006 Subject: [icfp-discuss] "Clive's Conjecture" References: <00ac01c6b5c3$cec36580$0301a8c0@Owhango1><00b301c6b5c6$45971880$0301a8c0@Owhango1> <20060802193922.3C6552DDAB@cashew.wolfskeep.com> Message-ID: <006601c6b67b$60c69ee0$0301a8c0@Owhango1> "T. Alexander Popiel" wrote: > > I think that JITs falsify this conjecture; the purpose of a JIT is > to make the hosting cost proportional to code size, not execution > time. A JIT hosted by itself would get compiled, and then would > run at (about) the same speed as the original... effectively unwinding > the N+1 back down to N. > > Of course, I suppose you could argue that a JIT is a compiler, not > an interpreter, and thus the conjecture doesn't apply to it... Yes - I am saying you must be running multiple copies of an identical "interpreter" simultaneously, each copy directly interpreting the next level up the stack. "Interpreters" using JIT compilation are still okay, but each level of the interpreter also has to directly interpret the output of the JIT compiler from the next level up (including the cost of compiling as well). The level N interpreter cannot "cheat" by directly replacing itself with the next level up (if that is possible) or by passing the next level up back down to some other "interpreter" at a lower level (e.g. the CPU of your real machine), etc. Perhaps the best way to summarise this is to say that each interpreter cannot access any lower level interpreter or the underlying machine CPU directly in any way at all. It has to run completely in the environment (VM) created by the underlying layer. I've been trying to think of (or dream up) some candidate languages to experiment with - preferably very simple languages that are still capable of "efficient" self-hosting in the sense described. Any suggestions? From bruce.hoult at gmail.com Wed Aug 2 17:45:36 2006 From: bruce.hoult at gmail.com (Bruce Hoult) Date: Wed Aug 2 17:45:39 2006 Subject: [icfp-discuss] "Clive's Conjecture" In-Reply-To: <006601c6b67b$60c69ee0$0301a8c0@Owhango1> References: <00ac01c6b5c3$cec36580$0301a8c0@Owhango1> <00b301c6b5c6$45971880$0301a8c0@Owhango1> <20060802193922.3C6552DDAB@cashew.wolfskeep.com> <006601c6b67b$60c69ee0$0301a8c0@Owhango1> Message-ID: <391c06200608021445l29057e4cg6a28007a1d0c702a@mail.gmail.com> On 8/3/06, Clive Gifford wrote: > I've been trying to think of (or dream up) some candidate languages to > experiment with - preferably very simple languages that are still capable of > "efficient" self-hosting in the sense described. Any suggestions? If you come up with a good Scheme do let us know! btw, which part of NZ are you in? From clive at xtra.co.nz Wed Aug 2 17:57:01 2006 From: clive at xtra.co.nz (Clive Gifford) Date: Wed Aug 2 17:57:35 2006 Subject: [icfp-discuss] "Clive's Conjecture" Message-ID: <007b01c6b67e$97ee7700$0301a8c0@Owhango1> "Bruce Hoult" wrote: > On 8/3/06, Clive Gifford wrote: > > I've been trying to think of (or dream up) some candidate languages to > > experiment with - preferably very simple languages that are still capable of > > "efficient" self-hosting in the sense described. Any suggestions? > > If you come up with a good Scheme do let us know! > > btw, which part of NZ are you in? ...good Scheme... hmmm... okay! Exactly 39 South, 175 22' 45" East (NZ Geodetic Datum 1949)! From bruce.hoult at gmail.com Wed Aug 2 18:19:53 2006 From: bruce.hoult at gmail.com (Bruce Hoult) Date: Wed Aug 2 18:19:55 2006 Subject: [icfp-discuss] "Clive's Conjecture" In-Reply-To: <007b01c6b67e$97ee7700$0301a8c0@Owhango1> References: <007b01c6b67e$97ee7700$0301a8c0@Owhango1> Message-ID: <391c06200608021519g76100d30hc01d3df11d569022@mail.gmail.com> On 8/3/06, Clive Gifford wrote: > Exactly 39 South, 175 22' 45" East (NZ Geodetic Datum 1949)! I'm glad you specified the datum, as otherwise I would have thought you were in a paddock outside Owhango! From shana.ufie at gmail.com Wed Aug 2 19:48:26 2006 From: shana.ufie at gmail.com (Andreia Gaita) Date: Wed Aug 2 19:48:29 2006 Subject: [icfp-discuss] A faster UM. In-Reply-To: References: <20060801094000.GA19452@panix.com> Message-ID: <3ec1038d0608021648g2c363dffhf88029d600f911a4@mail.gmail.com> On 02 Aug 2006 10:18:15 +0200, Christoph Breitkopf wrote: > Jed Davis writes: > > That said, on the 3GHz Pentium 4 I wrote of earlier: > > > > ./um86 sandmark.umz 11.72s user 0.03s system 99% cpu 11.773 total > > Great results. Very! > How about targeting x86_64 instead, and using r8 - r15 directly for > the UM registers? This makes for some very nice instruction mappings (add > translates directly to LEA, for example). If you want room for self-modifying > code, you'll get lots of nops, though. That was exactly what I thought as soon as I read the description of the language, and that's what I'm playing with right now with my amd. It's a pity a day is only 24 hours long... I'm doing an um implementation in assembler (fasm) 32-bits, using only registers for processing opcodes, the only memory access being done is to allocate memory. I have absolutely no idea what kind of times I'll get when I'm done, but it's been fun so far! :) shana gone crazy, back soon, leave message From jonathan.roewen at gmail.com Wed Aug 2 19:49:00 2006 From: jonathan.roewen at gmail.com (Jonathan Roewen) Date: Wed Aug 2 19:49:03 2006 Subject: [icfp-discuss] A faster UM. In-Reply-To: <20060802193211.8F9EE2DDD4@cashew.wolfskeep.com> References: <20060801094000.GA19452@panix.com> <000001c6b5ae$85cbf0f0$4101a8c0@andromeda> <20060802193211.8F9EE2DDD4@cashew.wolfskeep.com> Message-ID: > This feels like an unsatisfactory answer... and it seems that > supporting self-modifying code wouldn't be all that expensive. For a JIT, no it wouldn't. I'm curious of Jed's implementation: how much gets JIT'd at once? The whole program? A fixed size block? For a compiler such as I have been attempting, self-modifying code comes at a greater cost as the need for fixed size operations become the greater goal in the design. Jonathan From icfp at jdev.users.panix.com Fri Aug 4 23:13:04 2006 From: icfp at jdev.users.panix.com (Jed Davis) Date: Fri Aug 4 23:13:08 2006 Subject: [icfp-discuss] A faster UM. In-Reply-To: <000001c6b5ae$85cbf0f0$4101a8c0@andromeda> References: <20060801094000.GA19452@panix.com> <000001c6b5ae$85cbf0f0$4101a8c0@andromeda> Message-ID: <20060805031304.GA29642@panix.com> On Tue, Aug 01, 2006 at 11:07:39PM +0200, Bart van der Werf wrote: > Great project :) > > How did you handle the issue with self modifying um code ? Currently, every UM instruction gets translated to exactly 12 bytes of machine code. This isn't so good for performance, but it does work. Without -DUM_OSAMEND (i.e., the various features I've been stacking on top are enabled with preprocessor defines), amendments become out-of-line calls to a routine that, if the segment is 0 and the "page" has been translated, updates that one instruction. But that's kind of slow; probably slower than a good interpreter. So, with -DUM_OSAMEND, that "page" is (at least) one VM page of data (1024 platters, since this is i386), and when it's translated the data has its write protection removed. Thus, amendment can be translated into inline code. When something tries to write it, that raises SIGSEGV, and I can get the instruction pointer (see also: SA_SIGINFO, uc_mcontext) and faulting data address, and either flip the written page back to the untranslated state (clearing the protection bits on the code), or (if the page tried to modify itself) retranslate it with the indirect amendments and make the data writable. The problem with this is, of course, how it deals with fine-grained interleaving of code and data -- by considering it self-modifying and dropping it down to "an interpreter would be faster" speeds. For the SANDmark this doesn't appear matter, but for (e.g.) the BLACK verifier it definitely does (as I discovered this morning). (Really, I'm still a little amazed that it works at all; this is the first program I've written that does anything like this.) So I've been slowly planning something more like an actual compiler, which ought to be able to take care of this (among other things) -- the general idea there being to compile an optimized version with certain assumptions about data and control flow, and allow entering it from the 12-byte-code only when and where they're met. -- (let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map ((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l)))))) (lambda (f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l)) (C k))))))) '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline))))) From icfp at jdev.users.panix.com Fri Aug 4 23:32:26 2006 From: icfp at jdev.users.panix.com (Jed Davis) Date: Fri Aug 4 23:32:28 2006 Subject: [icfp-discuss] A faster UM. In-Reply-To: References: <20060801094000.GA19452@panix.com> Message-ID: <20060805033226.GB29642@panix.com> On Wed, Aug 02, 2006 at 01:01:04AM +0100, Ganesh Sittampalam wrote: > On Tue, 1 Aug 2006, Jed Davis wrote: > > >there was only so much that could be done within the confines of > >translating each UM insn into exactly 12 bytes of '386 code, and > >allowing branches to arbitrary locations of same > > I guess the "exactly 12 bytes" restriction is so you can compute branches > quickly? Some of that; and also so that individual instructions could, before I added on the VM trapping and such, be replaced individually in the case of self-modification (real or apparent). > How about having two versions of the code. One that obeys this > restriction, and another that allows both variable length and > cross-instruction optimisations. Every so often in the optimised version, > you can have a "resync point" across which optimisations are disabled. > Then replace the corresponding instruction in the unoptimised version with > a branch to this point. This is pretty much what I converged on independently while, uh, not checking my ICFP mailbox the past few days -- except that I'll be extracting assumptions from the initial register state, like "r6 is always 0", which I expect to hold over all uses of the code, and checking them as a condition for entry into the fast path. (I hate writing in the future tense like that, because then if it doesn't work for some reason....) -- (let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map ((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l)))))) (lambda (f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l)) (C k))))))) '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline)))))