i've been thinking about a closely related feature in a different context: adding block arguments, as in smalltalk or ruby or especially lobster, to a language more like c, with static types and stack allocation
i think this would be favorable for (among other things) clu-like iterators and imgui libraries, where you often want to do something like
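to make that concrete, here is a sketch, in an invented syntax along the lines of the pseudocode further down, of what a clu-style iterator call and an imgui-style widget scope might look like; all the names here are made up for illustration

```
for_fields rec, f => {       @ clu-style: iterate over the fields of rec
    print f                  @ body runs once per field, with f rebound each time
}
window "settings", w => {    @ imgui-style: open a widget scope
    checkbox w, "dark mode"  @ w is only valid inside the block
}                            @ window tears the scope down when the block exits
```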
this is especially useful in a context where you're heap-allocating sparingly or not at all, because the subroutine taking the block argument can stack-allocate some resource, pass it to the block, and deallocate it once the block returns; python context managers and win32 paint messages are two cases where people commonly do this sort of thing, but things like save-excursion, with-output-file, transactional memory, and gsave/grestore also provide motivation
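the resource-scoping pattern is easy to mimic today with a plain function pointer; here is a minimal c sketch (with_buffer, fill_greeting, and the env parameter are all invented names) in which the resource is a stack-allocated scratch buffer that ceases to exist when the scoped call returns

```c
#include <stdio.h>
#include <string.h>

/* the "block" is modeled as a function pointer plus an env pointer,
   standing in for a closure; with_buffer owns the resource's lifetime */
typedef void (*block_fn)(void *env, char *buf, size_t len);

static void with_buffer(block_fn block, void *env)
{
    char buf[64];                  /* resource lives on with_buffer's stack */
    block(env, buf, sizeof buf);   /* hand it to the block */
}                                  /* returning deallocates it -- no heap */

static void fill_greeting(void *env, char *buf, size_t len)
{
    snprintf(buf, len, "hello");
    *(size_t *)env = strlen(buf);  /* smuggle a result out through env */
}
```

calling `with_buffer(fill_greeting, &n)` leaves 5 in n; the win32 BeginPaint/EndPaint bracket and a python context manager both fit this same shape.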
the conventional way to do this is to package the block up into a closure, then invoke it with a full-fledged function call, using a calling convention that supports closures. but i suspect a more relaxed and efficient approach is an asymmetric coroutine calling convention, in which the callee yields control back to its caller at the entry point of the block, and the block then resumes the callee when it finishes. so instead of merely dividing registers into callee-saved and call-clobbered, as subroutine calling conventions do, we would divide them three ways: registers that are callee-saved on return but on yield contain callee values, which the block must restore before resuming; caller coroutine context registers, which are callee-saved both on return and on yield; and call-clobbered registers. in many cases you also need a way for the block to safely force an early exit from the callee
this allows the caller's local variables to be in registers its blocks can use without further ado, or at least indexed off of such a register, while allowing the yield and resume operations to be, in many cases, just a single machine instruction. and it does not require heap allocation
as an example of taking this to the point of absurdity, here's an untested subroutine for iterating over a nul-terminated string passed in r0 with a block passed in r1, using a hypothetical coroutine convention which passes at least r4 through from its caller to its blocks
and here is another untested subroutine which uses it to calculate a string hash. even in this case, where both the iteration and the visitor block are utterly trivial, the runtime overhead per item (compared to putting them in the same subroutine) is evidently extremely modest; my estimate is 7 cycles per byte rather than 4 cycles per byte on in-order hardware with simple branch prediction, so on the order of 1 ns on the hardware russ used as his reference. for anything more complex the overhead should be insignificant
it's less general than the mechanism russ proposes here (it doesn't solve the celebrated samefringe problem), but it's also an order of magnitude more efficient, because the yield and resume operations are less work than a subroutine call, though still more work than, say, decrementing a register and jumping if nonzero
corrected, tested, and commented version of the above example code
@ hashsz s = {
@ h := 53u
@ itersz s, b => {
@ h = (h >> 27 | h << 5) ^ b
@ }
@ return h
@ }
.syntax unified
.thumb
itersz: push {r6, r7, r8, lr} @ r6 and r7 are preserved by blocks (r8 just keeps the stack 8-byte aligned)
mov r7, r0 @ so put the string pointer in r7
mov r6, r1 @ and the block code pointer in r6
1: ldrb r0, [r7], #1 @ loop over string bytes, post-indexing
cbz r0, 1f @ bail out on NUL, but otherwise
blx r6 @ invoke block argument with byte, then
b 1b @ repeat loop
1: pop {r6, r7, r8, pc} @ restore registers and return
hashsz: push {r4, lr} @ r4 is callee-preserved and passed to blocks
movs r4, #53 @ initial value of hash
adr r1, 1f+1 @ load Thumbified pointer for block using gas local label
bl itersz @ invoke iterator subroutine
mov r0, r4 @ load completed hash into return value
pop {r4, pc} @ restore and return
1: eor r4, r0, r4, ror #27 @ block argument rotates & xors the byte in r0
bx lr @ then resumes the iterator
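for reference, and to pin down the hash the assembly computes, here is the conventional closure-style rendering in c (function pointer plus environment pointer, the approach the text contrasts against); the 32-bit arithmetic matches the pseudocode `h = (h >> 27 | h << 5) ^ b`

```c
#include <stdint.h>

typedef void (*block_fn)(void *env, unsigned char byte);

/* iterate over a nul-terminated string, invoking the block once per byte */
static void itersz(const char *s, block_fn block, void *env)
{
    while (*s)
        block(env, (unsigned char)*s++);
}

/* the block: rotate h left by 5, then xor in the byte */
static void hash_step(void *env, unsigned char byte)
{
    uint32_t *h = env;
    *h = ((*h >> 27) | (*h << 5)) ^ byte;
}

static uint32_t hashsz(const char *s)
{
    uint32_t h = 53u;              /* h lives on hashsz's stack frame */
    itersz(s, hash_step, &h);      /* the env pointer stands in for r4 */
    return h;
}
```

as a sanity check, hashsz("") is 53, and hashsz("a") is (53 << 5) ^ 'a' = 1696 ^ 97 = 1729.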