Thursday, August 4, 2022

How does Rust async work?

This may be obvious to some folks, but it’s not been clear to me: how does Rust’s async / await facility actually work, down at the machine-code level? Why does the stack look the way it does when I backtrace? How will APIs like alloca interact with async code? This always nagged at me, so I’ve dug at it, and wrote this up to benefit those of you who’ve had similar questions. Going right in:

Calling async code

During compilation, an async function emits three identifiers to the LLVM layer:

  1. A data structure defining the data operated on by the asynchronous code;
  2. A private function for the asynchronous function body;
  3. A public (within Rust’s code visibility model) function, that returns the initialized data for the Future. This is the function that gets directly invoked by the compiler.

The first and second items above form a closure, and are implemented using a (currently experimental) coroutine facility. If you consider this example code:

async fn foo() {}

Then the control flow for running foo’s asynchronous code with the world’s second-dumbest async runtime (which just spins on a single future in a loop until it’s ready)1 will look similar to this:

struct FooGenerator { /* not shown */ }
fn foo() -> impl Future<Output = ()> {
    std::future::from_generator(FooGenerator {})
}
fn run_foo() {
    let mut cx = _; // shown later.
    // Obtain the future wrapping `foo`'s "generator" state.
    let mut foo_future = foo();
    let foo_result = loop {
        match unsafe { Pin::new_unchecked(&mut foo_future) }.poll(&mut cx) {
            Poll::Pending => (),
            Poll::Ready(val) => break val,
        }
    };
}
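The `cx` placeholder above is a `Context`, which wraps a `Waker`. For a busy-loop runtime that never sleeps, a waker that does nothing is sufficient, so we can make the spin loop fully concrete. A sketch (`noop_waker` and `block_on` are my names, not std's; I use `Box::pin` to sidestep the unsafe `Pin::new_unchecked`):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Build a Waker whose wake() does nothing: good enough for a runtime
// that busy-polls the future instead of sleeping until it is woken.
fn noop_waker() -> Waker {
    fn no_op(_: *const ()) {}
    fn clone(_: *const ()) -> RawWaker {
        raw()
    }
    fn raw() -> RawWaker {
        static VTABLE: RawWakerVTable =
            RawWakerVTable::new(clone, no_op, no_op, no_op);
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    unsafe { Waker::from_raw(raw()) }
}

// The spin-in-a-loop runtime from above, made concrete.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    // Pinning on the heap avoids unsafe Pin::new_unchecked.
    let mut fut = Box::pin(fut);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Pending => continue,
            Poll::Ready(val) => break val,
        }
    }
}

fn main() {
    async fn foo() -> u32 {
        40 + 2
    }
    assert_eq!(block_on(foo()), 42);
}
```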

As I write this, the closure is a type of coroutine called a “generator”, which is translated to a future via the from_generator function. For an async function, the generator “yields” a meaningless value of () until the future is ready (i.e. until the async body has run to completion, and returned a value). A yield () indicates that the coroutine can’t make forward progress at present: it is how async functions implement blocking. (This allows the current thread to do something else – or even pause – until it’s time to poll the future again.) When you see a let y = x.await in the body of an async function, think of that as expanding into something like:

    let y = loop {
        match unsafe { Pin::new_unchecked(&mut x) }.poll(&mut cx) {
            Poll::Pending => {
                // Give up the CPU to our caller.
                yield ();
                // When the generator is resumed, it resumes here.
            }
            Poll::Ready(val) => break val,
        }
    };

Async Data

The data operated on by the generator is a Rust enum, in which each variant corresponds to a coroutine “yield”. For example, take this code:

async fn bar(var1: usize) -> usize {
   let var2: &usize = &var1;
   let var3 = baz().await;
   *var2 + var3
}

You can imagine the data being stored as a Rust enum, something like this:

enum BarGenerator<'a> {
    Variant0 { var1: usize },
    Variant1 {
        var1: usize,
        var2: &'a usize,
        await_point: GenFuture<BazGenerator>, // BazGenerator not shown
    },
    Variant2 {
        var1: usize,
        var2: &'a usize,
        var3: usize,
    }
}
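Because the whole state machine is a single enum, the future returned by `bar` has a size known at compile time (roughly, the largest variant plus a discriminant). Unlike the sketches above, this version is real, compilable async Rust (my `baz` stand-in just returns a constant):

```rust
// Stand-in for the `baz` used above; any async fn returning usize works.
async fn baz() -> usize {
    7
}

// Same body as `bar` above.
async fn bar(var1: usize) -> usize {
    let var2: &usize = &var1;
    let var3 = baz().await;
    *var2 + var3
}

fn main() {
    // The returned future is the whole state machine; its size is
    // fixed at compile time, and we can inspect it at runtime.
    let fut = bar(1);
    println!("bar's state machine is {} bytes", std::mem::size_of_val(&fut));
}
```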

Generator yield expressions

I characterized the expansion of let val = future.await; above as replacing the await statement with a loop that polls the future, and issues a yield (); statement while the future is Pending. If we think of this function again:

async fn bar(var1: usize) -> usize {
   let var2: &usize = &var1;
   let var3 = baz().await;
   *var2 + var3
}

We now have the tools we need to create a functionally equivalent version in async-free Rust:

enum BarGenerator<'a> {
    // shown above
}
unsafe fn bar_closure(vars: Pin<&mut BarGenerator>, cx: &mut Context<'_>) ->
    GeneratorState<() /* yield type */, usize /* result type */>
{
    loop {
        match *vars {
            Variant0 { ref var1 } => {
                let var2 = &var1;
                let await_point = baz();
                *vars = Variant1 { var1, var2, await_point };
            }
            Variant1 { ref var1, ref var2, ref mut await_point } => {
                match Pin::new_unchecked(await_point).poll(cx) {
                    Poll::Pending => return GeneratorState::Yielded(()),
                    Poll::Ready(var3) => {
                        *vars = Variant2 { var1, var2, var3 };
                    }
                }
            }
            Variant2 { var2, var3, .. } => {
                return GeneratorState::Complete(*var2 + var3);
            }
        }
    }
}

fn bar(var1: usize) -> impl Future<Output = usize> {
    std::future::from_generator(
        BarGenerator::Variant0 { var1 })
}

(This isn’t exactly how async compilation works, but as I write this, it doesn’t seem too far off, and my goal here is more to give an intuition for how this works than to be perfectly accurate. I haven’t tried to compile / run the above.)

Commentary

Some of my exposure to async / await in garbage-collected languages mis-prepared me for async / await in Rust. In garbage-collected languages, allocating the data for a closure is a straightforward heap allocation, and garbage collection allows the closure data to be allocated only when the async function is called. In Rust, the storage for the entire async call graph is allocated before invoking the async code; the generator state will then live at a fixed location on the stack, beneath the top of the stack, so that invoking the async code uses additional stack to manage the call graph. (This is why async code cannot recurse without boxing: the compiler needs to be able to statically determine the maximum stack size for async code execution.)
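The recursion restriction can be made concrete. Without boxing, a recursive async fn’s state enum would have to contain another copy of itself at the await point, giving an infinitely large type; boxing replaces that inner state with a fixed-size pointer. A sketch (`countdown` is my own hypothetical example, following the standard `Box::pin` pattern):

```rust
use std::future::Future;
use std::pin::Pin;

// A recursive async call graph. Writing this as `async fn countdown`
// with a direct `countdown(n - 1).await` would be rejected: the state
// enum would need to contain itself. Boxing each recursive level
// gives it a fixed-size (pointer) representation instead.
fn countdown(n: u32) -> Pin<Box<dyn Future<Output = u32>>> {
    Box::pin(async move {
        if n == 0 {
            0
        } else {
            countdown(n - 1).await + 1
        }
    })
}

fn main() {
    // The handle we get back is just a fat pointer (data + vtable);
    // the recursive state now lives on the heap, one box per level.
    let fut = countdown(3);
    assert_eq!(std::mem::size_of_val(&fut), 2 * std::mem::size_of::<usize>());
}
```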

In managed languages, a yield expression can record the execution context well enough that the resume operation doesn’t need to wind up the stack again, but can just start executing at the appropriate point. In Rust’s current async / await implementation, the call graph between the executor and the future being polled is walked again until the .await that caused the async function to block is reached. This means that .await becomes more expensive the deeper your async call graph nests.

Exposure to other systems languages also made me wonder about alloca, and we can see now that Rust’s async model is fundamentally incompatible with alloca: you cannot .await while alloca-allocated memory is live, because that memory belongs to the stack frame that is torn down (and its contents likely clobbered) as soon as the generator closure issues a yield.

I now feel I have a decent idea of how async / await works in Rust; I hope this was helpful to readers as well.


  1. The world’s dumbest async runtime just ignores the future.↩︎
