WebAssembly and Dynamic Memory

WebAssembly is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable target for compilation of high-level languages like C/C++/Rust, enabling deployment on the web for client and server applications.

These high-level languages (C/C++/Rust, and also JavaScript) deal with different kinds of memory allocation, such as static memory, stack memory and dynamic memory.

Static memory allocation refers to the assignment of fixed memory addresses at compile time or application start time (the .data and .bss segments).

Stack memory allocation refers to a memory region where data is added and removed per function scope in last-in-first-out (LIFO) order.

Dynamic memory allocation refers to memory that is managed manually at runtime, in C via a group of functions in the C standard library, namely malloc, realloc, calloc and free, or in Rust via the corresponding boxing mechanisms.

This document will focus on the last of these, dynamic memory, and how it is realized by the different toolchains compiling C/C++/Rust for the WebAssembly architecture/platform.

My intention is to find the toolchain that produces WebAssembly modules that are as portable as possible.

The code being used for the investigation is accessible at https://github.com/frehberg/wasm-dyn-mem

High-level source code (C/C++/Rust) is compiled to WebAssembly using toolchains like emscripten, clang, or rustc. Emscripten translates the C/C++ to JavaScript first and then to WebAssembly, whereas clang translates C/C++ directly to WebAssembly. (Note: avoid emscripten if your code has to deal with int64 primitives, as JavaScript would emulate those using floats.)

When developing portable WebAssembly modules/apps it is crucial to understand how static and dynamic memory are managed and how the chosen compiler toolchain influences this.

The security model of WebAssembly has the important goal of protecting users from buggy or malicious modules. This is achieved by sandboxing the code, which operates on a so-called linear memory only.

The linear memory of WebAssembly can be compared to the virtual memory provided to processes running on Windows or Linux. The linear memory is the sandbox the WebAssembly module may operate on, storing the static data in an indexed region, managing the stack memory and managing the dynamic memory required by the app during runtime.

A WebAssembly module may be a library only or an app; the latter requires a start-function in the module. An app would run global static constructors (C++) at start time, whereas calling a function of a library would not (lemma: avoid global static constructors if you want to translate C++ to WebAssembly).

When instantiating a WebAssembly module, it might operate on a newly created linear memory or it might continue operating on a linear memory provided by the environment.

The management of dynamic memory is provided by so-called runtime libraries such as gnu-libc, musl-libc (embedded), bionic-libc (Android) or win32, each implementing the functions malloc(size_t), free(ptr) etc. using a different strategy. Depending on the use-case and platform, different implementations of the actual memory allocation mechanism exist. Their performance varies in both execution time and required memory.

[dlmalloc] Doug Lea has developed dlmalloc (“Doug Lea’s Malloc”) as a general-purpose allocator, starting in 1987. The GNU C library (glibc) uses ptmalloc, an allocator based on dlmalloc.

[jemalloc] Since FreeBSD 7.0 and NetBSD 5.0, the old malloc implementation (phkmalloc) was replaced by jemalloc, written by Jason Evans. The main reason for this was a lack of scalability of phkmalloc in terms of multithreading. In order to avoid lock contention, jemalloc uses separate “arenas” for each CPU. Experiments measuring the number of allocations per second in multithreaded applications have shown that this makes it scale linearly with the number of threads, while for both phkmalloc and dlmalloc performance was inversely proportional to the number of threads.

[Hoard malloc] Hoard is an allocator whose goal is scalable memory allocation performance. Like OpenBSD’s allocator, Hoard uses mmap exclusively, but manages memory in chunks of 64 kilobytes called superblocks. Hoard’s heap is logically divided into a single global heap and a number of per-processor heaps. In addition, there is a thread-local cache that can hold a limited number of superblocks. By allocating only from superblocks on the local per-thread or per-processor heap, and moving mostly-empty superblocks to the global heap so they can be reused by other processors, Hoard keeps fragmentation low while achieving near linear scalability with the number of threads.

[wee_alloc] The wee_alloc memory allocator has been designed for WebAssembly, and is available for Rust only. It has a tiny code-size footprint, compiling down to only a kilobyte of .wasm code. It has been designed for scenarios where there are a handful of initial, long-lived allocations, after which the code does its heavy lifting without any further allocations. This scenario requires some allocator to exist, but we are more than happy to trade allocation performance for small code size.
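To pull wee_alloc into a Rust project and register it as the global allocator (as done in the Rust example later in this document), a Cargo.toml dependency is sufficient. A minimal, illustrative excerpt; the version number is an assumption and may need updating:

```toml
[dependencies]
wee_alloc = "0.4"
```

The module is then built for the wasm32-unknown-unknown target, e.g. with cargo build --target wasm32-unknown-unknown --release.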

These implementations of the dynamic memory framework rely on memory pages (usually 4 KB each) mapped into the virtual address space of the process in question.

Within each process the actual implementation of the dynamic memory allocator subdivides these pages into chunks, tracking free and allocated chunks so as to keep fragmentation and bookkeeping overhead low.

If the dynamic memory allocator runs out of memory, it performs a platform-specific call to request more memory pages to be mapped into the address space of the process (aka the instance of the wasm module). On top of a Linux kernel this would be a system call such as brk() or mmap(); WebAssembly defines the operator grow_memory for this purpose.

For example, the following C application may be translated to a WebAssembly (wasm) module (in fact an app with a start function) either using emscripten or using a clang toolchain (combined with musl-libc):

#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE (64*1024)
int
main(int argc, char **argv) {
  char* chunk = malloc(CHUNK_SIZE);
  if (!chunk) {
    return EXIT_FAILURE;
  }

  // do something with the chunk, to prevent it being optimized out
  strncpy(chunk, argv[0], CHUNK_SIZE-1);
  chunk[CHUNK_SIZE-1] = '\0'; // strncpy does not always null-terminate

  int ret = strcmp("myapp", chunk) ? EXIT_FAILURE : EXIT_SUCCESS;
  free(chunk);
  return ret;
}

[emscripten] The emscripten toolchain will produce a wasm binary importing/exporting the following symbols. emscripten uses an implementation of dlmalloc, which expects the WebAssembly environment to expose the following functions for requesting and managing memory pages at runtime.

(import "env" "memory" (memory (;0;) 256 256))
(import "env" "DYNAMICTOP_PTR" (global (;0;) i32))
(import "env" "STACKTOP" (global (;1;) i32))
(import "env" "enlargeMemory" (func (;0;) (type 0)))
(import "env" "getTotalMemory" (func (;1;) (type 0)))
(import "env" "abortOnCannotGrowMemory" (func (;2;) (type 0)))
(import "env" "___setErrNo" (func (;3;) (type 1)))
(export "___errno_location" (func 10))
(export "_main" (func 6))
(export "stackAlloc" (func 11))

[musl-libc/clang] The combination of musl-libc and clang toolchain will produce a wasm binary importing/exporting the following symbols. Here, musl-libc provides an implementation of malloc/free designed for embedded devices running a Linux kernel or as standalone executables. The WebAssembly environment is expected to provide these __syscall functions.

(import "env" "__syscall3" (func $__syscall3 (type 7)))
(import "env" "__syscall2" (func $__syscall2 (type 5)))
(import "env" "__syscall1" (func $__syscall1 (type 4)))
(import "env" "__syscall6" (func $__syscall6 (type 10)))
(import "env" "__syscall4" (func $__syscall4 (type 11)))
(import "env" "__syscall0" (func $__syscall0 (type 2)))
(export "memory" (memory 0))
(export "_start" (func $_start))
(export "__heap_base" (global 1))

[rustc] Giving Rust a chance, the code for an embedded application executed in a WebAssembly execution engine might look like the following, without dependencies on a std-library, linking against the wee_alloc implementation of the dynamic memory allocator. The intention of this code is to provoke allocation of dynamic memory using boxing (see the example at https://github.com/frehberg/wasm-dyn-mem).

#![no_std]
#![feature(start)] 

// Required to use the `alloc` crate and its types, the `abort` intrinsic, and a
// custom panic handler.
#![feature(alloc, core_intrinsics, panic_implementation, lang_items, alloc_error_handler)]

extern crate alloc;
extern crate wee_alloc;

use alloc::boxed::Box;

// Use `wee_alloc` as the global allocator.
#[global_allocator]
static ALLOC: wee_alloc::WeeAlloc = wee_alloc::WeeAlloc::INIT;

// Need to provide a tiny `panic` implementation for `#![no_std]`.
// This translates into an `unreachable` instruction that will
// raise a `trap` in the WebAssembly execution if we panic at runtime.
#[panic_handler]
#[no_mangle]
pub fn panic(_info: &::core::panic::PanicInfo) -> ! {
    unsafe {
        ::core::intrinsics::abort();
    }
}

// Need to provide an allocation error handler which just aborts
// the execution with trap.
#[alloc_error_handler]
#[no_mangle]
pub extern "C" fn oom(_: ::core::alloc::Layout) -> ! {
    unsafe {
        ::core::intrinsics::abort();
    }
}

#[start]
fn start(_argc: isize, _argv: *const *const u8) -> isize {
    unsafe {
        // argv[0] is the application name
        let appname: *const u8 = *_argv.offset(0);
        let mut chunk: Box<[u8]> = Box::new([0; 10]);

        // copy the first bytes of the name into the heap-allocated chunk
        for i in 0..6 {
            chunk[i] = *appname.offset(i as isize);
        }
        match &chunk[..6] {
            b"foobar" => 1,
            _ => 0,
        }
    }
}

Compiling this code, the resulting WebAssembly module will have the following dependencies on the WebAssembly runtime environment:

(export "memory" (memory 0))
(export "__indirect_function_table" (table 0))
(export "__heap_base" (global 1))
(export "__data_end" (global 2))
(export "main" (func $main))

As you might notice, there are none of the explicit dependencies created by emscripten or musl-libc/clang in the previous examples. Here the Rust implementation of the dynamic memory allocator wee_alloc relies on the WebAssembly-specific operator grow_memory.

(func $_$LT$wee_alloc..LargeAllocPolicy$u20$as$u20$wee_alloc..AllocPolicy$LT$$
u27$a$GT$$GT$::new_cell_for_free_list::h33e771c3c233d123 (type 1) (param i32 i32
i32 i32)
block ;; label = @1
i32.const 0
get_local 2
i32.const 2
i32.shl
tee_local 2
get_local 3
i32.const 3
i32.shl
i32.const 16384
i32.add
tee_local 3
get_local 3
get_local 2
i32.lt_u
select
i32.const 65543
i32.add
tee_local 2
i32.const 16
i32.shr_u
grow_memory

Conclusion

In terms of portability the clear winner is the Rust compiler toolchain in combination with the dynamic memory allocator wee_alloc. 

The resulting WebAssembly code has not a single dependency on or import of any symbol/function-name exposed by the runtime environment, and this lean code might be executed in any WebAssembly environment.

In contrast, toolchains like emscripten or musl-libc/clang will add dependencies, importing symbols from the WebAssembly runtime to request and manage the memory pages, depending either on those __syscall functions or on functions such as enlargeMemory.

These functions may be specific to JavaScript environments or custom environments only, and would prevent execution in other WebAssembly environments.

Outlook

Next I want to investigate the combination of Rust and its std allocator, an implementation of jemalloc.
