Project Description

Hello! This was a mini competition run by the vulnerability research group @RITSEC at the start of Spring 2026. We studied modern allocator implementations, specifically Mimalloc, PartitionAlloc, TCMalloc, Scalloc, Scudo, and bmalloc; our slides can be found at https://vri.group. The test benchmark was created by Oleg and exercises basic allocator behavior, memory fragmentation, throughput, tail latency, and a number of security edge cases.

Benchmark

At a high level the benchmark harness is allocator-agnostic: you drop in a malloc/free implementation and the test runner drives it through several suites. The correctness suite covers basic malloc/free, realloc, calloc, alignment, and size-boundary behavior from tiny objects all the way up to near SIZE_MAX. The stress and threaded suites hammer the allocator with long realloc chains, 1M+ ops, producer/consumer patterns, and mixed-size workloads to shake out races and performance cliffs.

There is a dedicated fragmentation suite that uses “Swiss cheese” style patterns and peak live set cycles to measure how much RSS is retained after frees, plus medium and large tier fragmentation tests. The edge and security tests probe weird cases like invalid frees, double frees, stack frees (“house of spirit”), heap poisoning (“house of lore”), and top chunk abuse (“house of force”) to make sure an allocator either safely ignores or explicitly aborts on bad behavior instead of silently corrupting metadata.

Finally the realistic workloads replay traces from things like Redis YCSB, SQLite TPC-C, Firefox page loads, and mixed lifetime application patterns. For each workload the harness records overall throughput (ops/sec), latency percentiles (p50 / p99 / p999), and peak / final RSS, which makes it easy to compare custom allocators like SHARE_RD against glibc and modern production allocators.

Blogs

  1. zialloc
  2. share-rdAlloc
  3. dualalloc

Github

  1. Aeshus
  2. dualalloc
  3. Zialloc
  4. share-rdAlloc

Zialloc

Zialloc Design

DISCLAIMER: I’m still working on this project, so there is no promise that this design doc is up to date regarding security features and execution paths. The layout won’t change, though.

Size Model

Zialloc uses fixed size classes for regular allocations plus a directly-mapped XL path.

  • Reserved heap virtual address space (default): 100GB
  • Segment size / alignment: 128MiB
  • Page classes: SMALL (1MiB), MEDIUM (8MiB), LARGE (16MiB), XL (direct mapping)
  • Chunk size thresholds used for class selection: 512 KiB - 16 B (small), 4 MiB - 16 B (medium), 8 MiB - 16 B (large); above that threshold, XL

Regular-sized requests are bucketed/aligned before placement:

  • XL path uses 16-byte alignment.
  • Small/Medium/Large classes use a normalized chunk model: round request to at least 16, then round up to next power-of-two, then clamp to class cap.
  • Final stride in page = align_up(normalized_request + inline_header, 16).
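As a rough sketch, the bucketing above might look like this; INLINE_HEADER_SIZE, next_pow2, align_up and page_stride are illustrative names of mine, not Zialloc’s real identifiers:

```c
#include <stddef.h>

/* Hedged sketch of the normalization described above; names are
   assumptions, not Zialloc's actual code. */
#define INLINE_HEADER_SIZE 16

static size_t next_pow2(size_t x) {
    size_t p = 1;
    while (p < x) p <<= 1;
    return p;
}

static size_t align_up(size_t x, size_t a) {
    return (x + a - 1) & ~(a - 1);
}

/* round request to at least 16, round up to the next power of two,
   clamp to the class cap, then compute the final in-page stride */
static size_t page_stride(size_t request, size_t class_cap) {
    size_t n = request < 16 ? 16 : request;
    n = next_pow2(n);
    if (n > class_cap) n = class_cap;
    return align_up(n + INLINE_HEADER_SIZE, 16);
}
```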

Heap Layout

At initialization, zialloc reserves a large vmem region (currently 100GB) and commits segments from it on demand (128MiB). It immediately seeds one segment each for small, medium, and large classes.

Hierarchy

Each segment is classed by size, meaning all pages in that segment have the same page size.

  • Small segment: 1MiB pages
  • Medium segment: 8MiB pages
  • Large segment: 16MiB pages

Each page is then subdivided into fixed-size chunk slots for that page instance, which lets us avoid coalescing logic entirely, along with the fragmentation worries that come with it. Within one page, all slots have identical stride/usable size, and allocation state is tracked with a bitmap.

XL allocations semi-bypass the page/segment class system and are mmapped as standalone mappings with inline XL headers.

The metadata model is entirely allocator-owned and kept out-of-line (OOL):

  • Chunks can resolve their owning page and slot idx using pointer arithmetic on themselves
  • Per-page metadata: bitmap, used counts, owner TID, deferred-free ring
  • Per-segment metadata: class, page array, full-page count, chunk-geometry lock-in, integrity key/canary
  • XL metadata is inline in front of returned pointer (XLHeader)

Allocation Workflow

Allocation starts with API wrappers (malloc, calloc, realloc) and then goes through HeapState::allocate(size).

Main behavior:

  • malloc/calloc/realloc ensure heap init has happened, then validate request size.
  • Size class is computed (SM/MD/LG/XL) using configured chunk-class thresholds.
  • Even if a request is classified as XL, there is a second chance for a large-page fit: if it can still fit a large-page chunk geometry (<= LARGE_PAGE_SIZE - sizeof(ChunkHeader)), the allocator reroutes it to the large class; otherwise a true XL mapping is used.
  • The non-XL fast path searches thread-local cached pages by class. If the executing thread has a cached page that satisfies the requested size, that page is used directly, since this is faster than any of the paths below.
  • Next path is searching a thread-local preferred segment (there are multiple; sorted by class).
  • Next path is a shard queue of known non-full segments.
  • Next path is a bounded scan of same size classed segments.
  • Slow path grows heap by carving another segment out of our pre-reserved virtual address space.
  • The final fallback mmaps a new segment-aligned mapping if the current reserved region cannot satisfy the request.
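Putting the class thresholds and the second-chance reroute together, a hedged sketch (constants taken from the size model above; `classify` and the enum names are mine, not Zialloc’s):

```c
#include <stddef.h>

enum zi_class { ZI_SMALL, ZI_MEDIUM, ZI_LARGE, ZI_XL };

/* CHUNK_HDR is an assumed 16-byte inline header, matching the
   "- 16B" in the chunk-class thresholds above. */
#define KIB (1024UL)
#define MIB (1024UL * 1024UL)
#define CHUNK_HDR 16UL
#define LARGE_PAGE_SIZE (16 * MIB)

static enum zi_class classify(size_t req) {
    if (req <= 512 * KIB - CHUNK_HDR) return ZI_SMALL;
    if (req <= 4 * MIB - CHUNK_HDR)   return ZI_MEDIUM;
    if (req <= 8 * MIB - CHUNK_HDR)   return ZI_LARGE;
    /* second chance: reroute to the large class if the request still
       fits a large-page chunk, otherwise fall through to a true XL
       standalone mapping */
    if (req <= LARGE_PAGE_SIZE - CHUNK_HDR) return ZI_LARGE;
    return ZI_XL;
}
```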

Bitmap/chunk behavior

The chunk allocator inside a page is bitmap driven:

  • Page tracks a used_bitmap where 1 = in use and 0 = free.
  • Allocation searches from a hint, finds the first zero bit, marks it, writes a chunk header, and returns a pointer after the header to the user.
  • Free validates header/magic/owner/slot, clears the bit, decrements used count, and updates first_hint for future faster reuse.
  • Double free detection is possible by seeing if a bitmap bit is already clear and aborting.
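A minimal single-page model of this bitmap scheme, including the abort-on-double-free from the last bullet; the type and field names here are illustrative, not Zialloc’s actual structures:

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy one-page bitmap allocator: 1 = in use, 0 = free,
   with a first_hint for faster reuse, as described above. */
#define SLOTS 64

typedef struct {
    uint64_t used_bitmap;
    uint32_t used_count;
    uint32_t first_hint;   /* where the next search starts */
} page_bitmap_t;

static int page_alloc_slot(page_bitmap_t *pg) {
    for (uint32_t i = 0; i < SLOTS; i++) {
        uint32_t slot = (pg->first_hint + i) % SLOTS;
        if (!(pg->used_bitmap & (1ULL << slot))) {
            pg->used_bitmap |= 1ULL << slot;
            pg->used_count++;
            return (int)slot;
        }
    }
    return -1; /* page full */
}

static void page_free_slot(page_bitmap_t *pg, uint32_t slot) {
    if (!(pg->used_bitmap & (1ULL << slot)))
        abort(); /* bit already clear: double free detected */
    pg->used_bitmap &= ~(1ULL << slot);
    pg->used_count--;
    pg->first_hint = slot; /* cheap reuse hint */
}
```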

Alloc flow

Free Workflow

Free enters through free() and dispatches to HeapState::free_ptr(ptr, usable_out). It’s important to note that ‘freeing’ here doesn’t mean releasing memory to the OS; it just means undoing the allocator’s bookkeeping so the slot can be reused.

Main behavior:

  • Null free is ignored.
  • For regular allocations, inline header is parsed (ChunkHeader) and magic value is checked (CHUNK_MAGIC) for corruption - an invalid state would abort().
  • Owning page/segment is resolved from header pointer links.
  • If freeing thread is not page owner thread, allocator attempts a deferred enqueue into page-local lock-free ring first.
  • If deferred enqueue fails (queue full/contention), it falls back to direct free on page.
  • Deferred frees are drained by owner-side allocation path opportunistically whenever the queue pressure is moderately high.
  • Chunk “free()” itself is just a bitmap bit clear + used chunk count decrement (+ optional zero-on-free).
  • For XL pointers, allocator checks XL_MAGIC, optionally zeroes payload, and unmaps entire mapping.
  • Invalid/untracked pointers cause the dispatch API to report failure, and the caller aborts.
  • Thread cache is then updated to track the current page for faster future hits.

Deferred-free ring: an unintended bonus

The deferred-free queue is a bounded per-page ring used to defer remote-thread mutation of pages those threads don’t own. This gives us a cheeky capability for detecting UAFs (if checks are enabled), and because it delays reuse it acts as a pseudo temporal quarantine: no writes can land on pointers currently sitting in the queue.
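A toy sketch of such a bounded per-page ring; the names, capacity, and the deliberately simplified memory ordering are my assumptions (the real ring is lock-free and more careful about producer/consumer races):

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_CAP 16 /* power of two, assumed capacity */

typedef struct {
    _Atomic(void *) slots[RING_CAP];
    _Atomic size_t  head; /* next slot the owner drains    */
    _Atomic size_t  tail; /* next slot remote threads fill */
} defer_ring_t;

/* remote-thread enqueue: returns 1 on success, 0 when full
   (caller then falls back to a direct free on the page) */
static int ring_push(defer_ring_t *r, void *ptr) {
    size_t t = atomic_load(&r->tail);
    for (;;) {
        size_t h = atomic_load(&r->head);
        if (t - h >= RING_CAP)
            return 0; /* full */
        if (atomic_compare_exchange_weak(&r->tail, &t, t + 1))
            break;
    }
    atomic_store(&r->slots[t % RING_CAP], ptr);
    return 1;
}

/* owner-side drain: pops one deferred pointer, NULL when empty */
static void *ring_pop(defer_ring_t *r) {
    size_t h = atomic_load(&r->head);
    if (h == atomic_load(&r->tail))
        return NULL;
    void *p = atomic_exchange(&r->slots[h % RING_CAP], NULL);
    atomic_store(&r->head, h + 1);
    return p;
}
```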

Free flow

Security Strategy

Zialloc uses a few different integrity checks plus optional hardening toggles.

Current controls/checks:

  • Pointer/header ownership checks before free and usable-size operations
  • Abort-on-corruption for invalid headers, bad transitions, and detected double frees
  • Segment integrity key/canary check in validation path
  • Optional zero-on-free memory scrubbing
  • Optional UAF check path in usable_size (aborts if the slot is no longer marked as allocated)

Known Limits

  • Heap layout itself isn’t optimal
  • Metadata lookup/access isn’t as good as radix trees.
  • XL allocations are direct mapped and behavior differs from class-segmented allocations.
  • Segment classing and fixed chunk geometry per segment trade memory efficiency for predictable behavior and scan reduction.
  • Deferred ring for cross-thread frees is capped and may fall back to direct page free.
  • Thread-aware fast paths improve latency but add complexity and state coupling.

Source Map

  • API entrypoints, init/teardown, stats:
    • zialloc/alloc.cpp
  • Core allocator internals (heap/segment/page/cache/deferred free):
    • zialloc/segments.cpp
  • OS mapping/protection/reservation wrappers:
    • zialloc/os.cpp
  • Shared constants/macros/enums:
    • zialloc/types.h
    • zialloc/mem.h
  • Memory interface declarations used across units:
    • zialloc/zialloc_memory.hpp

API Surface

Supported APIs:

  • malloc
  • free
  • realloc
  • calloc
  • usable_size
  • print_stats
  • validate_heap
  • get_stats
  • init
  • teardown

Currently not implemented:

  • memalign
  • aligned_alloc
  • free_sized
  • realloc_array
  • bulk_free

share-rdAlloc

General Overview

The SHARE_RD allocator is a three-tiered allocator built on the fundamental idea that we allocate a lot of OS virtual memory during initialization and get very fast allocations after that. Its overall design is a mix of several modern allocators, with a good bit of inspiration from Google’s TCMalloc, which makes a similar choice of tiered allocation paths, though Google’s general-purpose allocator does quite a bit more processing to decide how things in its primary fast path are stored (lots of sampling). The name of the allocator is based on an amazing joke where I kept calling a friend of mine Shard for like 3 hours lol. Anyway, on to the actual technical meat!

SHARE_RD Allocation Path

Allocation Path Diagram

SHARE_RD_malloc

void *SHARE_RD_malloc(size_t size) {
    if (size == 0 || size > (size_t)-1 - 4096)
        return NULL;
    if (size > MAX_LARGE_ALLOC_SIZE)
        return MASSIVE_ALLOC(size);
    if (size > MAX_SMALL_SIZE)
        return medium_alloc(size);
    return small_alloc(size);
}

This is the entry point for our overridden system malloc: we check the size against the max size values for the three allocation paths. The limits are MAX_LARGE_ALLOC_SIZE = 8 MB and MAX_SMALL_SIZE = 512 KB, so the three pathways are MASSIVE_ALLOC for anything above 8 MB, medium_alloc for 512 KB up to 8 MB, and small_alloc for everything up to 512 KB.

Heap Format:

lower addresses                                               higher addresses
┌────────────────────────────────────────────────────────────────────────────┐
│                           1 TB Virtual Memory Alloc @ Start                │
├────────────────────────────────────────────────────────────────────────────┤
│                             Small Heap 512 GB                              │
│          [ small_region_base .. small_region_base + 512 GB ]               │
│                                                                            │
│   small spans, many size classes (0 .. 512 KB) managed by small_alloc      │
├────────────────────────────────────────────────────────────────────────────┤
│                            Medium Heap (512 GB)                            │
│        [ medium_region_base .. medium_region_base + 512 GB ]               │
│                                                                            │
│      5 contiguous zones of ~102 GB each, one zone per medium size:        │
│                       512 KB, 1 MB, 2 MB, 4 MB, 8 MB                      │
└────────────────────────────────────────────────────────────────────────────┘

Massive allocations (MASSIVE_ALLOC) are backed by mmap and in effect live
outside this pre‑reserved 1 TB space.

MASSIVE_ALLOC

This is by far the simplest allocation pathway in my allocator: I simply use mmap to ask the OS for memory and munmap to deallocate it. Because of how slow that is, I obviously only want to do this for very large allocations, which here means anything above 8 MB; I found that to work fine for my needs. The idea behind the large allocations is that we need to store very little data:

+-----------------------------------------------------------------+
|                     large_header_t (metadata)                   |
|-----------------------------------------------------------------|
| total_size : size of whole mapping (incl. header, page-rounded) |
+-----------------------------------------------------------------+
|                                                                 |
|                         user payload                            |
|                memory location returned by the function         |
|                                                                 |
|   requested size bytes  (<= total_size - sizeof(large_header))  |
|                                                                 |
+-----------------------------------------------------------------+

typedef struct {
    // 8 bytes 
    size_t   total_size; // large: full mmap size for munmap; medium: 0 
    uint8_t  _pad[8]; // 8 bytes of padding so the header is 16 bytes for alignment
} large_header_t; 

The only real data I care to store in my massive header is the size of the allocation, which of course is needed to know the range to free. Internally, the massive allocations also use the “mah”, or massive allocation hashmap, to store every single large allocation entry. It’s a fairly standard hashmap implemented in the mah_hashmap.c/h files, with an entry struct and a hashmap struct.

typedef struct massive_alloc_hashmap_entry {
    void *ptr; // ptr to start of the alloc 
    size_t size; // size of the alloc 
    struct massive_alloc_hashmap_entry *next;
} massive_alloc_hashmap_entry_t;

typedef struct {
    massive_alloc_hashmap_entry_t **entries; // ptr to array of buckets
    size_t size; // size of the array of buckets
    size_t count; // # of entries in the hashmap
} massive_alloc_hashmap_t;

The hashmap implements your standard create/insert/lookup operations, and the hash is the 64-bit finalizer (fmix64) from MurmurHash3:

static unsigned long mah_hash(void *ptr, size_t table_size) {
    uintptr_t k = (uintptr_t)ptr;
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return (unsigned long)(k % table_size);
}

We have a single global hashmap that we use to look up and store large allocations, with a mutex to make sure we don’t run into race conditions.

void MASSIVE_FREE(void *ptr) {
    if (ptr == NULL)
        return;
    pthread_mutex_lock(&large_map_lock);
    massive_alloc_hashmap_entry_t *entry = mah_lookup(large_map, ptr);
    if (entry == NULL) {
        pthread_mutex_unlock(&large_map_lock);
        abort();
    }
    size_t total = entry->size;
    mah_remove(large_map, ptr);
    pthread_mutex_unlock(&large_map_lock);

    void *hdr = (char *)ptr - sizeof(large_header_t);
    munmap(hdr, total);
}

The last interesting thing I support, as a byproduct of the test-bench suite our team defined, is kernel huge pages. This tells the Linux kernel to back the mapping with 2 MB huge pages (MAP_HUGE_2MB) for better large-allocation behavior, which I support with a simple preprocessor flag:

void *MASSIVE_ALLOC(size_t size) {
    size_t total = sizeof(large_header_t) + size;
    total = round_up_to_page(total);

    // try to use huge pages if available
#if TRY_HUGE_PAGES
    if (total >= HUGE_PAGE_SIZE) {
        size_t huge_total = (total + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
        void *base = mmap(NULL, huge_total,
                         PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                         -1, 0);
        if (base != MAP_FAILED) {
            large_header_t *hdr = (large_header_t *)base;
            hdr->total_size = huge_total;
            void *payload = (char *)base + sizeof(large_header_t);
            ensure_large_map();
            pthread_mutex_lock(&large_map_lock);
            mah_insert(large_map, payload, huge_total);
            pthread_mutex_unlock(&large_map_lock);
            return payload;
        }
    }
#endif
    // default to normal 4 KB pages otherwise
    void *base = mmap(NULL, total,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS,
                     -1, 0);

    if (base == MAP_FAILED)
        return NULL;

    large_header_t *hdr = (large_header_t *)base;
    hdr->total_size = total;

    void *payload = (char *)base + sizeof(large_header_t);
    ensure_large_map();
    pthread_mutex_lock(&large_map_lock);
    mah_insert(large_map, payload, total);
    pthread_mutex_unlock(&large_map_lock);

    return payload;
}
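round_up_to_page isn’t shown in the post; here is a plausible implementation, together with the huge-page rounding MASSIVE_ALLOC does inline, assuming 4 KB base pages and the 2 MB huge pages implied by MAP_HUGE_2MB above. These exact names and constants are my assumptions:

```c
#include <stddef.h>

#define PAGE_SIZE      4096UL
#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)

/* round a byte count up to a whole number of 4 KB pages */
static size_t round_up_to_page(size_t n) {
    return (n + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}

/* same idea for 2 MB huge pages, matching the inline
   (total + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1) above */
static size_t round_up_to_huge(size_t n) {
    return (n + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
}
```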

The free system, and the reason the hashmap exists, is to protect my allocator against security issues. For example, the hashmap prevents us from freeing, say, stack memory, since that object wouldn’t be in the hashmap; any pointer not found in the hashmap is rejected rather than freed.


Medium_alloc

The medium path handles sizes from 512 KB up to 8 MB. It lives in the second half of the 1 TB virtual region (the medium heap). There are five fixed size classes: 512 KB, 1 MB, 2 MB, 4 MB, and 8 MB. At init, the 512 GB medium region is split into five contiguous zones of about 102 GB each, one zone per size class. Within a zone we bump-allocate blocks; when a block is freed it goes onto a per-class free list so the next allocation of that size can reuse it without touching the bump pointer. Each size class has its own mutex, so different classes can be allocated in parallel. I found this approach to be a good mix of speed and simplicity. The idea of “zones” lets us fully avoid coalescing in any shape, since every allocation in a zone belongs to the same size class. I apply this idea throughout my allocator design and I think it helps keep my speeds up.

The overall memory layout of an allocation follows roughly the same idea, the primary difference being that we track the size class of the allocation, since there are five possible sizes.

+-----------------------------------------------------------------+
|                   medium_header_t (16 bytes)                    |
|  size_class_idx (4) | _pad (4) | obj_size (8)                   |
+-----------------------------------------------------------------+
|                                                                 |
|                         user payload                            |
|                (slot_size = medium_class_sizes[size_class])     |
|                                                                 |
+-----------------------------------------------------------------+

We have a very simple lookup to find the size class, which we store in the header:

typedef struct {
    uint32_t size_class_idx; // 4
    uint32_t _pad; // 4
    size_t   obj_size; // 8
    // 16 bytes for alignment 
} medium_header_t; //  512KB - 8mb, inside of the main 1TB of memory

Size class selection is done by rounding the requested size up to the next medium class:

#define MEDIUM_CLASS_COUNT 5

static const size_t medium_class_sizes[MEDIUM_CLASS_COUNT] = {
    512 * 1024, // 512KB
    1 * 1024 * 1024, // 1MB
    2 * 1024 * 1024, // 2MB
    4 * 1024 * 1024, // 4MB
    8 * 1024 * 1024, // 8MB
};

static inline int find_size_class(size_t size) {
    // hardcoded size classes for medium allocations
    // 512KB, 1MB, 2MB, 4MB, 8MB
    if (size <= 512 * 1024) return 0;
    if (size <= 1 * 1024 * 1024) return 1;
    if (size <= 2 * 1024 * 1024) return 2;
    if (size <= 4 * 1024 * 1024) return 3;
    return 4; // 8MB
}

Internally each medium class has a free list, a bump pointer, and the end of its zone. Allocation locks that class, then either pops from the free list or bumps within the zone. Free validates the pointer is inside the medium region and that it has not already been freed (FREED_MAGIC), then pushes the block onto the class free list.

typedef struct medium_class {
    void *free_list; // start of free blocks 
    void *bump; // next free addr 
    void *zone_end; // end of zone 
    pthread_mutex_t lock; // lock for the class 
} medium_class_t;
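Zone setup isn’t shown in the post; a sketch of how init might carve the medium region into five per-class zones. The helper name medium_init is mine, and the test uses a scaled-down region so the sketch can run without reserving 512 GB:

```c
#include <pthread.h>
#include <stddef.h>

/* Sketch: split a contiguous medium region into MEDIUM_CLASS_COUNT
   equal zones, one per size class, each with its own bump pointer,
   free list, and lock, as described above. */
#define MEDIUM_CLASS_COUNT 5

typedef struct medium_class {
    void *free_list;          /* start of freed blocks        */
    void *bump;               /* next unallocated addr        */
    void *zone_end;           /* end of this class's zone     */
    pthread_mutex_t lock;     /* per-class lock               */
} medium_class_t;

static medium_class_t medium_classes[MEDIUM_CLASS_COUNT];

static void medium_init(void *region_base, size_t region_len) {
    size_t zone_len = region_len / MEDIUM_CLASS_COUNT;
    for (int i = 0; i < MEDIUM_CLASS_COUNT; i++) {
        char *zone = (char *)region_base + (size_t)i * zone_len;
        medium_classes[i].free_list = NULL;
        medium_classes[i].bump = zone;            /* bump starts at zone base */
        medium_classes[i].zone_end = zone + zone_len;
        pthread_mutex_init(&medium_classes[i].lock, NULL);
    }
}
```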

The main idea to remember about my system is that, because we have per-size zones, any allocation we encounter can be assumed valid if we are even in this pathway. That means we never have to actually validate anything in our free-list entries: since every entry is good, we can just pop the first one and get on with our life. In effect, the fast path within this fast path is a single O(1) remove from the free list. The free list starts empty and is built up over execution time as objects get freed.

void *medium_alloc(size_t size) {
    // we need to round up, then find the correct class size and then allocate from the class
    // make sure to lock the class before allocating
    int size_class = find_size_class(size);
    // in header, and gives us back the size class index

    size_t slot_size = medium_class_sizes[size_class];
    size_t block_size = sizeof(medium_header_t) + slot_size; // header is 16 bytes

    pthread_mutex_lock(&medium_classes[size_class].lock);

    void *base = NULL;


    if (medium_classes[size_class].free_list != NULL) {
        // if we have a free list, we can allocate from the free list
        base = medium_classes[size_class].free_list; 
        medium_classes[size_class].free_list = *(void **)base; // store next in the next 8 bytes of the block
    } else {
        // if we don't have a free list we need to allocate from the bump
        // this happens at the start of the zone for the first blocks,
        // or when we are allocating more data than we are freeing
        // (the free list stores freed blocks that we can reuse)
        void *bump = medium_classes[size_class].bump;
        void *zone_end = medium_classes[size_class].zone_end;
        if ((char *)bump + block_size > (char *)zone_end) {
            // if we can't allocate from the bump, we need to return NULL
            pthread_mutex_unlock(&medium_classes[size_class].lock);
            return NULL;
        }
        base = bump;
        medium_classes[size_class].bump = (char *)bump + block_size;
    }

    pthread_mutex_unlock(&medium_classes[size_class].lock);
    if (base == NULL) {
        // if we couldn't allocate, we need to return NULL
        return NULL;
    }

    medium_header_t *hdr = base;
    hdr->size_class_idx = size_class;
    hdr->obj_size = slot_size;
    return (char *)base + sizeof(medium_header_t);
}

The free pathway for medium allocations is fairly simple, and I think it’s an easy middle ground between the complexity of the small allocation system and the very blunt munmap of the massive allocation pathway. The overall idea is the same. Many of the security test cases require validation in the free functions: making sure our pointers are actually in the medium region, that we aren’t freeing objects outside of it, and that it’s a real object and not a fake one crafted by the user. I detect double frees with a simple magic value, generated at init time when the allocator starts; once an object is freed, the magic is written 8 bytes into its header. After we are sure it’s a safe object to free, I add the magic and push the object onto my free list.

void medium_free(void *ptr) {
    // got to make sure its even real lol 
    if (ptr == NULL)
        return;


    // validate the pointer is aligned to the medium alignment
    uintptr_t p = (uintptr_t)ptr;
    if ((p & (MEDIUM_ALIGN - 1)) != 0)
        return;

    // make sure we have a global base pointer to the medium region
    if (medium_region_base == NULL)
        return;

    // get the header address and make sure its within the medium region
    // this makes sure we don't free something on the stack or in another region,
    // i.e. it protects against some of the benchmarks
    uintptr_t region_base = (uintptr_t)medium_region_base;
    uintptr_t hdr_addr = p - sizeof(medium_header_t);
    if (hdr_addr < region_base || hdr_addr >= region_base + medium_region_len)
        return;

    // we use the magic number to make sure we don't free the same block twice
    // it's super cheap to validate via an 8-byte generated magic num
    medium_header_t *hdr = (medium_header_t *)hdr_addr;
    if (*(uintptr_t *)((char *)hdr + 8) == FREED_MAGIC)
        return;

    uint32_t size_class = hdr->size_class_idx;

    if (size_class >= (uint32_t)MEDIUM_CLASS_COUNT)
        return;

    // lock the class and add the block to the free list
    // allows us to create a fast path for allocating from the free list
    pthread_mutex_lock(&medium_classes[size_class].lock);
    *(void **)hdr = medium_classes[size_class].free_list;
    *(uintptr_t *)((char *)hdr + 8) = FREED_MAGIC; // set the freed magic num 
    medium_classes[size_class].free_list = hdr;
    pthread_mutex_unlock(&medium_classes[size_class].lock);
}

Small_alloc

The small path handles all sizes from 0 up to 512 KB. It uses the first 512 GB of the 1 TB heap (the small heap). There are 16 size classes: 16, 32, 64, 128, 256, 512, 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K bytes. The small heap is carved into 2 MB spans (SPAN_SIZE); each span belongs to a single size class. A radix tree maps any pointer in the small region to the metadata for its span, so on free we can quickly find the size class and span without storing a header in every small block.
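find_size_class_small is used later in small_alloc but not shown; an assumed implementation for the 16 power-of-two classes listed above (this body is my sketch, not the author’s code):

```c
#include <stddef.h>

/* Map a request size to one of the 16 small classes
   (16 B, 32 B, ..., 512 KB). A return value of SMALL_CLASS_COUNT
   means the request is too big for the small path. */
#define SMALL_CLASS_COUNT 16

static int find_size_class_small(size_t size) {
    size_t cls_size = 16;
    int idx = 0;
    while (cls_size < size && idx < SMALL_CLASS_COUNT) {
        cls_size <<= 1;
        idx++;
    }
    return idx;
}
```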

small allocation memory model

small region: first 512 GB of the 1 TB heap

lower addresses                                                  higher addresses
+----------------------+----------------------+----- ... -----+----------------------+
| span 0 (2 MB)        | span 1 (2 MB)        |               | span N (2 MB)        |
| size class 16 B      | size class 4 KB      |               | size class 512 KB    |
+----------------------+----------------------+----- ... -----+----------------------+

example span inside the 512 GB small region (1 slab = 2 MB)

span_base
+----------------------------------------------------------------------------+
| first 8 bytes: pointer to small_span_metadata_t                            |
+----------------------------------------------------------------------------+
| obj0 | obj1 | obj2 | obj3 | obj4 | obj5 | obj6 | obj7 | ...                |
| 64 B | 64 B | 64 B | 64 B | 64 B | 64 B | 64 B | 64 B | (size class objs)  |
+----------------------------------------------------------------------------+

free list for this span, all objs of same size class 
obj0 -> obj3 -> obj7 -> ... -> NULL  

I apply a very similar mindset to building the objects and structure of the heap here as I did in the medium allocation pathway. We never have to worry about fragmentation within a span, which lets us simply hand out any free object in that span; once again we track the freed objects instead of the allocated ones. How we actually get objects and memory is a bit different, though: we can’t just move a bump pointer, so instead we refill from a central free list. We do this to keep the per-thread allocation amount more controlled, which plays better with the CPU’s caching model.

Since we obviously cannot afford per-allocation metadata in the small allocation pathway (that would waste way too much space), we instead keep metadata per span. This is very similar to how Google’s TCMalloc chooses to implement its fast path / per-thread allocation system. With that said, let’s look at our per-span metadata object; it should give a good idea of what we want to do. I of course keep track of the size class of the span, which is necessary to make sure allocations are safe and frees are valid.

#define SMALL_SPAN_SIZE  (2 * 1024 * 1024) // 2MB

struct small_span_metadata {
    uint32_t size_class;
    uint32_t _pad;
    void    *free_list; // LNK list of free objects in the span
    small_span_metadata_t *next; // next span with free objs of the same size class
    void    *span_base; 
    int      on_central_list; // flag to validate the span is on the central free list
};
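span_meta_from_ptr, used later in small_alloc, isn’t shown; going by the span diagram above, where the span’s first 8 bytes hold a pointer to the metadata, it could look like the sketch below (the struct is reproduced from the post; the function body is my assumption, and the real allocator also keeps a radix tree for this mapping):

```c
#include <stdint.h>
#include <stdlib.h>

#define SMALL_SPAN_SIZE (2UL * 1024 * 1024) /* 2MB */

typedef struct small_span_metadata small_span_metadata_t;
struct small_span_metadata {
    uint32_t size_class;
    uint32_t _pad;
    void    *free_list;
    small_span_metadata_t *next;
    void    *span_base;
    int      on_central_list;
};

/* Round the object pointer down to its 2 MB span base, then read the
   metadata pointer stashed in the span's first 8 bytes. */
static small_span_metadata_t *span_meta_from_ptr(void *obj) {
    uintptr_t base = (uintptr_t)obj & ~(uintptr_t)(SMALL_SPAN_SIZE - 1);
    return *(small_span_metadata_t **)base;
}
```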

The free list here lives exclusively in this object, and we don’t actually need a mutex on it, because every single small_span_metadata is stored on a central free list which has a mutex of its own. That brings me to on_central_list: we track whether the span is on the central free list or whether its objects are being used by a small_thread_cache. Since we refill the thread caches from spans, we always want it to be clear whether a span’s objects live on the central list or in a per-thread cache.

// our central list obj
typedef struct {
    small_span_metadata_t *span_list;
    pthread_mutex_t        lock;
} small_central_t;

The thread cache is just as simple, the main difference being that we use the __thread specifier to create a per-thread object. The system works by each thread having a free list per size class, plus a count of the objects stored: each thread thus has its own cache of allocatable objects, and if we ever need more we contact the lock-based central list and refill the cache. What’s important to note here is that we don’t actually need to track spans at all at this level; they exist as a higher-level abstraction for the central list, for fragmentation control, and of course for human understanding.

static __thread struct {
    void   *free_lists[SMALL_CLASS_COUNT];
    size_t counts[SMALL_CLASS_COUNT];
} small_thread_cache;

If the overall design is still a bit fuzzy, that’s fair; I think going through the actual allocation function should help quite a bit. The general idea is that we figure out the size class we are working with, and from there we ideally use the small_thread_cache’s free list to get objects. We use the span metadata connection to protect against attacks that might craft fake heap objects and trick us into allocating them. Either way, the main idea is that we return the object and decrement the count for the given size class.

void *small_alloc(size_t size) {
    int size_class = find_size_class_small(size);

    if (size_class >= SMALL_CLASS_COUNT) // this should never happen, but protects against crafted heap exploits 
        return NULL;
    // like the medium allocator, the free list is our fast path for allocations
    // but this time we have a thread local free list for each size class
    if (small_thread_cache.free_lists[size_class] != NULL) {
        void *obj = small_thread_cache.free_lists[size_class];
        // validate the pointer is actually from the correct size class
        // protects agsint house of lore 
        small_span_metadata_t *meta = span_meta_from_ptr(obj);
        if (!meta || meta->size_class != (uint32_t)size_class)
            abort(); // if not, we need to abort the process

        small_thread_cache.free_lists[size_class] = *(void **)obj;
        small_thread_cache.counts[size_class]--;
        *(uintptr_t *)((char *)obj + 8) = 0;
        return obj;
    }

The above can be thought of as the “fast path of the fast path”. If we don’t have any free objects, we need to go find them in our central free list, and we use the refill function to do this: it returns a simple array of pointers which we can link into small_thread_cache.free_lists. It’s also worth noting that we control the number of objects per batch and set it to SMALL_BATCH_SIZE=256. I settled on that value by experimenting against our test suite; it worked best because it reduced how often I had to call refill at all.


    void *batch[SMALL_BATCH_SIZE]; // batch refill 256 objects at a time
    size_t got = refill(size_class, batch, SMALL_BATCH_SIZE); // refill the batch
    if (got == 0)
        return NULL; // OOM issue if we can't refill the batch

    for (size_t i = 1; i < got; i++) { // go through the got ptrs and add them to the thread local free list
        *(void **)batch[i] = small_thread_cache.free_lists[size_class];
        small_thread_cache.free_lists[size_class] = batch[i];
    }
    // update the thread local free list count
    small_thread_cache.counts[size_class] = (size_t)(got - 1);
    *(uintptr_t *)((char *)batch[0] + 8) = 0; // clear the FREED_MAGIC marker on the returned object
    return batch[0];
}
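The writeup never shows find_size_class_small, so here is a minimal sketch of what such a mapper might look like, assuming size classes are spaced 16 bytes apart (the slot size shown in the radix diagram later on); the real class table may well be different:

```c
#include <stddef.h>

/* Sketch of a size-to-class mapper, ASSUMING classes at 16, 32, 48, ...
 * The actual find_size_class_small in the allocator may use a
 * non-uniform class table instead. */
static int find_size_class_small(size_t size) {
    if (size == 0)
        size = 1;               /* treat malloc(0) as the smallest class */
    /* round up to the next multiple of 16, then index from 0 */
    return (int)((size + 15) / 16) - 1;
}
```

With this scheme the caller's `size_class >= SMALL_CLASS_COUNT` check is what rejects oversized (or attacker-crafted) requests.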

The idea behind the refill function is fairly simple: if the central free list for a given size class has objects available, we hand them over to the request and we are done. If it does not, we must allocate a new span to satisfy the request. Finally, if a span's meta->free_list is fully drained, we unlink that span from the central free list and record that it is no longer on the central_list at all; this matters for the free pathway later on. This is also obviously the slow path, since we are walking a lock-protected central linked list; because we batch, we ideally do not enter this function too often, and projects that do a lot of small allocation would benefit from increasing the batch amount.

static size_t refill(int size_class, void **batch, size_t want) {
    size_t got = 0; // # of objs refilled 
    // we check the central free list for free objects
    pthread_mutex_lock(&central[size_class].lock);

    while (got < want) { // loop till we have at least # want objs refilled 
        small_span_metadata_t *meta = central[size_class].span_list;
        while (meta && meta->free_list == NULL)
            meta = meta->next;

        if (meta == NULL) {
            meta = new_span(size_class); // if our central free list is empty, we need to allocate a new span
            if (meta == NULL) // OOM issue if we can't alloc a new span
                break;
        }

        while (got < want && meta->free_list != NULL) { // loop till we have at least # want objs refilled 
            void *obj = meta->free_list;
            meta->free_list = *(void **)obj;
            batch[got++] = obj;
        }

        if (meta->free_list == NULL) {
            small_span_metadata_t **p = &central[size_class].span_list;
            while (*p != meta)
                p = &(*p)->next;
            *p = meta->next;
            meta->next = NULL; // unlink the meta from the central free list
            meta->on_central_list = 0; // set the on_central_list flag to 0
        }
    }

    pthread_mutex_unlock(&central[size_class].lock); // unlock the central free list
    return got;
}

Yeah… this got pretty confusing pretty fast. Hopefully the reason my system works this way makes sense; it was a lot of building on the previous section and asking “how can I make this faster?”. Regardless, on to the final part of my writeup/blog post: the free! The code for this section is also pretty straightforward: we validate that the freed pointer is actually a real object and that the FREED_MAGIC value isn’t already present (which would indicate a double free). The more interesting part is the maintenance of the per-thread cache’s free list size: we make sure the free list never grows beyond the defined size #define SMALL_MAX_CACHE (2 * SMALL_BATCH_SIZE), in order to keep the thread-local free list optimized for CPU caching. I honestly should have played with this number a bit more but didn’t have time to figure out how to do that well; the value works, and the idea is that once we go beyond that size we simply flush objects back to our central free list.

void small_free(void *ptr) {
    if (!ptr) // freeing NULL is a no-op
        return;
    if (*(uintptr_t *)((char *)ptr + 8) == FREED_MAGIC) // double free, ignore
        return;

    small_span_metadata_t *meta = span_meta_from_ptr(ptr);
    if (!meta)
        return;

    uint32_t size_class = meta->size_class;
    *(void **)ptr = small_thread_cache.free_lists[size_class];
    *(uintptr_t *)((char *)ptr + 8) = FREED_MAGIC; // same free magic as the medium allocator
    // same idea we want to keep a linked list of free objects for each size class for fast allocations
    small_thread_cache.free_lists[size_class] = ptr; 
    small_thread_cache.counts[size_class]++; // update the thread local free list count

    // we have a max cache size to optimize around cpu caching and avoid too much contention
    if (small_thread_cache.counts[size_class] >= SMALL_MAX_CACHE)
        flush(size_class, SMALL_BATCH_SIZE); // flush the thread local free list to the central free list if it exceeds the max cache size
}

The way flush works is pretty straightforward: we build an array of pointers up to our batch size, return those objects to the central free list, and make sure to mark each of their metadata objects as on_central_list.

static void flush(int size_class, size_t count) {
    void *batch[SMALL_BATCH_SIZE];
    size_t n = 0;
    while (n < count && small_thread_cache.free_lists[size_class] != NULL) {
        void *obj = small_thread_cache.free_lists[size_class];
        small_thread_cache.free_lists[size_class] = *(void **)obj;
        small_thread_cache.counts[size_class]--;
        batch[n++] = obj;
    }

    if (n == 0)
        return;

    pthread_mutex_lock(&central[size_class].lock);
    for (size_t i = 0; i < n; i++) {
        small_span_metadata_t *meta = span_meta_from_ptr(batch[i]);
        if (!meta)
            continue;
        *(void **)batch[i] = meta->free_list;
        meta->free_list = batch[i];
        if (!meta->on_central_list) {
            meta->next = central[size_class].span_list;
            meta->on_central_list = 1;
            central[size_class].span_list = meta;
        }
    }
    pthread_mutex_unlock(&central[size_class].lock);
}

Well, before we are done with small alloc we should talk about the function span_meta_from_ptr. The idea here is that we convert a pointer into the metadata reference we want to look at/check, and the lookup itself is done via a fairly straightforward radix table in the following format:

// Virtual Address (64-bit Pointer)
// +----------------+----------------+---------------------+
// | Level 0 Index  | Level 1 Index  |     Span Offset     |
// |    (9 bits)    |    (9 bits)    |      (21 bits)      |
// +-------+--------+-------+--------+----------+----------+
//         |                |                   |
//         v                |                   |
//   +-----------+          |                   |
//   |  Level 0  |          |                   |
//   | [Index 0] |          |                   |
//   | [Index 1]----------+ |                   |
//   |    ...    |        | |                   |
//   | [Index N] |        v v                   |
//   +-----------+      +-----------+           |
//                      |  Level 1  |           |
//                      | [Index 0] |           |
//                      | [Index 1]------> [ small_span_metadata_t ]
//                      |    ...    |      | - size_class: 16      |
//                      | [Index N] |      | - free_list: [ptr]--+ |
//                      +-----------+      | - span_base: 0x...  | |
//                                         +---------------------+ |
//                                                                 |
//                                         +-----------------------+
//                                         v
//                                    [ User Memory / 2MB Span ]
//                                    +-------------------------+
//                                    | [Slot] | [Slot] | [Slot]|
//                                    |  16B   |  16B   |  16B  |
//                                   +-------------------------+

// in the context of our code here
// the lvl 0 index points to 1GB chunks of memory - radix_level1_t *level0[512]
// the lvl 1 index points to 2MB spans within the 1GB chunk
// the span offset points to the offset within the 2MB span


// the small_span_metadata_t is the metadata for the span
// the free_list is a linked list of free objects in the span
// the span_base is the base of the span
// the on_central_list is a flag to indicate if the span is on the central free list
// the size_class is the size class of the span


// the idea here is faster lookups and inserts for small allocations
// metadata is also stored in the radix tree and not the span/allocations themselves
// since each span is only one size class
// we can use the radix tree to quickly look up the metadata for a given pointer

This lets us find and store these span objects easily, and, honestly, half the reason is that this is how Google actually implemented their lookup (TCMalloc's pagemap) and I thought it was an interesting data structure that would be fun to try and implement.
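To make the diagram concrete, here is a runnable sketch of the two-level lookup using the 9/9/21 bit split from the comments above. The names radix_level1_t and level0 come from those comments, but the function bodies here are my reconstruction, not the exact implementation:

```c
#include <stdint.h>
#include <stdlib.h>

/* 9 + 9 + 21 bits, per the diagram: 512 level-0 slots of 1GB each,
 * 512 level-1 slots of 2MB (one span) each. */
typedef struct { void *spans[512]; } radix_level1_t;
static radix_level1_t *level0[512];

#define L0_INDEX(p) (((uintptr_t)(p) >> 30) & 0x1FF)  /* which 1GB chunk */
#define L1_INDEX(p) (((uintptr_t)(p) >> 21) & 0x1FF)  /* which 2MB span  */

/* Any pointer inside a 2MB span maps to the same metadata slot. */
static void *span_meta_lookup(const void *ptr) {
    radix_level1_t *l1 = level0[L0_INDEX(ptr)];
    if (!l1)
        return NULL;            /* unknown pointer: reject, don't guess */
    return l1->spans[L1_INDEX(ptr)];
}

/* Called when a new span is created; lazily allocates the level-1 table. */
static int span_meta_insert(const void *span_base, void *meta) {
    radix_level1_t **slot = &level0[L0_INDEX(span_base)];
    if (!*slot) {
        *slot = calloc(1, sizeof(radix_level1_t));
        if (!*slot)
            return -1;
    }
    (*slot)->spans[L1_INDEX(span_base)] = meta;
    return 0;
}
```

Returning NULL for pointers that were never registered is exactly what lets small_alloc and small_free abort on (or ignore) crafted fake chunks.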

Init

Teardown

dualalloc

Author

Dylan Pachan

Version

1.2.0


Overview

dualalloc is a custom memory allocator that uses mmap as its memory backend.
It supports two operating modes:

  1. Default Mode
  2. GAMBLE_MODE (Arena-Based Mode)

Both modes implement the standard allocation interface:

  • malloc
  • free
  • realloc
  • calloc

All returned memory is aligned to 16 bytes.


Memory Backend

All memory is obtained directly from the operating system using mmap.

The allocator does not rely on the system malloc. Instead, it manages memory explicitly with:

  • mmap for acquiring memory
  • munmap for releasing memory (in default mode)

The allocator metadata declares "mmap" as the memory backend.


Default Mode Design

In default mode:

  • Each call to malloc directly invokes mmap
  • Each call to free directly invokes munmap
  • No persistent allocator state is maintained

This design prioritizes simplicity and correctness.
Each allocation corresponds to a dedicated memory mapping.


GAMBLE_MODE Design

When compiled with GAMBLE_MODE, the allocator switches to an arena-based design.

Architecture

  • The allocator maintains multiple arenas.
  • Each arena manages multiple spans.
  • Each span is a large contiguous region obtained via mmap.
  • Spans are subdivided into blocks.
  • Each block contains metadata including:
    • Block size
    • Allocation state
    • Validation magic value
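The per-block metadata listed above might look like the following; the field names and magic constant here are illustrative, since the README does not show dualalloc's actual struct:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_MAGIC 0xB10CB10Cu  /* hypothetical validation value */

/* One header per block within a span. */
typedef struct block_header {
    size_t   size;      /* usable bytes in this block */
    uint32_t in_use;    /* allocation state           */
    uint32_t magic;     /* validation magic value     */
} block_header_t;
```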

Allocation Strategy

  • Allocation searches for a free block within a randomly selected arena.
  • If a suitable block is found, it may be split.
  • If no suitable block exists, a new span is allocated using mmap.

Freeing

  • Freed blocks are marked as free.
  • Freed memory remains within the span for reuse.
  • Memory is not immediately returned to the OS in this mode.

Alignment Guarantees

All allocations are aligned to 16 bytes.

Alignment is enforced by rounding requested sizes up to the nearest multiple of 16 before allocation.
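The rounding described above is the standard bit trick; dualalloc's exact helper name is not shown, so align16 here is illustrative:

```c
#include <stddef.h>

/* Round a request up to the next multiple of 16. Works because 16 is
 * a power of two: add 15, then clear the low four bits. */
static size_t align16(size_t size) {
    return (size + 15) & ~(size_t)15;
}
```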


Safety Features

  • Magic value validation helps detect invalid frees.
  • calloc includes overflow protection before multiplication.
  • realloc correctly handles:
    • realloc(NULL, size)
    • realloc(ptr, 0)
    • resizing with data preservation
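The calloc overflow protection mentioned above can be sketched as a pre-multiplication check (the helper name is mine, not dualalloc's):

```c
#include <stddef.h>
#include <stdint.h>

/* Reject nmemb * size when the product would wrap around SIZE_MAX,
 * before the allocator ever sees the bogus size. */
static int mul_would_overflow(size_t nmemb, size_t size) {
    return size != 0 && nmemb > SIZE_MAX / size;
}
```

calloc would return NULL (with errno set to ENOMEM, if mimicking glibc) whenever this check fires, instead of allocating a too-small buffer.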

Limitations

  • No coalescing of adjacent free blocks.
  • Not thread-safe.
  • No guard pages.
  • No per-thread caching.
  • In arena mode, spans remain mapped for the lifetime of the process.

Summary

dualalloc demonstrates two allocation strategies built directly on top of mmap:

  • A simple per-allocation mapping approach
  • An arena-based span allocator