mafulechkaOct. 22, 2019, 5:06 a.m.

A fast and thread-safe pool allocator for Qt - Part 1

Content

1. Some statistics
2. Allocator general purpose versus specialized
3. Basic data structures
4. Benchmarking
5. Multi-threaded models
1. 1. Locks
2. 2. Arenas
3. 3. Comparison of flow models
6. Multi-threaded benchmarking

The code on which this article is based is under development, with various commits pending under the Allocator topic. Note that the code uses various C++17 features.

A few months ago, the Qt Company developers were working on fixing how QHostInfo sends results to the caller, which resulted in a slight optimization of how memory is allocated for arguments in QMetaCallEvent. Developers use objects of this type in signal/slot connections to queues, and spending some time in this code reminded me that Qt has a pattern of allocating rather small, short-lived objects on the heap, as well as code critical path performance.

So they began to wonder if it was possible to allocate QEvent instances from an allocated memory pool using a special allocator.

Some statistics

Reimplementing QEvent::operator new(size_t) and using QHash as the histogram data structure shows that most QEvent instances in a small Qt application are only 24 bytes in size:

using HistogramHash = QHash<size_t, int>;
Q_GLOBAL_STATIC(HistogramHash, histogram);

void *QEvent::operator new(std::size_t size) noexcept
{
    ++(*histogram())[size];
        qDebug() << "QEvent:" << *histogram();
    return ::operator new(size);
}

By doing the same for QObject, we get a wider spread, but here, too, most of the instances are small. 16 bytes is the most common size.

This is not surprising. The QObject instance has a table of virtual functions and a d-pointer to the QObjectPrivate where all the real data resides. Subclasses of QObject use subclasses of QObjectPrivate for their private data and pass those private instances to QObject for storage.

Thus, QEvent and QObject instances are quite small, and even the application allocates a whole bunch of them. How difficult would it be to implement a memory allocator optimized for these use cases?

Allocator general purpose versus specialized

By default, global C++ operator new allocates memory using malloc(). malloc() is a general purpose allocator that should work well for all sizes, from a few bytes to a few gigabytes, so it has a bit of bookkeeping overhead for each allocation. There are various implementations of malloc(), and some of them are really amazing at optimizing small object allocation and other common use cases. However, they are still general purpose allocators, so they have to make decisions at run time that a specialized allocator might not have to make.

Specialized allocators are typically used in situations where data structures must be allocated very dynamically and data creation and destruction must be high performance. An application area in which small, short-lived objects are often created and destroyed is games: projectiles in a first-person shooter are one example. Other domains are scene graphs in rendering engines, parsers that create ASTs, hash tables. In all these cases, it is often known in advance what kind of memory allocation will be required, and the developers would very much like to reduce several CPU cycles.

Another reason a custom allocator can be valuable is that some platforms don't ship with a memory management block. As recently announced, Qt now runs on microcontrollers and these devices often don't have an MMU. In such environments, memory fragmentation resulting from frequent small allocations can quickly become a serious problem.

By pre-allocating memory and managing this pool in a manner optimized for appropriate usage patterns, a specialized allocator can protect a lot of overhead and kernel round trips. Memory fragmentation issues will at least be limited to this memory pool and are less likely to affect the entire system.

Basic data structures

Qt's pool allocator should give fast allocation and deallocation for various objects, but usually small sizes known at compile time. The allocator must operate on a memory pool large enough to require only very infrequent kernel accesses. Allocations and deallocations should be possible in a random order.

A quick data structure for managing free chunks of memory at no extra cost is an in-place stack: a pre-allocated page of memory is divided into chunks of equal size, each at a node that points to the previous chunk. For selected nodes, you don't need to know the previous node, so you can use union and the entire block for user data. An allocation pops a node from the stack, a deallocation moves the node back onto the stack.

template<size_t ChunkSize>
class PoolAllocator
{
    struct Page
    {
        char memory[4096];
    };

    union Node
    {
         Node *previous;
         char chunk[ChunkSize];
    };
public:
    PoolAllocator()
    {
        initPage(&m_page);
    }
    void *allocate(size_t size)
    {
        Q_ASSERT(size <= ChunkSize);
        if (!stack)
            return nullptr;
        Node *top = stack;
        stack = stack->previous;
        return top->chunk;
    }
    void free(void *ptr)
    {
        Node *top = static_cast<Node*>(ptr);
        top->previous = stack;
        stack = top;
    }

private:
    void initPage(Page *page)
    {
        Node *newStack = reinterpret_cast<Node*>(page->memory);
        newStack->previous = stack;

        const size_t numChunks = sizeof(Page) / sizeof(Node);
        for (size_t i = 0; i < numChunks - 1; ++i, ++newStack)
            newStack[1].previous = newStack;
        stack = newStack;
    }

    Node *stack = nullptr;
    Page m_page;
};

This gives O(1) complexity to allocate and deallocate, and almost 100% memory efficiency. All the allocator needs to save is a pointer to the top of the stack (for a 4K memory page, this means 8 bytes of overhead on a 64-bit system. That's very good!) But for now, the Qt allocator can only work with a fixed amount of memory.

Instead of being limited to one page of memory and returning nullptr when it's used up, one could allocate another page and push the chunks onto the stack. This way the memory pool can grow as much as needed without having to worry about making the memory pool big enough at compile time. It will still be possible to fit more than a hundred 40-bit objects into a single 4K page of memory, and a single kernel access will suffice when it needs to grow. However, it is now necessary to maintain a vector of pointers in memory pages so that pages can be freed when the allocator is destroyed.

template<size_t ChunkSize>
class PoolAllocator
{
    struct Page
    {
        char memory[4096];
    };

    union Node
    {
         Node *previous;
         char chunk[ChunkSize];
    };
public:
    PoolAllocator()
    {
    }
    ~PoolAllocator()
    {
        for (auto page : pages)
            ::free(page);
    }
    void *allocate(size_t size)
    {
        Q_ASSERT(size <= ChunkSize);
        if (!stack && !grow())
            return nullptr;
        Node *top = stack;
        stack = stack->previous;
        return top->chunk;
    }
    void free(void *ptr)
    {
        Node *top = static_cast<Node*>(ptr);
        top->previous = stack;
        stack = top;
    }

private:
    bool grow()
    {
        Page *newPage = static_cast<Page*>(::malloc(sizeof(Page)));
        if (!newPage)
            return false;
        initPage(newPage);
        pages.append(newPage);
        return true;
    }
    void initPage(Page *page)
    {
        Node *newStack = reinterpret_cast<Node*>(page->memory);
        newStack->previous = stack;

        const size_t numChunks = sizeof(Page) / sizeof(Node);
        for (size_t i = 0; i < numChunks - 1; ++i, ++newStack)
            newStack[1].previous = newStack;
        stack = newStack;
    }

    Node *stack = nullptr;
    QVector<Page*> pages;
};

The allocator can now grow one page at a time to the desired size.

Benchmarking

There is a basic pool allocator as a template class created using the data chunk size. This allows the compiler to generate very efficient code. How fast is it really? A trivial benchmark simply allocates and frees memory in a loop. We need to optimize for small objects, so we need to compare the speed for 16, 32, and 64 bytes of memory, allocate 100, 200, and 400 blocks at a time, and then free those blocks in two cycles, freeing first the odd and then the even indexes. To get fairly stable and comparable results, we run 250,000 iterations.

void *pointers[allocations];
    QBENCHMARK {
        for (int i = 0; i < 250000; ++i) {
            for (int j = 0; j < allocations; ++j)
                pointers[j] = ::malloc(allocSize);
            for (int j = 0; j < allocations; j += 2)
                ::free(pointers[j]);
            for (int j = 1; j < allocations; j += 2)
                ::free(pointers[j]);
        }
    }

The allocators that are being compared are the default implementations of malloc on desktop platforms - macOS 10.14 (developer platform), Windows 10 with VC++17, and the Linux malloc system (Ubuntu 18.04 which uses ptmalloc2). On Linux, Google's tcmalloc, Microsoft's mimalloc, and jemalloc from FreeBSD and Firefox were also compared. It is trivial to use these allocators instead of glibc's malloc() on Linux, where they can simply be LD_PRELOAD.

$ LD_PRELOAD=LD_PRELOAD=/usr/local/lib/libtcmalloc.so ./tst_bench_allocator

These results are quite promising:

All allocators have stable enough behavior for this test. The size of the allocated block hardly matters, and doubling the number of allocations takes twice as long.

Considering the ease of implementation, it's not surprising that Qt's specialized pool allocator outperforms even the fastest general-purpose allocators by 3x over mimalloc and 70x over system malloc in macOS 10.14.

Multi-threaded models

One use case for which the Qt developers want to design an allocator is QEvent allocations in general, and QMetaCallEvent allocations in particular. QEvents can be created from multiple threads, so the allocator must be thread-safe. Moreover, QMetaCallEvent is used to pass signals/slots between threads, and events can usually be posted to objects on different threads. Thus, the thread that allocates will not necessarily be the thread that frees the object, and the thread that is allocated might even terminate before the object has been processed and disposed of on the receiving thread.

But there will certainly be cases where client code can ensure that the same thread that allocates memory is also the thread that frees that memory, or when everything happens on the same thread (e.g. scene graph work). Ideally, it is possible to make an optimized implementation for each of these situations without duplicating all the code.

Locks

The options for creating a thread allocator are safe by introducing a lock or by using a lock-free stack implementation using atomic instructions. An atomic implementation will make the stack implementation itself lock-free, but it won't help us grow when we run out of the stack. With a mutex, however, view contention will be very high if multiple threads are trying to allocate and deallocate memory.

It is possible to reduce lock contention by allocating a piece of data from the free stack, but returning the piece to a different "backlog" stack. Then you can use another mutex to protect the backlog, when free data fragments run out, it is possible to swap places in the backlog. Both mutexes must be held for this operation.

void *allocate()
{
    auto stack_guard = lock_stack();
    // try to adopt backlog
    if (!stack) {
        auto backlog_guard = lock_backlog();
        stack = backlog;
        backlog = nullptr;
    }
    if (!stack && !grow())
        return nullptr;
    Node *top = stack;
    stack = stack->previous;
    return top->chunk;
}

void free(void *ptr)
{
    auto guard = lock_backlog();
    Node *top = static_cast<Node*>(ptr);
    top->previous = backlog;
    backlog = top;
}

This helps a little, but running this test with multiple threads quickly shows that this alone cannot be a successful strategy.

Arenas

A more complex architecture would be if each thread ran in its own "arena". Each thread needs to have its own memory pool in order to work, so there will be some overhead. You can use the thread_local variable for a thread-specific arena. The lifetime of arena is related to the lifetime of the thread using std::unique_ptr.

private:
    struct Arena
    {
        void push(void *ptr);
        void *pop();
        void grow();
        void initPage();

        QVector<page*> pages;
        Node *stack = nullptr;
    };

    inline thread_local static std::unique_ptr<Arena> current_thread;

    Arena *arena()
    {
        Arena *a = current_thread.get();
        if (!a) {
            a = new Arena;
            current_thread.reset(a);
        }
        return a;
    }

public:
    void *allocate(size_t size)
    {
        Q_ASSERT(size <= ChunkSize);
        Node *node = return arena()->pop();
        return node ? node->chunk : nullptr;
    }
    void free(void *ptr)
    {
        Node *node = static_cast<Node*>(ptr);
        arena()->push(node);
    }

And as long as client code can ensure that the memory is freed by the same thread that allocated it, no locks need to be used to synchronize operations on the arena stack.

If we combine the two approaches, we get an allocator in which each thread retrieves fragments from its own arena without any blocking. Threads freeing memory coming from another thread's arena will push chunks onto the original arena's backlog stack, so they only compete with other similar freeing threads for the needed lock. When the stack is exhausted, the backlog changes, but then the allocator thread will also have to compete for the lock. However, we need to be able to map some of the memory that the client code passes to free() to the correct arena, since it's no longer possible to just use the arena thread_local pointer, otherwise we'll end up with a freeing thread that will take all the memory from the allocating threads. Having a pointer to arena for each piece of memory on the stack would be one option, but introduces quite a bit of resource allocation overhead. Having to write 8 bytes for a 40 byte fragment means that 20% of the memory is wasted.

Instead, you can use the fact that memory pages allocated with mmap() are always 4K byte aligned. In a page address allocated using mmap (or VirtualAlloc on Windows), the bottom 12 bits (4096 = 2^12) will always be zero, i.e. 0x123456789abcd000. The data chunk addresses on the page will have the same top 52 bits, so it is possible to apply a simple bitmask to the data chunk address and get the page address. You can then make the page of memory contain a pointer to the arena it belongs to, so you only have to sacrifice 8 bytes for each 4K page.

pivate:
    struct Page
    {
        char memory[4096 - sizeof(Arena*)];
        Arena *arena;

        static Page *fromNode(Node *node)
        {
            return reinterpret_cast<Page*>(uintptr_t(node) & ~(sizeof(Page) - 1));
        }
    };

    struct Arena
    {
        bool grow()
        {
            void *ptr = ::mmap(NULL, sizeof(Page), PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
            Q_ASSERT_X(!(uintptr_t(ptr) % sizeof(Page)),
                       "PoolAllocator", "Page Alignment Error!");
            if (!ptr)
                return false;

            Page *newPage = static_cast<Page*>(ptr);
            initPage(newPage);
            newPage->arena = this;
            pages.append(newPage);
            return true;
        }
    };

public:
    void *allocate(size_t size)
    {
        Q_ASSERT(size <= ChunkSize);
        Node *node = arena()->pop();
        return node ? node->chunk : nullptr;
    }

    void free(void *ptr)
    {
        Node *node = static_cast<Node*>(ptr);
        Arena *arena = Page::fromNode(node)->arena;
        arena->push(node);
    }

Adding to the complexity is the possibility that the selection thread ends while the objects in its arena are still alive. To do this, we need to separate the lifetime of the memory pool from the arena (which is linked to the lifetime of the thread via the thread_local storage). The allocator can take care of such stale memory pools. Future streams will accept them.

To implement this, you need to find out if there is still allocated memory. The data structure does not know anything about the allocated memory, only about the free memory. So, the fastest way to check is to iterate over the entire stack and count how many nodes it has. If there are fewer nodes than will fit on all allocated pages, memory is still being used.

Comparison of flow models

So now there are four models of multithreading, with different trade-offs in terms of performance, memory overhead, and concurrency.

• Single-threaded (single-threaded), as a trivial case with one arena and no locks.
• The shared pool model, in which all threads operate on the same arena with two locks.
• A multi-pool model, where each thread must stay within its own arena and does not need any locks.
• Finally, the multi-threaded model, in which threads are allocated from their own arena, but memory can be freed by any thread, and threads can terminate while the memory allocated from their arena is still in use.

The first three models will have many use cases, but for the QEvent case, you will need a multi-threaded model.

Running the previous test using these different implementations, but without using threads, measures the net overhead that the different strategies represent:

It seems that the additional code required to implement the arena strategy introduces very little overhead. The shared pool model, on the other hand, won't have much of an advantage over just using malloc, although it can still be useful when considering memory fragmentation.

Multi-threaded benchmarking

Benchmarking allocators in a realistic, multi-threaded scenario is not easy. Even the malloc system is very fast, so doing anything non-trivial other than allocating and freeing memory in a test makes it hard to measure the real impact of the allocator. However, if memory is only allocated and deallocated, then the impact of lock contention in allocators becomes greatly exaggerated. On the other hand, developers may need to use synchronization primitives such as mutexes or wait conditions to model data flow in realistic producer/consumer scenarios (for example, when Qt uses QMetaCallEvent for signal/slot connections that cross flows). This will hide the impact of locks in the allocator.

However, the relative numbers should give some idea of how efficient the Qt allocator is, which is significantly faster than others under stress, and is highly likely to have some positive impact on the performance of a real application. Based on the observation from the first test that all allocators have the same performance characteristics, no matter how big the objects are. Now we just run one test for 16-byte objects, but with as many threads as we have processors.

The threading model used here is the full multithreaded implementation, where the Qt allocator still outperforms the fastest general purpose alternative by a factor of 2.5, and the malloc system defaults to a factor of at least 7.
In the next article, we will look at methods for testing and debugging memory. Let's show how the ideas developed so far can be used when distributing objects of different sizes. And maybe we'll find out how to free the blank pages.

We recommend hosting TIMEWEB

Stable hosting, on which the social network EVILEG is located. For projects on Django we recommend VDS hosting.