mafulechkaNov. 6, 2019, 5:36 a.m.

A fast and thread-safe pool allocator for Qt - Part 2

Content

1. What do developers want to test?
2. Testing with policy
1. 1. Metadata using header
2. 2. Analysis
3. 3. Debug user code
4. 4. Address Sanitizer support

The first part of this article series looked at a pool allocator optimized for small allocations. The developers have said that they do a lot in Qt by allocating QEvent or QObject instances, and a specialized allocator can be useful for their applications as well. So far, the Qt developers' decisions will allocate entire pages of memory as needed and distribute data chunks of a fixed size, which is specified at compile time via a template parameter. It supports a variety of threading models, with different trade-offs in terms of performance, memory efficiency, and concurrency. The developers got very promising results, outperforming general purpose allocators by 3-10 times in multi-threaded tests.

However, with an allocator that can only handle one block size and never returns memory back to the operating system, developers still have a way to go before they can actually support Qt's QEvent and QObject usage scenarios. You can't just waste a library and take up memory, or ask application developers to implement a new/delete operator to be able to allocate instances of their large subclasses.

But before thinking about how to add more complexity, developers need to think about testing. This article will be about this.

What do developers want to test?

Testing an allocator using only black box strategies is very limited. Developers cannot be sure that an allocator is performing correctly by looking only at the pointers it receives. They want to be sure that the allocator is in the state they expect. Ideally, one can do this without having to write test code that more or less duplicates what the allocator already does. Basically it will be just checking that there is the ability to copy/paste code (copy/pasting code). In addition, if you had access to internal data structures, you would have to block them during multi-threaded testing, which you definitely don’t want to do.

But even with a flawless implementation of the allocator
the user code (user-code) will also contain errors. Developers should get predictable errors if their code performs illegal reads or writes, accesses remote objects, has double free errors, or memory leaks. These memory accesses usually cause segmentation violation errors with a useful stack trace. But developers only work with the memory that belongs to the process (unless we hit the edge of a page of memory), and the operating system does not keep track of what is done with this memory. In addition, the use of custom allocators introduces an entirely new type of error, the allocator mismatch. All allocators use data structures and memory layouts that are optimized for their efficient operation, so throwing an arbitrary address into an allocator will result in undefined behavior if that address was not obtained from the same allocator in the first place.

Finally, the Qt allocator is somewhat configurable and designed for rather specific situations. Developers want to collect data on how the allocator is used in real-world conditions so that they can tweak configurations and make sure it's the right type of allocator.

Until the developers know that the allocator works correctly, they can leave aside the error detection in user code. Here are some things to check to be sure of the implementation:

• when allocating as many fragments as fit on the page, and then freeing the first allocated fragment, then the next selection should not allocate an additional page;
• when more data fragments are allocated than fit on the page, then the allocator should allocate a new page;
• in a multi-threaded allocator, used arenas are not deleted if the corresponding thread terminates, but are marked as obsolete;
• in a multi-threaded allocator, the new thread will reuse the arena previously marked as obsolete;
• In a multi-threaded allocator, all threads share the same arena.

Testing with policy

Developers want to make the allocator testable without slowing it down. An allocator that needs to check environment variables or other runtime configurations as part of its code paths to get observable data will have to do some extra work, even if it just checks if the function pointer is nullptr before being called. Developers want to avoid this extra work. Not only does it slow things down, but it can also hide non-obvious race conditions that one might have in one's multi-threaded code.

Fortunately, developers use C++ and implement the allocator as a template class. This allows the use of policy-based design, where an interface is declared through which the allocator can report important events that occur in its implementation:

enum class AnalyzerEvent {
    Allocate,
    Free,
    ArenaCreate,
    ArenaDestroy,
    PageAllocate,
    PageFree,
    MemoryLeaked
};

// the no-op policy
struct NoAnalyzer
{
    reportEvent(AnalyzerEvent, void *) {};
};

template<size_t ChunkSize, ThreadModel Threading = MultiThreaded,
         typename Analyzer = NoAnalyzer>
class PoolAllocator : public Analyzer
{
    // ...
    struct Arena
    {
        Arena(PoolAllocator *allocator)
            : m_allocator(allocator)
        {
            m_allocator->reportEvent(AnalyzerEvent::ArenaCreate, this);
        }
        // ...
        bool grow()
        {
            void *ptr = ::mmap(NULL, sizeof(Page), PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
            m_allocator->reportEvent(AnalyzerEvent::PageAllocate, ptr);
            // ...
        }
        PoolAllocator *m_allocator;
    };
public:
    // ...
    void *allocate(size_t size)
    {
        reportEvent(AnalyzerEvent::Allocate, &size);
        // ...
    }
    void free(void *ptr)
    {
        reportEvent(AnalyzerEvent::Free, ptr);
        // ...
    }
};

With this pattern, the compiler will by default optimize an empty function call. And since the empty base optimization is applied, the size of the PoolAllocator instance does not increase. It doesn't even require a virtual function table! You need to pass a pointer to the allocator to the Arena instances, where some events will be reported, but you will also see later that this pointer has other practical uses, so it's worth paying these 8 bytes per stream.

If a parser class is provided, user code can access its data members directly through the PoolAllocator instance. Care must be taken not to declare any members that conflict with existing members of the PoolAllocator, and the parser must also be made thread-safe, as configured for the PoolAllocator instance. This usually results in a mutex, so using a parser will have some impact on the behavior and performance of the allocator, especially in highly concurrent environments.

struct UsageAnalyzer
{
    QBasicMutex m_mutex;
    int pageCount = 0;

    void recordEvent(AnalyzerEvent t, void *p)
    {
        std::lock_guard lock(m_mutex);
        switch (t) {
        case AnalyzerEvent::PageAllocate:
            ++pageCount;
            break;
        default:
            break;
        }
    }
};

Using such a parser, you can write test cases that confirm the correct behavior of the allocator, for example, to check the first two elements from the list above:

void tst_PoolAllocator::test()
{
    PoolAllocator<16, SingleThreaded, UsageAnalyzer> allocator;
    const size_t numChunks = allocator.chunksPerPage();

    void *objects[numChunks];
    QCOMPARE(allocator.pageCount, 0);
    for (int i = 0; i < numChunks; ++i)
        objects[i] = allocator.allocate(16);
    QCOMPARE(allocator.pageCount, 1);
    allocator.free(objects[0]);
    objects[0] = allocator.allocate(16);
    QCOMPARE(allocator.pageCount, 1);

    void *object = allocator.allocate(16);
    QCOMPARE(allocator.pageCount, 2);

    allocator.free(object);
    for (int i = 0; i < numChunks; ++i)
        allocator.free(objects[i]);
}

The parser class can be extended to keep track of the overall use of the allocator in the application, and can also report memory leaks when the allocator goes out of scope. What has not yet been done efficiently is to associate observable metadata with individual allocators.

Metadata using header

The Qt allocator doesn't require any allocation information header if we ignore the few pointers the allocator itself carries - 99.8% memory usage. Each 4K page contains one pointer to the arena it belongs to, so there are 4088 bytes of chunks of data. This is great, but sometimes it's useful to be able to associate some metadata with each allocator. For example, we want to confirm that allocations using a pool allocator are indeed short-lived. A use case where such metadata might even be part of the production system is a transparently implemented reference counter or garbage collector.

The data structure for the Node type used so far has been a union: for free nodes - a pointer to the previous node on the free stack; for dedicated nodes - array of bytes:

union Node
{
    Node *previous;
    char chunk[ChunkSize];
};

For the default cases where no header is needed, we'll stick with this layout. If a header is needed, then one more data element needs to be added with the required space. Since the pool allocator is a template, metaprogramming techniques can be used:

struct NoHeader {};

template<size_t ChunkSize, ThreadModel Threading = MultiThreaded,
         typename Header = NoHeader, typename Analyzer = NoAnalyzer>
class PoolAllocator : public Analyzer
{
    static constexpr size_t HeaderSize = std::is_empty<Header>::value
                                       ? 0 : sizeof(Header);
    struct NodeDataWithHeader
    {
        char header[HeaderSize];
        char chunk[ChunkSize];
    };
    struct NodeDataWithoutHeader
    {
        char chunk[ChunkSize];
    };
    using NodeData = typename std::conditional<HeaderSize == 0,
                     NodeDataWithoutHeader, NodeDataWithHeader>::type;
    union Node
    {
        Node *previous;
        NodeData data;
        void initHeader()
        {
            if constexpr (HeaderSize > 0
                      && std::is_default_constructible<Header>::value
                      && !std::is_trivially_default_constructible<Header>::value) {
                (void)new (node->header) Header();
            }
        }
        void destructHeader()
        {
            if constexpr (HeaderSize > 0
                      && std::is_destructible<Header>::value
                      && !std::is_trivially_destructible<Header>::value) {
                reinterpret_cast<Header*>(node->header)->~Header();
            }
        }
        static Node *fromMemory(void *ptr)
        {
            return reinterpret_cast<Node*>(static_cast<char*>(ptr) - HeaderSize);
        }
    };
    // ...

C++ does not allow an array of size zero, and even if it does, the C++ standard requires that each member of a structure have a unique address. Even a zero-size array consumes a byte of memory, and sizeof returns 1 for an empty structure. In order not to waste this byte, we select the appropriate structure at compile time, depending on whether the type is empty or not. Using C++17 if the constexpr operator prevents compile-time errors when there is no header, or if the header doesn't require construction, or destruction. The helper functions hide the details so allocate() and free() don't get much more complicated:

void *allocate(size_t size)
    {
        Q_ASSERT(size <= ChunkSize);
        Node *node = arena()->pop();
        if (!node)
            return nullptr;
        node->initHeader();
        return node->data.chunk;
    }

    void free(void *ptr)
    {
        Node *node = Node::fromMemory(ptr);
        node->destructHeader();
        Arena *arena = Page::fromNode(node)->arena;
        arena->push(node);
    }

Since user code will want to do something with the header, the allocator class can give access to the header for the pointer:

 static Header *header(void *ptr)
    {
        if constexpr (HeaderSize == 0)
            return nullptr;
        return reinterpret_cast<Header*>(Node::fromMemory(ptr)->header);
    }
};

Analysis

Now it is possible to do some interesting things with headers and parsers. Since the developers are creating an allocator optimized for small and short-lived objects, let's confirm that the code actually allocates objects:

struct TimingHeader
{
    QElapsedTimer timer;
    TimingHeader()
    {
        timer.start();
    }
};

struct UsageAnalyzer
{
    UsageAnalyzer(...)
    {}

    QBasicMutex m_mutex;
    QMap<size_t, int> m_sizehistogram;
    QMap<h;int, int> m_timeHistogram;
    int m_pageCount = 0;

    void recordEvent(QAllocator::AnalyzerEvent t, void *p);
};

// the biggest event we have seen in our last test was 112 bytes
using EventAllocator = PoolAllocator<112, MultiThreaded, TimingHeader, UsageAnalyzer>;
Q_GLOBAL_STATIC(EventAllocator, eventAllocator);

void UsageAnalyzer::recordEvent(AnalyzerEvent t, void *p)
{
    std::lock_guard lock(m_mutex);
    switch (t) {
    case AnalyzerEvent::Allocate:
        ++m_sizeHistogram[*static_cast<size_t*>(p)];
        break;
    case AnalyzerEvent::Free:
        ++m_timeHistogram[QEventAllocator::header(p)->timer.elapsed()];
        break;
    case AnalyzerEvent::PageAllocate:
        ++m_pageCount;
        break;
    default:
        break;
    }
}

void *QEvent::operator new(std::size_t size) noexcept
{
    qDebug() << "Usage:" << eventAllocator()->m_sizeHistogram;
    return eventAllocator()->allocate(size);
}

void QEvent::operator delete(void *ptr) noexcept
{
    qDebug() << "Timing:" << eventAllocator()->m_timeHistogram;
    qDebug() << "Pages:" << eventAllocator()->m_pageCount;
    eventAllocator()->free(ptr);
}

Using this in Qt's qcoreevent.cpp and running any Qt test will indeed confirm that QEvent allocations are both small and very short lived, with very few exceptions.

Usage: QMap((24, 2367)(32, 26)(112, 557))
Timing: QMap((0, 44)(1, 10)(2, 91)(3, 116)(4, 330)(5, 524)(6, 546)(7, 397)(8, 338)(9, 219)(10, 135)(11, 41)(12, 21)(13, 14)(14, 7)(15, 6)(30, 12)(31, 5)(32, 17)(33, 3)(35, 7)(36, 10)(37, 3)(41, 11)(44, 5)(45, 7)(46, 2)(47, 4)(49, 1)(63, 2)(90, 9)(95, 5)(96, 3)(215, 1)(19800, 2)(19802, 3))
Pages: 4

You can also see that it peaked at 4 pages allocated to QEvent instances. 16K is definitely more than we'd like, so this definitely needs to be optimized.

Debug user code

Now that we can make sure that the allocator is implemented correctly and that the correct allocator is being used in the first place, we can focus on making it easier for users to spot bugs in their code.

For starters, you can add a few trips in debug builds. The allocator being a template again helps because only the code that uses it needs to be built in debug mode. For example, you can add a simple double-free detection by walking the stack and checking if the incoming node is on it:

void push(Node *node)
{
#ifdef QT_DEBUG
    Node *doubleFreeTest = stack;
    while (doubleFreeTest) {
        if (doubleFreeTest == node)
            qFatal("Double free detected for %p", node->chunk);
        doubleFreeTest = doubleFreeTest->previous;
    }
#endif
    node->previous = stack;
    stack = node;
}

It is also possible to fill the freed memory of a piece of data with garbage using memset to increase the likelihood that the use of such memory will cause errors and undefined behavior that is easy to notice and identify when debugging. Doing this only for debug builds does not add any overhead to code running in production.

Qt's allocator architecture also allows for effective protection against allocator mismatches. Since it is easy to map the user's address to the page, from there to arena, and from there, thanks to the allocator pointer that had to be added in order to be able to call the policy interface, the allocator, the check could be implemented:

bool owns(void *memory) const
    {
        Page *page = Page::fromNode(Node::fromMemory(memory));
        return page->arena->m_allocator == this;
    }
    void free(void *ptr)
    {
        if (!owns(ptr))
            qFatal("Allocator mismatch for %p", ptr);

        Node *node = Node::fromMemory(ptr);
        node->destructHeader();
        Arena *arena = Page::fromNode(node)->arena;
        arena->push(node);
    }

It would help if there were multiple pool allocators for different sizes or types. Of course, it's more likely to crash if free() or owns() are called by a completely unknown pointer, in which case there's no problem.

Address Sanitizer support

A very good and popular memory debugging tool is the Address Sanitizer tool. Compilers that support it provide APIs that allow user code to poison and decompress addresses and memory areas. There are also APIs to check if addresses and memory areas are poisoned, and then you can use them to check if the poisoning and removal of areas is correct. Using the ASan APIs, implicitly, also improves testing of the allocator code itself, since we will touch on poisoned addresses if, for example, pointer arithmetic is correct.

In the code review section, ASan support is added as a separate commit.

We recommend hosting TIMEWEB

Stable hosting, on which the social network EVILEG is located. For projects on Django we recommend VDS hosting.