The case for a better buffer

Published on Aug 14, 2018
By Terry / Senior System Design Engineer

For some time now I work for a company that makes embedded products for healthcare. These all tend to acquire some data, record and/or process this data, then sent it to a tablet/PC for analysis. One of the main characteristics is that data is ‘flowing’ through the system, often a low power embedded microcontroller. As these microcontrollers do not have much memory (or processing power), to make sure data is handled properly it needs to be buffered. Often more than once – depending on the processing steps required.

To buffer data, streaming data, on such embedded microcontrollers there are quite some challenges: since there is not much memory available only a little can be buffered – it needs to be processed quickly. The number of copy actions is limited – since the processor cannot always use DMA (Direct Memory Access) for copying it needs to be running and thus consumes power. To prevent memory fragmentation the buffering is done mostly in ringbuffer-like structures. Data usually comes in via peripheral busses (SPI/I2C) using DMA and interrupts, a buffer is needed to decouple the application from the ISR (Interrupt Service Routine) context. Would it not be nice to solve this entanglement of restrictions in one go?

It would require a new type of buffer. A ringbuffer, most likely, since this would avoid the fragmentation of the memory. The implementation should be small, as it is for embedded use. Since peripheral DMA is often available, if we could request a ‘chunck’ of memory, let DMA fill this and ‘commit’ when DMA is done – this would allow the microcontroller to do other work while the buffer is being filled, even sleep – to save power. If DMA can fill the buffer, we do not need to reserve some memory for DMA , it can be put in buffer directly; this would not only save a copy of the data, but also removes the need for storing the data temporarily – saving power and memory. But if DMA fills the buffer, it needs to be thread safe – at least for a single producer (the DMA) and single consumer (the application) – this decouples the ISR from the main application context. Although a mutex can be used, this is rather slow on a microcontroller. If the buffer can be lock free (using atomics) it will help performance (and again save power). DMA works with ‘chunks’ of memory, so does a common ‘memcpy()’ or ‘std::copy()’. If the buffer can manage its data in ‘chunks’, or ‘contiguous blocks’ the user will not need to worry about data being wrapped halfway through. There needs to be some management to handle these ‘chunks’ in the buffer: if there is no room at the end of the buffer, there may be enough at the start. The management should prevent a next, smaller chuck from being allocated at the end – making the data not consistent anymore.


This would require a combination of a few buffers and techniques – so far I have not been able to find one. During my search I found some other interesting buffers which served as inspiration (and can be used for other purposes very well):

    This is a small, fast and robust buffer. It works on a per-element basis, each element is copied in/out of the buffer. The website has a very nice read on the background of the buffer.
    This is a more generic buffer, with more emphasis on speed. It can work with ‘chuncks’, although data must be copied into the buffer – this either succeeds or fails – depending if the buffer has room. Also here the website has a very nice read on the background of the buffer. A more advanced multi-producer-multi-consumer buffer is available as well.
    In the Boost libraries the spsc_queue is available. It works on a per-element basis, each element is copied in/out of the buffer.
    A buffer working with ‘chuncks’. The first picture more clearly illustrates how working with ‘chuncks’ is intended. Alas, the buffer is not thread safe. A smaller implementation is available here:

When searching for a fitting buffer, I noticed many people tried and created their own version of a thread safe, lock free, ringbuffer, in one variant or another. Many people claim their buffer to be thread safe – but not being able to explain as to how this is guaranteed. I love to trust these people to be right, alas I cannot be sure – threading issues are most difficult to find and I would rather avoid them if possible. Before using such buffer – be sure to check for proof of thread safety and if the buffer comes with some threading tests – even write your own to build up confidence with the buffer. Better be safe than sorry.

That said, I was not able to find a buffer which matched my requirements and thus I tried to write my own. The first attempt I started very enthusiastic, having a go with Test Driven Development and rather quickly got stuck in the various corner cases – the refactoring became an entangled mess. Thinking back at this attempt I probably did not write down my requirements clearly enough.

My second attempt was to draw – in very small steps – the actions that the buffer could do. Visualize how data would flow through the buffer and thus draw the behaviour. Then implement each of the methods as needed. This did result again in many corner cases, but now the buffer was functional complete and robust. When testing for thread safety however, it failed miserably. I managed to put in not 1, but 2 race conditions and this took quite some time to figure out.

My third attempt was a redesign to remove 1 race condition altogether, while making sure the second (although still present) cannot happen. After more testing I can validate the buffer is thread safe, and started documenting the code as to how this is guaranteed. This ‘proof’, or explanation of how this buffer is thread safe means a lot to me, I took quite some time to validate this part.

For the people interested, I included the buffer in a link below. Note that the intended use is on an embedded platform, often without OS – bare metal C++11. It has an unrestricted licence, and can easily be used in any sort of project. If you do happen to find an issue or have an improvement, please let me know. Code is available here:

This buffer has some peculiarities, if a more generic buffer can be used for your project it would still have my preference – although this version sure has been a great benefit for me.

A week or so after releasing this buffer it occurred to me that the requirement of contiguous blocks is useful only a in a few limited cases. For most cases the mentioned ‘regular’ ringbuffer will work better as it has slightly less overhead and it can fill all elements in the buffer since the data is allowed to be split/wrapped. This would allow for slightly smaller buffers (always a plus in a memory constrained system). Since I was on a roll with the test framework, I added my variant as well. Code is available here: