Experienced embedded-systems programmers already know most of the tricks for optimizing code to a target platform, but many of us have come relatively recently to embedded-systems programming, having learned our coding skills on less constrained platforms. There we didn’t have to worry much about hardware details beyond the general efficiency of our algorithms. In embedded applications we need to be far more careful, certainly for performance, but also to fit what we want to do within the practical limits of the hardware target, especially the types and size of available memory. Memory in these systems is the suitcase into which you must fit all your software and all your working data, and the suitcase is often a lot smaller than you would like it to be.
As an added complication, there’s often more than one type of memory in these systems. To keep programming simple, you see just one logical memory space, but some address ranges may be implemented in hardware in different ways. Part of the space may be external main memory, accessible from the processor through one or more levels of cache. A common hardware optimization implements another range as tightly coupled memory (TCM). This sits on the same chip as the processor, usually right next to it, and provides guaranteed single-clock-cycle access for any instructions or data stored there. Standard memory can match that performance only when the instruction or data is already in cache; otherwise the access has to go out to main memory, taking many more clock cycles. TCM is one example of (memory-mapped) fast on-chip memory; such memory has other uses too, such as image buffers for fast access in image processing.
One more consideration: using on-chip memory reduces power consumption, whereas going to main memory draws more power, thanks to the higher current needed to drive all those package pins and board interconnects between chips. This is an important consideration in low-power applications.
Why not just use big on-chip memories and load/store infrequently from off-chip? Unfortunately, large on-chip memories increase chip area significantly, and as chip size grows the device becomes more expensive and less competitive. System architects have to balance performance gains against this cost very carefully, considering whether they might provide only, say, 16KB of TCM versus as much as perhaps 1MB. That puts a lot of responsibility back on you, the programmer, to use these memories (or to plan them, if you have a say in early chip architecture) as frugally and carefully as you possibly can, especially when deciding which functions or data go into fast memory.
Some of what you need to do here is fairly obvious; I’ll assume you’re starting either from a PC-based implementation or from one developed for an earlier product. Since you’re obviously interested in DSPs, you probably plan to do a lot of floating-point calculation. Reduce datatypes from double precision to single precision wherever you can; this alone might cut your data size in half.
Scratch memory pools, which allocate one chunk of memory in a single operation to serve multiple smaller, related allocations, are popular because allocation and deallocation are fast, but they can be very expensive in memory. Try to merge them all into one pool, as long as they’re not used in parallel; or bite the bullet and return to traditional mallocs on the heap, which may be a bit slower but can be a lot more efficient in memory.
Especially when it comes to TCM, profile the code to find the functions which consume the most run time. Your strategy here is going to be to decide which of these, starting with the highest-demand function, you can fit into the TCM. Of course there has to be some judgement here. If a high-demand function calls a low-demand function, can you afford to leave that low-demand callee out in cached main memory? Maybe that will be OK, as long as the cache-hit rate is high or an occasional longer delay is tolerable.
An example where a longer delay might be OK would be in a music player supporting both MP3 and FLAC decoders. You’re only going to use one at most per song, so they don’t both need to be resident in fast memory. Accept the delay to load whichever is needed, on demand from off-chip into the fast memory.
You want to squeeze production code and data to the smallest size possible, so as a point of general good hygiene, make sure that all debug, profiling, and logging code is bracketed in preprocessor conditionals that can be disabled for the production build. In PC code you might not worry too much about this (especially if you want to run the debugger on production software), but here it’s essential. Also make sure you run all your regression tests with that code disabled: it only takes one overlooked run-time dependency inside the debug code to create downstream nightmares.
Equally, make sure that every bit of code inside your software is being used. Run coverage tests. If you find code that isn’t being used, maybe it’s a hangover from an earlier rev where it might have been needed. Here it isn’t, so you should be able to get rid of it, right? Again, you have to be careful. Maybe it’s error-handling for a very rare case which can’t be overlooked. Maybe it should be included in the regression tests but it’s too hard to trigger directly. You’ll have to decide based on discussion with the architect and maybe the hardware team.
And finally, argue with the architect (and marketing if needed) about which of the features they’re demanding are really essential. They might not realize that, after every optimization you can possibly think of, the suitcase still won’t close. Then they’re going to have to decide which really cool feature they really, really wanted may have to be sacrificed. Or maybe they have to go back to the business team and demand larger on-chip memories, using information you can provide on just how much those memories need to grow. Either way, you’re going to look good!
Published on Embedded.com.