In the late stages of bringing up an embedded system, when nothing seems to work, paranoia drifts inevitably to imagined design flaws in the hardware. Perhaps control or timing problems or some unusual sequence in the boot-load process are to blame? Schedule pressure amplifies the stress because you seem to be the last barrier to product release. And this should be easy, surely? Everyone else believes that all the real bugs have been found and resolved in component testing. Your job should be a simple walk through a checklist, with maybe a simple oversight here and there to correct. Why are you taking so long to sign this off?
Car ‘exploded view’, which illustrates how complex it can be to bring many small functional parts together into a full system (Source: CEVA)
We’ve all been there. Others are generally right in assuming that the great majority of bugs have been found and fixed, but those simple oversights can manifest in ways that are not so easy to discover. Maybe this is a one-time problem; you go through the pain once, set up a process to avoid that problem in future and you’re done, right? However, in SoC software integration bringup you’re repeatedly loading new software builds into the system, and there are multiple links in that chain.
You must first create the build on a PC, then load it into flash memory in the system. From there, a simple controller may load its own simple program (from a dedicated memory), using which it will then unpack and upload boot data to main memory. Then the controller might trigger a special bootload instruction in the main processor, say a DSP, signaling that processor to execute a second level of unpacking from the main memory into fast tightly-coupled memory (TCM) sitting close to the processor.
There are a lot of steps here, and a lot of places where a simple oversight, one you could fix in minutes if only you knew about it, can make you wonder if you just bricked your chip.
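One way to keep that chain manageable is to treat it as an ordered list of stages and ask, in boot order, which link is the first to fail. The following is only a minimal sketch of that idea: the stage names paraphrase the steps above, and the `check` functions are hypothetical placeholders that on a real board would read a status register, a mailbox word, or a debug probe.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class BootStage:
    name: str
    check: Callable[[], bool]  # returns True if this link in the chain looks healthy


def first_broken_stage(stages: List[BootStage]) -> Optional[str]:
    """Walk the chain in boot order and return the name of the first failing stage."""
    for stage in stages:
        if not stage.check():
            return stage.name
    return None  # every stage reported healthy


# Hypothetical checks: constants stand in for real hardware queries.
stages = [
    BootStage("build created on PC", lambda: True),
    BootStage("image written to flash", lambda: True),
    BootStage("controller runs its own program", lambda: True),
    BootStage("boot data unpacked to main memory", lambda: False),
    BootStage("DSP unpacks into TCM", lambda: True),
]

print(first_broken_stage(stages))
```

Checking in boot order matters because a failure early in the chain usually makes every later stage look broken as well.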
Start with a file name. If you enter it incorrectly, that’s your own fault, isn’t it? But what about the Microsoft “long-dash versus short-dash” problem? MS tools will sometimes ‘helpfully’ autocorrect between short and long dashes, even on copy/paste, where you would assume you are reliably transferring a string. This is incredibly easy to miss, and incidentally it was a real problem we encountered (as are the examples that follow). A real upload requires many files; this problem affected just one of them, yet it completely broke the boot. Eventually we traced it back to the programmer who updated the controller software. He copied the file name from an email sent by the firmware programmer, and that copy/paste created the bug.
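A substituted dash is nearly invisible on screen but trivial to find by machine. As a rough sketch (not the check we actually used), the function below flags every non-ASCII character in a file name and reports its Unicode name. The file names are made up for illustration, and the sketch assumes your build file names are meant to be plain ASCII.

```python
import unicodedata
from typing import List, Tuple


def suspicious_chars(name: str) -> List[Tuple[int, str, str]]:
    """Return (index, character, Unicode name) for every non-ASCII character in name."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(name)
        if ord(ch) > 127
    ]


# Hypothetical file names: the second contains an en dash (U+2013) instead of a hyphen.
for name in ["dsp-boot.bin", "dsp\u2013boot.bin"]:
    print(repr(name), suspicious_chars(name))
```

Running a check like this over every file name in a load script or manifest before an upload would catch this class of error without anyone having to squint at dashes.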
Still on the controller, who hasn’t experienced this one? A software update under pressure skipped a file, which also happened to be for the controller software. So that software, the first link in the boot chain, was out of date and that broke the rest of the boot. Took a while to find that one, I can personally attest.
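A skipped file is the kind of oversight a release manifest can catch before anything is loaded. The sketch below assumes a hypothetical manifest that maps each file name in a release to its expected SHA-256 digest; the function name and the manifest format are illustrative, not part of any particular tool.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_release(directory: Path, manifest: Dict[str, str]) -> List[str]:
    """Compare files in directory against manifest {file name: expected SHA-256}."""
    problems = []
    for name, expected in manifest.items():
        path = directory / name
        if not path.is_file():
            problems.append(f"missing: {name}")
        elif sha256_of(path) != expected:
            # Present but different: most likely left over from an older build.
            problems.append(f"stale or modified: {name}")
    return problems
```

An empty result means every file named in the manifest is present and matches. A stale controller image would show up as a single `stale or modified` line instead of as a dead board.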
Or think about working with a processor to which you can add your own instructions. Those instructions may be tweaked during design, or somewhere along the line the design team may decide to switch to a newer rev or model of the processor. For whatever reason, these changes aren’t always fully mirrored in software updates. So you upload software, some of which compiles to opcodes that behave in unexpected ways because they don’t map to the intended operations. None of this is detected in simulation because your software development was slightly out of sync with the hardware.
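One defensive habit is to record which processor configuration a build targeted and to compare that record against the hardware before trusting the load. The sketch below is purely illustrative: the `IsaConfig` fields, including the digest of the custom-instruction definitions, are assumptions, and how you would obtain either record depends entirely on your toolchain and your chip.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IsaConfig:
    core_model: str
    core_rev: int
    custom_isa_hash: str  # hypothetical digest of the custom-instruction definitions


def mismatches(hardware: IsaConfig, software: IsaConfig) -> List[str]:
    """List every field where the software build disagrees with the hardware."""
    diffs = []
    for attr in ("core_model", "core_rev", "custom_isa_hash"):
        hw, sw = getattr(hardware, attr), getattr(software, attr)
        if hw != sw:
            diffs.append(f"{attr}: hardware={hw!r} software={sw!r}")
    return diffs


# Hypothetical values: the software was built against an older core revision.
hardware = IsaConfig("dsp-x", 2, "abc123")
software = IsaConfig("dsp-x", 1, "abc123")
print(mismatches(hardware, software))
```

A non-empty list is a reason to stop and rebuild before spending time debugging opcodes that were never going to behave as intended.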
Complex problems can happen too, but in my experience they’re rarely if ever the root cause of basic ‘doesn’t boot’ issues. When you have multiple moving parts in the sequence, from code assembly and configuration through a complex boot sequence to ultimate run-time behavior, it’s not surprising that mundane errors will happen somewhere during bringup. Managing this process with a minimum of surprises requires tight configuration management and debug, built on an understanding of both the hardware and the software. Trying to sidestep any of this (‘I can manage this part myself with a little scripting and I’ll figure out any problems on the board’) is a recipe for major headaches. To get an insight into how CEVA users manage these problems, read my post series: The Complexity of Hardware Debugging.
Published on Embedded.com.