There’s a lot of enthusiasm and demand for smart video surveillance, driven by heightened concerns over acts of terror, by the need to protect us from more traditional threats in our homes, offices and city areas, and by the desire to reduce pilfering and larger-scale theft or damage in stores, warehouses and other areas. The smartness has to be local (in-camera) because it is impractical to require the video feed from all these cameras to be uploaded for constant review. In security applications you want to be able to ignore normal behaviors and be alerted to anomalies, so there is increasing use of artificial intelligence in these systems. The market is expected to grow from $22.B in 2016 to over $55B in 2023, a 13.6% CAGR (ABI Research).

(Source: CEVA)
But there are a couple of more fundamental limitations in vision-only surveillance, starting with field-of-view (FOV) coverage, which is obviously limited for a fixed camera. The second problem is that anomaly detection is limited to visual cues. You might hear a gunshot, but you probably won’t see it. Sound, by contrast, is a cue that can be picked up regardless of direction (if in range).
The obvious solution is to combine video and audio, pairing pan-tilt-zoom control with smart audio detection. The audio should support multiple microphones with beam-forming for direction-of-arrival detection; this is already quite common in smart speakers. Then an AI stage is trained to detect anomalous noises, for example a gunshot, screaming, or a breaking window. Multiple mics take care of 360° coverage, and analysis of the source provides a direction in which to point the camera.
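To make the direction-of-arrival idea concrete, here is a minimal sketch in Python of how the delay between two microphones can be turned into an angle. It is an illustration only, not how any particular product does it: the function names, the 343 m/s speed of sound, the far-field assumption and the brute-force cross-correlation are all assumptions made for this example. A single pair of mics also cannot tell front from back, which is one reason real arrays use more of them.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature (assumption)

def best_lag(a, b, max_lag):
    """Return the lag in samples at which b best lines up with a.

    A positive result means the sound reached microphone b later than microphone a.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                score += x * b[j]  # cross-correlation at this lag
        if score > best_score:
            best, best_score = lag, score
    return best

def arrival_angle_deg(a, b, sample_rate, mic_spacing_m):
    """Estimate the arrival angle, measured from broadside, for a two-mic pair."""
    # The delay can never exceed the time sound needs to travel between the mics.
    max_lag = int(math.ceil(mic_spacing_m / SPEED_OF_SOUND * sample_rate))
    delay_s = best_lag(a, b, max_lag) / sample_rate
    # Far-field geometry: sin(theta) = c * delay / spacing, clamped against rounding.
    s = max(-1.0, min(1.0, SPEED_OF_SOUND * delay_s / mic_spacing_m))
    return math.degrees(math.asin(s))
```

The returned angle is the kind of value that could be handed to the pan control so the camera turns toward the sound.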
Another benefit of this approach is that the camera could stay in standby, burning very little power, until audio trigger detection (which is intrinsically much lower power than an active video camera) wakes it up to analyze a scene. So teaming audio and video detection could be very effective in remote locations where battery-powered operation may be essential.
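The wake-on-audio idea can be sketched in a few lines of Python. This is a deliberately simple stand-in: it fires when a frame is much louder than a slowly tracked background level, whereas a real system would use a trained detector for specific sounds. The class name `WakeTrigger` and its parameters (`ratio`, `smoothing`, `floor`) are hypothetical and chosen only for illustration.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of audio samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

class WakeTrigger:
    """Fire when a frame is much louder than the recent background level."""

    def __init__(self, ratio=4.0, smoothing=0.05, floor=1e-4):
        self.ratio = ratio          # how many times louder than background counts as an event
        self.smoothing = smoothing  # how quickly the background estimate adapts
        self.floor = floor          # lower bound so total silence does not make the trigger hypersensitive
        self.background = None      # learned from the first frame

    def process(self, frame):
        level = rms(frame)
        if self.background is None:
            # First frame only initializes the background estimate.
            self.background = max(level, self.floor)
            return False
        fired = level > self.ratio * self.background
        if not fired:
            # Only non-triggering frames update the background, so a loud
            # event does not raise the threshold against itself.
            updated = self.background + self.smoothing * (level - self.background)
            self.background = max(self.floor, updated)
        return fired
```

In use, each short frame from the microphone would be passed to `process`, and the camera would be powered up only when it returns `True`.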
Opportunities extend beyond using audio as a trigger to guide the camera and then letting vision-based ML take over. Even when we’ve figured out what we want to look at, we humans continue to integrate what we hear together with what we see to draw conclusions. If you can only watch two people having an obviously animated conversation, you don’t know if they’re simply debating last night’s football game or if they’re in a disagreement that might lead to a fight. You have to not only watch them but also hear what they are saying; this doesn’t have to be at a natural language processing level, maybe only needing to check volume, pitch and key words.
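As a rough illustration of checking volume and pitch without full natural language processing, the following Python sketch computes a level in dB and a pitch estimate from a single audio frame. The function names, the 50 to 500 Hz search range and the plain autocorrelation method are assumptions made for this example; keyword spotting is left out because it needs a trained model.

```python
import math

def volume_dbfs(frame):
    """RMS level in dB relative to full scale (samples assumed to lie in [-1, 1])."""
    mean_square = sum(x * x for x in frame) / len(frame)
    # 10*log10 of the mean square equals 20*log10 of the RMS value.
    return 10.0 * math.log10(max(mean_square, 1e-12))

def pitch_hz(frame, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency from the strongest autocorrelation lag.

    Returns None when no positive correlation is found (for example, silence).
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None
```

A rising level together with a rising pitch over successive frames could then be one of the cues that marks a conversation as worth a closer look.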
The goal in this and other cases is not to guarantee detection of anomalous behavior but to filter down to likely anomalies that should be sent upstream for review and/or recording. AI training for this class of detection would naturally benefit from integration, based on combined test cases of audio and video streams. This does not feel like a huge step; vision-based and audio-based AI have each evolved quite significantly but mostly independently. Combining the two should be a natural next step.
All of this of course depends on being able to add smart audio to your smart video camera. You probably already know how you want to manage your camera but are maybe a little less familiar with the audio side. In a typical solution, you’ll position multiple mics (which can be very small) around your device. Those will feed into beam-forming and active noise management hardware/software, followed by an AI stage for trigger word detection, possibly voice biometric and voice command detection if important for your application, and of course anomalous event detection. Products of this type are already available. It can only be a matter of time before combining smart audio with smart video becomes widespread.
This article was published on Voicebot.ai.