Two traditional techniques for acoustic localization are beamforming and time-delay estimation. Historically one has had to choose between variations of these two techniques, trading the accuracy of the beamforming approaches against the speed of the time-delay estimation algorithms. In our research we have discovered that this tradeoff is not fundamental to the problem itself but only to the way in which we have been looking at it. One does not have to sacrifice accuracy for speed, at least in scenarios in which the spacing between the microphones is not too great, such as when the microphones are up to a few meters apart in an indoor room.
We have developed a technique called accumulated correlation that combines the speed of time-delay estimation with the accuracy of beamforming. As seen in the diagram below, the idea is very simple. Signals from pairs of microphones are cross-correlated, and the entire cross-correlation vector is mapped to a common coordinate system to measure the likelihood that the sound source is at any of a number of candidate locations. The prefiltering and temporal smoothing are optional steps that can be added to any algorithm.
The principle at the heart of most acoustic localization algorithms, including accumulated correlation, is that the sound emitted by a sound source will generally take a different amount of time to reach each microphone in the array. By measuring the time of arrival (or equivalently, time delay) of the signal for each microphone, the location of the sound source can be determined. Consider the case when just a pair of microphones is available, as shown in the figure below. The sound will reach one microphone at time t1 and the other at time t2. The relative time delay t = t1 - t2 can be estimated by selecting the peak of the cross-correlation vector between the two microphone signals. If t is correctly estimated, then the sound source must lie at a point in space such that t1 - t2 = t, which defines one half of a hyperboloid.
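The peak-picking step for a single microphone pair can be sketched in a few lines of Python with NumPy. This is only a minimal illustration: the function name estimate_delay, the parameter fs (sampling rate in Hz), and the use of a plain, unfiltered cross-correlation are assumptions made here, not details taken from the original work.

```python
import numpy as np

def estimate_delay(x1, x2, fs):
    """Estimate the relative delay t = t1 - t2 (in seconds) for one microphone pair.

    x1, x2 : equal-length 1-D arrays holding the two microphone signals (assumed).
    fs     : sampling rate in Hz (assumed parameter name).
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n = len(x1)
    # Full cross-correlation vector; entry k corresponds to a lag of
    # k - (n - 1) samples, so the lags run from -(n - 1) to n - 1.
    cc = np.correlate(x1, x2, mode="full")
    lags = np.arange(-(n - 1), n)
    # A positive lag means the sound reached microphone 1 later than microphone 2.
    return lags[np.argmax(cc)] / fs
```

A positive return value means the signal arrived at the first microphone later than at the second. The result is quantized to whole samples, and with real signals the highest peak is not guaranteed to correspond to the true delay.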
Cross-correlation, of course, is not a perfect technique, and there is no guarantee that the peak of the cross-correlation vector will be the correct time-delay estimate. With real signals, the cross-correlation vector will usually contain multiple peaks, and the true time delay often does not yield the highest peak. Accumulated correlation handles this problem of noise by retaining the entire cross-correlation vector of each microphone pair, rather than selecting the peak. As shown below, each element of the cross-correlation vector corresponds to a different half-hyperboloid in space, and the value of that element indicates the likelihood that the sound source is located on that half-hyperboloid. For any given candidate location, its likelihood (based only upon the information from a single pair of microphones) is given by interpolating the values of the nearby half-hyperboloids. Values from multiple microphone pairs are summed to yield the total likelihood for that location. After all the information has been taken into account, the location with the highest likelihood is selected as the estimate for the sound source location. As before, preprocessing and/or temporal smoothing may be applied in addition to the basic algorithm just described.
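The accumulation step described above can be sketched as follows in Python with NumPy. This is a hedged illustration rather than the authors' implementation: the function name accumulated_correlation, the array layouts, the speed of sound of 343 m/s, straight-line propagation, and the choice of linear interpolation between neighboring lags are all assumptions made for this example. Prefiltering and temporal smoothing are omitted.

```python
import numpy as np
from itertools import combinations

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature

def accumulated_correlation(signals, mic_positions, candidates, fs):
    """Return one likelihood value per candidate location.

    signals       : (M, N) array, one equal-length signal per microphone (assumed layout).
    mic_positions : (M, D) array of microphone coordinates in meters.
    candidates    : (Q, D) array of candidate source locations in meters.
    fs            : sampling rate in Hz.
    """
    signals = np.asarray(signals, dtype=float)
    mic_positions = np.asarray(mic_positions, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    n = signals.shape[1]
    lags = np.arange(-(n - 1), n)  # lag, in samples, of each cross-correlation entry

    # Propagation time, in samples, from every candidate to every microphone.
    dists = np.linalg.norm(candidates[:, None, :] - mic_positions[None, :, :], axis=2)
    arrival = dists / SPEED_OF_SOUND * fs  # shape (Q, M)

    likelihood = np.zeros(len(candidates))
    for i, j in combinations(range(len(mic_positions)), 2):
        # Keep the entire cross-correlation vector for this pair, not just its peak.
        cc = np.correlate(signals[i], signals[j], mode="full")
        # Delay t_i - t_j that each candidate location would produce for this pair.
        expected_lag = arrival[:, i] - arrival[:, j]
        # Interpolate between neighboring lags and accumulate over pairs.
        likelihood += np.interp(expected_lag, lags, cc)
    return likelihood
```

The location estimate is then the candidate with the largest accumulated value, for example candidates[np.argmax(likelihood)]. Each cross-correlation is computed once per pair and only looked up per candidate, which is where the speed advantage over evaluating every candidate against the raw signals comes from.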
Examining the equations for the different techniques reveals a close connection between them. Accumulated correlation approximates beamforming by assuming that the time at which the microphones receive the sound is a constant offset from the time at which the sound was emitted, which is approximately true in contained environments such as an indoor room. At the same time, accumulated correlation is a generalization of the time-delay estimation formulation because they both share an essential computation, namely to cross-correlate each pair of microphone signals. The difference is that accumulated correlation retains the entire cross-correlation vector rather than just the peak. As a result, accumulated correlation takes all the available information into account before it makes a decision, which makes it more robust. This delaying of decision-making is the essence of the principle of least commitment, a well-known philosophy for algorithm development. Accumulated correlation is also known as a direct method because it directly computes the result without any intermediate decisions that have the potential to lose information.
The similarity between the algorithms leads naturally to a unifying framework,
as shown in the table below. Listed are the algorithms of beamforming,
accumulated correlation, and linear intersection (a popular time-delay
estimation technique), along with several other variations. All the
techniques can be expressed as computing the likelihood of a location q using
the equation shown. There are three differences between the
algorithms: (1) how they combine the information from multiple microphone
pairs, expressed by the function G; (2) the integration limits used for
comparing the signals in a single pair, captured by the function T; and
(3) how they weight the energy term, given by the value a.
From the table it is clear that beamforming is accurate, time-delay estimation
is efficient, and accumulated correlation is both accurate and efficient.
Accumulated correlation is simple to implement, is orders of magnitude faster than beamforming, and has been shown to generate results with essentially the same accuracy as that of beamforming. Shown below is the likelihood function computed on a frame of audio with accumulated correlation and with beamforming, illustrating that the two results are often indistinguishable. More extensive experiments, along with a more detailed explanation, can be found in the publications on this topic.