Sight is the sense on which humans chiefly rely to navigate the world, but sound might be just as important; it's been shown that people can learn to follow subtle cues in the volume, direction, and speed of audio signals. Inspired by this, scientists at the University of Eastern Finland recently proposed, in a preprint paper ("Do Autonomous Agents Benefit from Hearing?"), an AI system that complements visual data with sound. Preliminary results, they say, indicate that the approach improves agents' ability to complete goals in a 3D maze.
“Learning using only visual information may not always be easy for the learning agent,” wrote the coauthors. “For example, it is difficult for the agent to reach the target using only visual information in scenarios where there are many rooms and there is no direct line of sight between the agent and the target … Thus, the use of audio features could provide valuable information for such problems.”
The researchers' AI took the form of a deep Q-network, a type of model that can handle different kinds of input data (e.g., image pixels and audio features) and that has been successfully applied to playing Atari games. They trained it in VizDoom, a digital research environment built on the first-person shooter game Doom, using two different audio features: pitch and raw audio samples.
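To make that architecture concrete, here is a minimal sketch (in PyTorch, and not the authors' code) of a deep Q-network that fuses a screen image with an audio feature vector before producing one Q-value per navigation action. The layer sizes, input resolution, and class name are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch (not the authors' code): a deep Q-network that fuses
# an image observation with an audio feature vector and outputs Q-values
# for the navigation actions. All layer sizes are assumptions.
import torch
import torch.nn as nn

class AudioVisualDQN(nn.Module):
    def __init__(self, n_actions=4, audio_dim=128):
        super().__init__()
        # Convolutional branch for the screen pixels (grayscale 84x84 assumed)
        self.vision = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Dense branch for the audio feature (a pitch value or raw samples)
        self.audio = nn.Sequential(
            nn.Linear(audio_dim, 64), nn.ReLU(),
        )
        # Fused head that maps both modalities to a Q-value per action
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7 + 64, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, image, audio):
        v = self.vision(image)                        # (batch, 3136)
        a = self.audio(audio)                         # (batch, 64)
        return self.head(torch.cat([v, a], dim=1))    # (batch, n_actions)
```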
As the team explained: “We encode[d] information about the environment (distance to the goal) into the pitch of the sample. Then, the sample [was] provided to the agent along with the image … Since distance to the goal is encoded in the overall pitch of the … sample, these features could be easily digested for useful information for the agent (higher pitch equals closer to target). These features work as a sanity check that providing information on distance to goal is beneficial to the agent.”
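The pitch encoding described in that quote can be sketched in a few lines. In the snippet below (the frequency range, sample rate, and function name are illustrative assumptions, not values from the paper), the agent's distance to the goal is normalized and mapped onto the frequency of a short sine tone, so that a higher pitch signals a closer target.

```python
# Sketch of the pitch-encoding idea (parameter values are assumptions):
# map distance to the goal onto the frequency of a short sine tone,
# so a higher pitch means the target is closer.
import numpy as np

def distance_to_pitch_sample(distance, max_distance, sample_rate=16000,
                             duration=0.1, f_min=200.0, f_max=2000.0):
    # Normalize distance to [0, 1] and invert it: closer -> higher frequency
    closeness = 1.0 - np.clip(distance / max_distance, 0.0, 1.0)
    freq = f_min + closeness * (f_max - f_min)
    # Generate a short sine tone at that frequency
    t = np.linspace(0.0, duration, int(sample_rate * duration), endpoint=False)
    return np.sin(2.0 * np.pi * freq * t).astype(np.float32)
```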

In experiments on a powerful PC running a custom VizDoom scenario, the scientists tasked AI agents with navigating mazes to various rooms by turning left or right and moving forward or backward. The agents initially took completely random actions, but over time, as they received rewards for achieving goals (a technique known as reinforcement learning), their performance improved.
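For a sense of how such an agent is trained, below is a bare-bones reinforcement-learning loop: epsilon-greedy Q-learning that starts out fully random and gradually relies more on the learned Q-values. It assumes the AudioVisualDQN sketch above and a hypothetical environment interface, omits the experience replay and target network a full deep Q-network setup would use, and is not the researchers' training code.

```python
# Bare-bones epsilon-greedy Q-learning loop (illustrative only).
# Assumes env.reset() and env.step() return batched torch tensors for the
# (image, audio) observation, plus a scalar reward and a done flag.
import random
import torch

def train(env, model, optimizer, episodes=1000, gamma=0.99,
          eps_start=1.0, eps_end=0.05, eps_decay=0.995):
    eps = eps_start
    for _ in range(episodes):
        image, audio = env.reset()
        done = False
        while not done:
            if random.random() < eps:
                action = env.sample_action()              # explore
            else:
                with torch.no_grad():
                    q = model(image, audio)
                    action = int(q.argmax(dim=1))         # exploit
            (next_image, next_audio), reward, done = env.step(action)
            # One-step Q-learning target: r + gamma * max_a' Q(s', a')
            with torch.no_grad():
                bootstrap = 0.0 if done else gamma * model(next_image, next_audio).max().item()
            target = reward + bootstrap
            q_sa = model(image, audio)[0, action]
            loss = (q_sa - target) ** 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            image, audio = next_image, next_audio
        eps = max(eps_end, eps * eps_decay)               # decay exploration
```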
Two setups were tested: one in which the agents were placed at random positions within a single room, and a second in which they could spawn in any of five rooms. In the former, visual information combined with either pitch or raw audio yielded a better average reward per test than visual information alone, and in the latter, audio features combined with the visual input enabled the agents to reach their goals "most of the time."
“The use of only visual provides average success rate of 43%. But, the augmentation of visual with raw audio, and visual with pitch provides average success rates of 87% and 86%, respectively,” wrote the researchers. “Similarly, the average required number of steps to reach the target using only visual information is 1420. But, the addition of complementary raw audio and pitch to the visual reduces the number of steps to 751 and 614, respectively.”
The team leaves experiments in other environments, and tests beyond video games, to future work.