Sony AI releases Woosh foundation model for sound effect generation

Sony AI has released Woosh, a foundation AI model built specifically for sound effect generation – an area most generative audio models have largely overlooked in favor of music or general audio generation.

Sony AI, which is a research arm of Sony Corp., described Woosh in a research paper and I interviewed two of the authors for this article: Mark Ferras and Hakim Missoum. Other authors included Gaetan Hadjeres, Khaled Koutini, Benno Weck, Alexandre Bittar, Thomas Hummel, Zineb Lahrici, Joan Serra, and Yuki Mitsufuji.

Built for workflows used in gaming, film, and interactive media, Woosh supports both:

Text-to-audio: generating a sound effect from a written description.
Video-to-audio: generating sound directly from a video sequence, with an optional text prompt to guide the output.

The project was built around a core insight: professional sound design requires fundamentally different data and controls than general audio AI systems. One of the clearest findings was the significant gap between public and private training data.

Sony AI created two versions of the model:

A private model trained on licensed professional sound effect libraries, including Pro Sound Effects and BOOM, optimized for studio-grade output.
A public model that uses the same architecture as the private model but is trained on publicly available datasets.

The private model, trained on commercial libraries, significantly outperforms public alternatives on professional sound effect data. The public model outperforms comparable open-source models on public benchmarks. The public model is now available for the research community to access and experiment with. The private model is also available for those who are interested in licensing it.

You can find more information about Woosh here: https://ai.sony/blog/introducing-woosh-sony-ais-sound-effect-foundation-model
To explore Woosh, access the model weights, and listen to demo samples, visit: https://sonyresearch.github.io/Woosh/
To access the Woosh-Flow Private, please visit: https://sonyresearch.github.io/Woosh/flow-private.html

Sony has created Woosh as an AI model that can generate sound effects. Source: Sony

In the research paper, the authors wrote that the audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines.

Sony AI has publicly released the Woosh sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. The release includes a high-quality audio encoder/decoder model optimized for sound effects.

It also provides a text-audio alignment model for conditioning, together with text-to-audio and video-to-audio generative models.

Distilled text-to-audio and video-to-audio models are also included in the release, allowing for low-resource operation and fast inference. The evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux.

The inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.

The challenge

Current research on generative audio modeling largely focuses on conditioned models, mainly based on textual descriptions of audio. While significant progress has been made in terms of modeling, most approaches do not provide open weights for the research community to build upon.

Other approaches do provide open weights, but only support low audio sampling rates of up to 16 kHz. Notable exceptions are AudioLDM2, StableAudio-Open, and TangoFlux, which generate higher quality audio while addressing the generation of both general audio and music.

Others may be familiar with the notion of music generation via AI because of MusicGen, a high-quality open model that specializes in music generation. But Sony has researched a text-conditioned generative model specializing in instantaneous high-quality sound effect generation. Based on the multimodal FLUX-Kontext extension, Sony’s latent diffusion model (LDM) has been optimized from the ground up for sound effects and for professional use.

As part of the public release (https://github.com/SonyResearch/Woosh/), the researchers provide inference code and open weights for non-commercial use for the full pipeline: encoder/decoder, text-conditioning, and diffusion models, the latter being further distilled for instant generation. Sony benchmarked its public and private models against StableAudio-Open and TangoFlux, and the CLAP model against LAION-CLAP.

For the best audio generation quality, a version trained exclusively on a large amount of studio-quality licensed sound effect libraries is available internally.

Woosh: A Sound Effects Foundation Model

The current public release provides four models that address the text-to-audio (T2A) and video-to-audio (V2A) tasks:

Audio encoder/decoder (Woosh-AE): High-quality latent encoder/decoder providing latents for generative modeling and decoding audio from generated latents.
Text conditioning (Woosh-CLAP): Multimodal text-audio alignment model providing token latents for diffusion model conditioning or CLAP scoring.
T2A Generation (Woosh-Flow and Woosh-DFlow): Original and distilled LDMs generating audio unconditionally or from a given text prompt.
V2A Generation (Woosh-VFlow and Woosh-DVFlow): Multimodal LDM generating audio from a video sequence with optional text prompts.
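The "CLAP scoring" mentioned for the text conditioning model is commonly computed as the cosine similarity between a text embedding and an audio embedding. The sketch below illustrates only that scoring step. The embedding vectors are made-up placeholders, and the function names are hypothetical; none of this is taken from the Woosh code base.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


def rank_candidates(text_embedding, audio_embeddings):
    """Sort candidate clips by similarity to the prompt, best match first.

    audio_embeddings maps a clip name to its embedding vector.
    """
    scored = [
        (name, cosine_similarity(text_embedding, emb))
        for name, emb in audio_embeddings.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Placeholder vectors standing in for real model outputs (assumption).
prompt = [0.9, 0.1, 0.0]
candidates = {
    "footsteps_a": [0.8, 0.2, 0.1],
    "engine_rev": [0.0, 0.1, 0.9],
}
ranking = rank_candidates(prompt, candidates)
```

A higher score suggests the generated clip matches the text description more closely, which is why this kind of score is used both for evaluation and for picking among several generations.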

The Woosh-AE module is based on the VOCOS architecture, a GAN-based vocoder operating on the domain of the short-time Fourier transform (STFT) complex coefficients.
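To make "STFT complex coefficients" concrete, here is a minimal sketch that splits a signal into overlapping windowed frames and computes one complex number per frequency bin for each frame. It uses a naive DFT and tiny frame sizes for readability. The frame length, hop size, and function names are illustrative assumptions and do not reflect the settings or code used by Woosh-AE or VOCOS.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft(samples, frame_len=8, hop=4):
    """Return a list of frames, each a list of complex bins 0..frame_len/2."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT: correlate the frame with a complex sinusoid at bin k.
            coeff = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            bins.append(coeff)
        frames.append(bins)
    return frames


# A cosine that completes exactly two cycles per 8-sample frame.
signal = [math.cos(2 * math.pi * 2 * n / 8) for n in range(16)]
spectrum = stft(signal)
peak_bin = max(range(len(spectrum[0])), key=lambda k: abs(spectrum[0][k]))
```

Each coefficient carries both magnitude and phase, which is what allows a decoder working in this domain to reconstruct a waveform with an inverse STFT.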

The training data includes Internal Music (IM), an internal dataset consisting of 78,000 commercially licensed popular music songs. Single stems as well as mixed stems were used for training, with stem labels used only for the purpose of mixing. The sample rate is 44.1 kHz.

To train the Woosh-AE-Private model, Sony AI used VCTK, Wapy, Internal Music and a mix of several commercially-licensed studio-quality sound effect libraries, involving around one million samples and 5500 h of commercial audio.

The pre-processing for training both public and private autoencoders consisted of resampling the audio to 48 kHz and randomly taking 1-second long chunks from each audio. Since the architecture is fully-convolutional, the model can operate with any length at inference time.
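The random chunking step can be sketched as follows, assuming the audio has already been resampled to 48 kHz and is held as a plain list of mono samples. The function name and the zero-padding of clips shorter than one second are assumptions for illustration; the article does not say how short clips are handled.

```python
import random

SAMPLE_RATE = 48_000   # Hz, the training rate stated above
CHUNK_SECONDS = 1.0    # chunk duration stated above


def random_chunk(samples, sample_rate=SAMPLE_RATE, seconds=CHUNK_SECONDS, rng=None):
    """Return a randomly positioned fixed-length slice of a mono clip."""
    rng = rng or random.Random()
    chunk_len = int(sample_rate * seconds)
    if len(samples) <= chunk_len:
        # Assumption: pad short clips with silence so every chunk has equal length.
        return list(samples) + [0.0] * (chunk_len - len(samples))
    start = rng.randrange(len(samples) - chunk_len + 1)
    return list(samples[start:start + chunk_len])


# Three seconds of a ramp signal, so neighbouring samples differ by exactly 1.
clip = [float(i) for i in range(3 * SAMPLE_RATE)]
chunk = random_chunk(clip, rng=random.Random(0))
short_chunk = random_chunk([1.0] * 10)
```

Training on short fixed-length chunks keeps batches uniform, while the fully convolutional design described above is what lets the same model process clips of any length at inference time.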

How it started

Missoum said the researchers started on the project a few years ago. They looked at the AI landscape, especially the generation landscape, and wondered what AI models could still be created. One of the researchers suggested audio could be useful, and he noted there were no specific audio or sound effect generative models.

“The impetus was to create a model that was tailored to sound effects, but also meets the requirements and expectations of professional audio creators and sound designers, rather than just having a model online for amateur content creators,” said Missoum.

The good thing was that Sony R&D had access to a lot of professional audio experts.

Where can it be used?

Asked where it could be useful, Ferras told me that the main goal is to help creators such as sound designers work faster. Those creators work iteratively, testing out different sounds as they go.

The tool could be used to give sound designers access to more new sounds. That was the idea behind the research, said Missoum.

“The big thing that you hear not only in the audio space but generally from artists, when it comes to generative AI, is the need for more controllability with generative output, and so that’s something we really took to heart and tried to implement in our models, and that’s how we added some capabilities to work at it,” said Missoum.

The video-to-audio solution is a different kind of tool. You can show the model a video and it can create the audio track for the video. It can also take a text prompt alongside the video. If you show people walking, the model could try to generate the sound of their footsteps. I asked what would happen if you showed it a soundless racing car, and they said that it could generate a race car sound.

Ferras said it took a couple of years to build the model because the team had to build everything from scratch. They also compressed the audio in a way that it would be most usable for industry professionals. That meant they had to do less compression than usual, and that took time to get right.

“Our audio encoder compresses maybe three or four times the audio that others do,” Ferras said.

The model hasn’t been used in production yet, since it’s pretty new.

Missoum said, “We’re working with some, some other professionals across game studios and game studios just to evaluate the model and give us feedback.”

The researchers want to know the capabilities that professionals want to see in the model. One possible feature is to add variation to a sound. A gunshot might always sound the same. But if there was variation added to it, then it would sound different and perhaps more realistic. And in that case, sound designers wouldn’t have to generate tons of variations themselves.

“Those are some of the use cases that we’re doing, and that we’re trying to add to the model, at least in the public model,” Missoum said.

The big picture

Sony Interactive Entertainment CEO, Hideaki Nishino, gave a talk on AI during the company’s most recent investor relations call. He communicated that Sony was going all in on AI tech for games.

Missoum said the goal is to empower creators.

“The idea is to really help creative game developers make better games at a faster pace. It’s to empower creators and not to replace them,” Ferras said. “And hopefully less expensive. Big games are getting more and more expensive to make, and so the idea is really to optimize that using AI technology.”

The team has also learned from Sony AI’s work creating GT Sophy, a driving AI that is meant to give the best human players a run for their money in racing games.

“It’s a great example how we can enhance the game experience using this type of technology, and to me, like, GT Sophy is also such a great experience of [how it can create a good] reception from the gaming community.”

It’s worth noting the gaming community in the West has been vociferous about not wanting AI slop and staying away from AI that is meant to replace humans in their jobs. But with something like GT Sophy, which enhances the gaming experience, the reaction is positive.

“We are trying to just replicate that in other areas,” Ferras said.

“We would like to see this technology being used by audio teams across film studios and game studios, and see really how we could accelerate the workflows and get feedback from them to make their jobs easier,” Missoum said. “At the same time, also, what I would like to see is how can we use this technology to enable new ways of creating new experiences.”

Missoum said the team doesn’t want companies to conclude that they don’t need sound designers if they just use this tool instead. Rather, they want it to enhance the artistic process of human creators. One of the challenges sound designers face is that it is very hard to exactly describe with words the kind of sound they want to generate.

This tool could make it easier for the sound designers to express themselves.

The team didn’t go out and collect sounds themselves. They relied on licensed professional data sets, namely from two of the biggest libraries out there which are used widely by the industry. Sony AI obtained legal permission to use those libraries. Over time, they expect the quality of sound effect generation will get better.

“In terms of like the quality of the output itself, I think we’re in a pretty good place already,” Missoum said. “What’s left to do is really build the right tooling around the model, the right UI and UX and capabilities on top of that foundational generative model.”

Sony AI is working on that part right now.

“What we’re trying to do is to basically take that model and make it integrated into existing audio pipelines and audio workloads that sound designers are familiar with,” Missoum said. “We try to be as seamless as possible in our integration.”

So far, the public model is non-commercial as it’s research, but the public model could be turned into a commercial tool with more work, Missoum said.
