Details
Stable Audio 2.5 uses a latent diffusion architecture built on a diffusion transformer (DiT) and supports text-to-audio and audio-to-audio generation, including audio inpainting. Stability AI's proprietary Adversarial Relativistic-Contrastive (ARC) post-training method reduces generation from 50 diffusion sampling steps to 8, producing tracks in under 2 seconds on a GPU. The model was trained on a fully licensed dataset. A lighter open-source variant, Stable Audio Open Small (341M parameters), was co-developed with Arm for on-device mobile audio generation. Its predecessor, Stable Audio 2.0, was trained on a licensed dataset from the AudioSparx music library.
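To illustrate why cutting 50 steps to 8 speeds generation up roughly linearly, here is a minimal, hypothetical sketch of a few-step diffusion sampling loop. All names (`dit_denoiser`, `sample`) and the toy denoiser are illustrative assumptions, not Stability AI's actual API or the ARC method itself; a distilled model simply lets the same loop run with far fewer, larger denoising steps.

```python
import numpy as np

def dit_denoiser(latent, t):
    # Stand-in for the diffusion transformer: a toy function that
    # nudges the noisy latent toward a "clean" estimate as t -> 0.
    return latent * (1.0 - t)

def sample(num_steps=8, latent_shape=(64, 32), seed=0):
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(latent_shape)  # start from pure noise
    # Timesteps from 1.0 (all noise) down to 0.0 (clean audio latent);
    # a distilled model makes num_steps=8 viable instead of ~50.
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        denoised = dit_denoiser(latent, t_cur)
        # Euler update: move part of the way toward the denoised estimate
        latent = latent + (denoised - latent) * (t_cur - t_next) / max(t_cur, 1e-8)
    return latent

latent = sample(num_steps=8)
print(latent.shape)
```

Each loop iteration costs one forward pass through the denoiser, so the wall-clock cost scales directly with `num_steps`; the point of post-training methods like ARC is to keep output quality acceptable at the smaller step count.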