Isolate sounds from complex audio mixtures using natural language, visual cues, or time-based prompts. SAM-Audio is Meta's foundation model for audio source separation.
SAM-Audio understands multiple input modalities, giving you flexible control over audio separation.
Describe what you want to isolate in natural language. Type "dog barking", "singing voice", "acoustic guitar strumming", or "thunderstorm" and SAM-Audio extracts it.
Click on a person or instrument in a video frame. SAM-Audio uses visual-audio correspondence to isolate that source's sound.
Mark a time range where your target sound occurs. SAM-Audio uses that segment as a reference and extracts similar sounds throughout the recording.
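One way to picture a time-based prompt is as a per-frame mask over the recording. The sketch below is illustrative only: `spans_to_frame_mask` and the frame rate are hypothetical and are not part of SAM-Audio's API, whose internal span representation is not described here.

```python
from typing import List, Tuple


def spans_to_frame_mask(
    spans: List[Tuple[float, float]],
    duration_s: float,
    frame_rate_hz: float = 25.0,  # hypothetical value, not SAM-Audio's actual frame rate
) -> List[bool]:
    """Mark a frame True when its centre falls inside any (start, end) span in seconds."""
    n_frames = int(round(duration_s * frame_rate_hz))
    mask = [False] * n_frames
    for start, end in spans:
        if end <= start:
            raise ValueError(f"span end must be after start: {(start, end)}")
        for i in range(n_frames):
            centre = (i + 0.5) / frame_rate_hz
            if start <= centre < end:
                mask[i] = True
    return mask


# A 4-second clip at 2 frames per second, with the target marked from 1.0 s to 2.0 s.
mask = spans_to_frame_mask([(1.0, 2.0)], duration_s=4.0, frame_rate_hz=2.0)
print(mask)
```

Using frame centres rather than frame edges avoids marking a frame that only touches the boundary of the selected range.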
A deep dive into the architecture powering state-of-the-art audio separation.
Your audio mixture and prompts (text, visual, or temporal) are encoded into a shared multimodal representation space using PE-AV (Perception Encoder Audio-Visual).
The model uses a flow-matching diffusion transformer to iteratively refine the separation, generating clean target audio from the mixture conditioned on your prompt.
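To make "iteratively refine" concrete, here is a toy sketch of flow-matching sampling: start from noise and take fixed Euler steps along a velocity field from t=0 to t=1. It is not SAM-Audio's implementation. In the real model the velocity would come from the learned transformer conditioned on the mixture and prompt; here `make_toy_velocity` is a hypothetical stand-in that uses the exact velocity of a straight-line path to a known target.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int = 16,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t stays strictly below 1, so the toy velocity never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a learned network: straight-line velocity toward `target`."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


random.seed(0)
target = [0.5, -0.25, 1.0]                        # pretend "clean audio" samples
noise = [random.gauss(0.0, 1.0) for _ in target]  # starting point at t=0
sample = euler_sample(make_toy_velocity(target), noise, num_steps=8)
print(sample)
```

With this idealised velocity the final step lands on the target; a learned velocity field is only approximate, which is one reason a system might generate several candidates and rank them.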
Multiple candidate separations are generated and ranked using CLAP (text-audio similarity), Judge (separation quality), and ImageBind (visual-audio alignment).
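The ranking step can be sketched as a weighted sum over per-candidate scores. This is an assumption for illustration: how SAM-Audio actually combines the CLAP, Judge, and ImageBind scores is not stated here, and the score keys, values, and equal weights below are hypothetical.

```python
from typing import Dict, List


def rank_candidates(
    scores: List[Dict[str, float]],
    weights: Dict[str, float],
) -> List[int]:
    """Return candidate indices ordered from best to worst by weighted total score."""
    totals = [sum(weights[name] * s[name] for name in weights) for s in scores]
    return sorted(range(len(scores)), key=lambda i: totals[i], reverse=True)


# Made-up scores for three candidate separations (higher is better).
scores = [
    {"text_match": 0.62, "quality": 0.80, "visual_match": 0.40},
    {"text_match": 0.71, "quality": 0.75, "visual_match": 0.55},
    {"text_match": 0.50, "quality": 0.90, "visual_match": 0.30},
]
weights = {"text_match": 1.0, "quality": 1.0, "visual_match": 1.0}  # assumed equal weighting

order = rank_candidates(scores, weights)
best = order[0]
print(order, best)
```

For an audio-only prompt, setting the `visual_match` weight to 0.0 would drop that term from the total without changing the rest of the code.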
You receive both the isolated target audio and the residual (everything else), giving you complete control over your audio.
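Having both outputs means you can re-balance a recording by mixing them back together at different levels. The following is a minimal sketch, assuming the target and residual are equal-length lists of samples in the range [-1, 1]; `rebalance` is a hypothetical helper, not a SAM-Audio function.

```python
from typing import List


def rebalance(
    target: List[float],
    residual: List[float],
    target_gain_db: float = 0.0,
    residual_gain_db: float = 0.0,
) -> List[float]:
    """Mix target and residual with separate gains (in dB), clipping to [-1, 1]."""
    if len(target) != len(residual):
        raise ValueError("target and residual must have the same length")
    target_gain = 10.0 ** (target_gain_db / 20.0)      # dB to linear amplitude
    residual_gain = 10.0 ** (residual_gain_db / 20.0)
    mixed = [target_gain * t + residual_gain * r for t, r in zip(target, residual)]
    return [max(-1.0, min(1.0, s)) for s in mixed]


# Keep the isolated sound at its original level and turn everything else down by 20 dB.
quieter_background = rebalance([0.1, 0.2], [0.3, -0.4], residual_gain_db=-20.0)
print(quieter_background)
```

Hard clipping is the simplest safeguard against overflow; a production tool would more likely normalise or apply a limiter.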
From professional production to accessibility applications.
Isolate vocals, drums, bass, or any instrument from existing tracks. Perfect for remixing, sampling, or creating stems.
Remove background noise, isolate speech from multiple speakers, or extract specific sound effects from recordings.
Clean dialogue, separate foley, or re-balance audio elements in post-production without access to original stems.
Help hearing-impaired users focus on specific sounds. Meta partners with Starkey hearing aids and 2gether International.
Extract and analyze specific audio events for research, forensics, or quality assurance applications.
Build new audio applications, integrate separation into DAWs, or create interactive sound experiences.
For developers and researchers who want to understand SAM-Audio's capabilities.
| Size | Parameters | Use Case |
|---|---|---|
| Small | ~500M | Fast inference, edge deployment |
| Base | ~1B | Balanced performance |
| Large | ~3B | Maximum quality |
Each variant also has a TV-specialized version optimized for visual prompting.
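As a rough guide to choosing a size, the memory needed just to hold the weights is the parameter count times the bytes per parameter. The sketch below uses the approximate counts from the table. It covers weights only, so it leaves out activations, any auxiliary encoders or ranking models, and framework overhead. The names are illustrative.

```python
# Approximate parameter counts taken from the table above.
APPROX_PARAMS = {"small": 500e6, "base": 1e9, "large": 3e9}
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(variant: str, dtype: str = "float16") -> float:
    """Estimate memory for the weights alone, in GiB (2**30 bytes)."""
    return APPROX_PARAMS[variant] * BYTES_PER_PARAM[dtype] / 2**30


for name in APPROX_PARAMS:
    print(f"{name}: ~{weight_memory_gib(name):.1f} GiB in float16, "
          f"~{weight_memory_gib(name, 'float32'):.1f} GiB in float32")
```

Actual memory use during inference will be higher than these figures, so treat them as a lower bound when sizing hardware.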
SAM-Audio achieves state-of-the-art results on SAM-Audio-Bench, a comprehensive evaluation covering:
Everything you need to know about SAM-Audio.
SAM-Audio (Segment Anything for Audio) is Meta's foundation model for audio source separation. It can isolate any sound from complex audio mixtures using text descriptions, visual cues from video, or time-based prompts. It's part of Meta's "Segment Anything" family of AI models.
Unlike traditional separation tools that only work with predefined categories (vocals, drums, bass), SAM-Audio can isolate any sound you describe. Want to extract only the "bird chirping in the background"? Type it. This flexibility comes from its multimodal training on text, visual, and temporal prompts.
SAM-Audio supports common audio and video formats including WAV, MP3, MP4, and MOV files. For video files, it can use the visual content to help identify and separate audio sources.
SAM-Audio is released under the SAM License. Check the GitHub repository for the full license terms and usage guidelines for your specific use case.
SAM-Audio achieves state-of-the-art results on speech, music, and sound effect separation benchmarks. However, it may struggle with highly similar sources (like distinguishing individual singers in a choir or specific instruments in an orchestra). For best results, use clear, specific prompts.
SAM-Audio currently cannot use audio as a prompt input, perform fully blind source separation without prompts, or reliably distinguish between very similar sound sources. It works best with clear, distinct audio events that can be described textually or identified visually.
Yes! You can try SAM-Audio for free at TwoShot. Upload your audio, describe what you want to extract, and get results instantly without any setup or installation.