How does SAM-Audio work?

SAM-Audio uses a flow-matching diffusion transformer built on a PE-AV (Perception Encoder Audio-Visual) backbone. It takes an audio mixture and prompts as input, encodes them into a shared representation, and generates two separated tracks: the target audio and the residual.

What types of prompts does SAM-Audio support?

SAM-Audio supports three prompting methods: text prompts (natural-language descriptions such as 'dog barking'), visual prompts (clicking on objects in video frames), and temporal (span) prompts (marking time ranges where the target audio occurs).

Can I try SAM-Audio for free?

Yes! You can try SAM-Audio for free at TwoShot. Upload your audio and use text prompts to isolate any sound instantly.

SAM-Audio | Segment Anything for Audio - AI Audio Source Separation

Three Ways to Isolate Sound

SAM-Audio understands multiple input modalities, giving you flexible control over audio separation.

Text Prompts

Describe what you want to isolate in natural language. Type "dog barking", "singing voice", or "thunderstorm" and SAM-Audio extracts it.

Example: "acoustic guitar strumming"

Visual Prompts

Click on a person or instrument in a video frame. SAM-Audio uses visual-audio correspondence to isolate that source's sound.

Example: click on a speaker in the video

Temporal Prompts

Mark a time range where your target sound occurs. SAM-Audio uses that segment as a reference and extracts similar sounds throughout the recording.

Example: 0:05 - 0:08
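A span such as 0:05 - 0:08 has to be turned into something a model can consume. The sketch below shows one plausible way to do that: parse the timestamps into seconds, then mark which fixed-rate frames fall inside the range. The helper names and the frame rate are assumptions made for illustration, not part of SAM-Audio's actual interface.

```python
from typing import List, Tuple


def parse_timestamp(text: str) -> float:
    """Convert 'M:SS' or 'H:MM:SS' into seconds."""
    seconds = 0.0
    for part in text.strip().split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_span(span: str) -> Tuple[float, float]:
    """Parse a span such as '0:05 - 0:08' into (start, end) in seconds."""
    start_text, end_text = span.split("-")
    start, end = parse_timestamp(start_text), parse_timestamp(end_text)
    if end <= start:
        raise ValueError(f"span end must come after its start: {span!r}")
    return start, end


def span_to_frame_mask(span: str, duration_s: float,
                       frames_per_second: int = 25) -> List[bool]:
    """Hypothetical helper: True for every frame whose start time lies in the span.

    The frame rate is an illustrative assumption, not a documented value.
    """
    start, end = parse_span(span)
    n_frames = int(duration_s * frames_per_second)
    return [start <= i / frames_per_second < end for i in range(n_frames)]


# A 10-second clip at 10 frames per second, with the target active from 0:05 to 0:08.
mask = span_to_frame_mask("0:05 - 0:08", duration_s=10, frames_per_second=10)
```

A boolean mask like this is one common way to represent "the target is active here"; how SAM-Audio encodes spans internally is not described on this page.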


How SAM-Audio Works

A deep dive into the architecture powering state-of-the-art audio separation.

Input Encoding

Your audio mixture and prompts (text, visual, or temporal) are encoded into a shared multimodal representation space using PE-AV (Perception Encoder Audio-Visual).

Flow-Matching Diffusion

The model uses a flow-matching diffusion transformer to iteratively refine the separation, generating clean target audio from the mixture conditioned on your prompt.
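To make "iteratively refine" concrete: flow-matching models generate a sample by starting from noise and following a learned velocity field over a number of small steps. The toy sketch below uses fixed Euler steps and replaces the trained transformer with a hand-written velocity function that points straight at a known target. Every name here is hypothetical, and nothing in it reflects SAM-Audio's actual sampler, step count, or prompt conditioning.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity_fn: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int = 16) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Stand-in for the trained network (an assumption for illustration).

    On a straight path from the start point to `target`, the ideal velocity
    at position x and time t is (target - x) / (1 - t).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


noise = [0.3, -1.2, 0.8]      # starting point, standing in for random noise
target = [1.0, 0.0, -1.0]     # standing in for the clean target audio
sample = euler_sample(make_toy_velocity(target), noise, num_steps=8)
```

In a real system the velocity would come from the transformer, conditioned on the mixture and the prompt; the loop structure is the part this sketch is meant to show.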

Quality Re-ranking

Multiple candidate separations are generated and ranked using CLAP (text-audio similarity), Judge (separation quality), and ImageBind (visual-audio alignment).
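The page does not say how the three scores are combined, so the sketch below assumes the simplest option: a weighted mean over whichever scores are available, with higher meaning better. The `Candidate` class, the score keys, and the weights are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    name: str
    # Assumed to be normalized so that higher is better,
    # e.g. {"clap": 0.7, "judge": 0.8, "imagebind": 0.5}
    scores: Dict[str, float]


def combined_score(candidate: Candidate, weights: Dict[str, float]) -> float:
    """Weighted mean over the scorers that are present for this candidate."""
    used = {key: w for key, w in weights.items() if key in candidate.scores}
    total_weight = sum(used.values())
    if total_weight == 0:
        raise ValueError(f"no usable scores for candidate {candidate.name!r}")
    return sum(candidate.scores[key] * w for key, w in used.items()) / total_weight


def rerank(candidates: List[Candidate], weights: Dict[str, float]) -> List[Candidate]:
    """Return candidates sorted from best to worst combined score."""
    return sorted(candidates, key=lambda c: combined_score(c, weights), reverse=True)


# Audio-only example: no visual prompt, so no ImageBind score is present.
weights = {"clap": 1.0, "judge": 1.0, "imagebind": 1.0}
candidates = [
    Candidate("take_a", {"clap": 0.9, "judge": 0.4}),
    Candidate("take_b", {"clap": 0.6, "judge": 0.8}),
]
best = rerank(candidates, weights)[0]
```

Skipping missing scorers keeps the same function usable for text-only and visual prompts; a production system might weight or calibrate the scores differently.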

Output Generation

You receive both the isolated target audio and the residual (everything else), giving you complete control over your audio.
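Having both tracks makes simple re-balancing possible without running the model again. As a minimal sketch, assuming both tracks are equal-length lists of float samples in the range -1.0 to 1.0 (an assumption, not a documented output format), a remix is a per-sample weighted sum:

```python
from typing import List


def remix(target: List[float], residual: List[float],
          target_gain: float = 1.0, residual_gain: float = 1.0) -> List[float]:
    """Blend the two tracks with independent gains and clip to [-1.0, 1.0]."""
    if len(target) != len(residual):
        raise ValueError("target and residual must have the same number of samples")
    mixed = [target_gain * t + residual_gain * r for t, r in zip(target, residual)]
    return [max(-1.0, min(1.0, sample)) for sample in mixed]


target_track = [0.5, -0.5, 0.25]
residual_track = [0.25, 0.25, -0.5]

solo = remix(target_track, residual_track, residual_gain=0.0)    # target only
ducked = remix(target_track, residual_track, residual_gain=0.5)  # quieter background
```

Setting `target_gain` to 0.0 instead gives the removal case: everything except the prompted sound.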

What Can You Do With SAM-Audio?

From professional production to accessibility applications.

Music Production

Isolate vocals, drums, bass, or any instrument from existing tracks. Perfect for remixing, sampling, or creating stems.

Podcast & Video

Remove background noise, isolate speech from multiple speakers, or extract specific sound effects from recordings.

Film & TV Post

Clean dialogue, separate foley, or re-balance audio elements in post-production without access to original stems.

Accessibility

Help hearing-impaired users focus on specific sounds. Meta partners with hearing-aid maker Starkey and with 2gether-International.

Audio Analysis

Extract and analyze specific audio events for research, forensics, or quality assurance applications.

Creative Tools

Build new audio applications, integrate separation into DAWs, or create interactive sound experiences.


Technical Details

For developers and researchers who want to understand SAM-Audio's capabilities.

Model Variants

Size	Parameters	Use Case
Small	~500M	Fast inference, edge deployment
Base	~1B	Balanced performance
Large	~3B	Maximum quality

Each variant also has a TV-specialized version optimized for visual prompting.

Supported Formats

.wav .mp3 .mp4 .mov
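For anyone building an upload step around these formats, a pre-check on the file extension is a cheap first filter. The sketch below is a hypothetical helper, not part of SAM-Audio; it also records whether the file is a video, since only video inputs can carry visual prompts.

```python
from pathlib import Path
from typing import Optional

AUDIO_EXTENSIONS = {".wav", ".mp3"}
VIDEO_EXTENSIONS = {".mp4", ".mov"}


def classify_input(path: str) -> Optional[str]:
    """Return 'audio', 'video', or None for an unsupported extension."""
    suffix = Path(path).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return None


kind = classify_input("interview.MOV")
```

An extension check says nothing about the actual container or codec, so a real pipeline would still need to handle decode failures.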

Quality Assessment

CLAP Score: Text-audio similarity measurement
Judge Score: Separation quality assessment
ImageBind: Visual-audio alignment scoring

Benchmarks

SAM-Audio achieves state-of-the-art results on SAM-Audio-Bench, a comprehensive evaluation covering:

Speech separation and isolation
Music and instrument separation
Sound effects extraction
Multi-modal prompt handling


Frequently Asked Questions

Everything you need to know about SAM-Audio.

What is SAM-Audio?

SAM-Audio (Segment Anything for Audio) is Meta's foundation model for audio source separation. It can isolate any sound from complex audio mixtures using text descriptions, visual cues from video, or time-based prompts. It's part of Meta's "Segment Anything" family of AI models.

How is SAM-Audio different from other audio separation tools?

Unlike traditional separation tools that only work with predefined categories (vocals, drums, bass), SAM-Audio can isolate any sound you describe. Want to extract just the "bird chirping in the background"? Just type it. This flexibility comes from its multimodal training on text, visual, and temporal prompts.

What audio formats does SAM-Audio support?

SAM-Audio supports common audio and video formats including WAV, MP3, MP4, and MOV files. For video files, it can use the visual content to help identify and separate audio sources.

Can I use SAM-Audio for commercial projects?

SAM-Audio is released under the SAM License. Check the GitHub repository for the full license terms and usage guidelines for your specific use case.

How accurate is SAM-Audio?

SAM-Audio achieves state-of-the-art results on speech, music, and sound effect separation benchmarks. However, it may struggle with highly similar sources (like distinguishing individual singers in a choir or specific instruments in an orchestra). For best results, use clear, specific prompts.

What are the known limitations?

SAM-Audio currently cannot use audio as a prompt input, perform fully blind source separation without prompts, or reliably distinguish between very similar sound sources. It works best with clear, distinct audio events that can be described in text or identified visually.

Is there a free way to try SAM-Audio?

Yes! You can try SAM-Audio for free at TwoShot. Upload your audio, describe what you want to extract, and get results instantly without any setup or installation.