Deep dive into the STFT-based audio analysis pipeline.
The short-time Fourier transform (STFT) is the core of Music Video Creator’s audio analysis.
| Parameter | Default | Description |
|---|---|---|
| window_size | 2048 samples | FFT window size (~46 ms at 44.1 kHz) |
| hop_length | 512 samples | Step between successive windows (~11.6 ms at 44.1 kHz; 75% overlap with the default window) |
| window_function | Hann | Windowing function |
| sample_rate | Auto-detect | Read from the input WAV file |
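To show how these parameters fit together, here is a minimal sketch of a Hann-windowed STFT. It assumes NumPy and a mono float signal; `stft_magnitude` is a hypothetical helper for illustration, not the project's actual API.

```python
import numpy as np

def stft_magnitude(signal, window_size=2048, hop_length=512):
    """Return a (frames, window_size // 2 + 1) magnitude spectrogram."""
    window = np.hanning(window_size)  # Hann window, as in the table
    # Number of full windows that fit when stepping by hop_length.
    n_frames = 1 + (len(signal) - window_size) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + window_size] * window
        for i in range(n_frames)
    ])
    # Real FFT of every frame; keep magnitudes only.
    return np.abs(np.fft.rfft(frames, axis=1))

sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)          # one second of a 440 Hz sine
spec = stft_magnitude(tone)
peak_bin = int(spec.mean(axis=0).argmax())    # strongest bin on average
peak_hz = peak_bin * sample_rate / 2048
```

With the defaults, each FFT bin spans `sample_rate / window_size` Hz, so the 440 Hz tone lands in the bin closest to that frequency rather than exactly on it.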
Larger window_size:
- ✓ Better frequency resolution
- ✗ Worse time resolution (smeared transients)

Smaller window_size:
- ✓ Better time resolution (crisp transients)
- ✗ Worse frequency resolution (blurred bass)

The default (2048) provides a good balance for music visualization.
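The trade-off can be put in numbers: frequency resolution is `sample_rate / window_size` Hz per bin, while the window spans `window_size / sample_rate` seconds. The helper below is a small illustrative sketch (the function name is an assumption, not part of the project).

```python
def stft_resolution(window_size, sample_rate=44100):
    """Return (frequency resolution in Hz per bin, window duration in ms)."""
    freq_res_hz = sample_rate / window_size
    window_ms = 1000.0 * window_size / sample_rate
    return freq_res_hz, window_ms

# Compare a small, the default, and a large window.
for n in (512, 2048, 8192):
    hz, ms = stft_resolution(n)
    print(f"window_size={n:5d}: {hz:6.2f} Hz/bin, {ms:6.1f} ms window")
```

At 44.1 kHz the default window gives roughly 21.5 Hz per bin over about 46 ms. Quadrupling the window quarters the bin width but stretches every transient across four times as much audio.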
Features extracted per video frame from the STFT spectrogram.
| Feature | Range | Description | Visual Mapping |
|---|---|---|---|
| rms | 0-1 | Root mean square energy (loudness) | Size, scale, brightness |
| bass_energy | 0-1 | Energy in 20-250 Hz band | Disk rotation, terrain height |
| mid_energy | 0-1 | Energy in 250-4000 Hz band | Particle velocity, ring count |
| high_energy | 0-1 | Energy in 4000-20000 Hz band | Sparkle, edge effects |
| Feature | Range | Description | Visual Mapping |
|---|---|---|---|
| spectral_centroid | Hz | “Brightness” of sound | Color hue shift |
| peak_frequency | Hz | Dominant frequency | Not typically used |
| onset_strength | 0-1 | Transient/attack strength | Glitch intensity |
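As a rough illustration of the two frequency-valued features, the sketch below uses the standard definitions: the spectral centroid is the magnitude-weighted mean frequency of a frame, and the peak frequency is the frequency of its strongest bin. The function names and signatures are assumptions for this example, and it uses plain Python lists to stay dependency-free.

```python
def spectral_centroid(magnitudes, sample_rate, window_size):
    """Magnitude-weighted mean frequency (Hz) of one STFT frame."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # silent frame: no meaningful centroid
    bin_hz = sample_rate / window_size
    return sum(k * bin_hz * m for k, m in enumerate(magnitudes)) / total

def peak_frequency(magnitudes, sample_rate, window_size):
    """Frequency (Hz) of the strongest bin in one STFT frame."""
    k = max(range(len(magnitudes)), key=lambda i: magnitudes[i])
    return k * sample_rate / window_size

# Toy 5-bin spectrum (window_size=8 gives 8 // 2 + 1 = 5 bins, 1000 Hz apart).
frame = [0.0, 1.0, 3.0, 1.0, 0.0]
centroid = spectral_centroid(frame, sample_rate=8000, window_size=8)
peak = peak_frequency(frame, sample_rate=8000, window_size=8)
```

For this symmetric toy spectrum both values coincide at the center bin. On real music the centroid usually sits well above the peak frequency, because high-frequency content pulls the weighted mean upward.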
| Feature | Type | Description | Visual Mapping |
|---|---|---|---|
| is_beat | bool | Hard beat marker | Trigger zoom/flash/shake |
| beat_strength | 0-1 | Smooth beat envelope | Continuous pulsing |
Beat detection uses spectral flux followed by peak picking.
1. Compute onset envelope from spectral flux:
   onset[i] = sum(max(spectrum[i] - spectrum[i-1], 0)²)
2. Smooth with moving average (window=5)
3. Find peaks above threshold:
   - threshold = max(onset) × threshold_ratio
   - peaks must be at least min_interval_sec apart
4. Convert peak frames to timestamps
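The four steps above can be sketched as follows. This is an illustrative implementation assuming NumPy and a `(frames, bins)` magnitude spectrogram; the function name, the simple local-maximum test, and the way `min_interval_sec` is converted to frames are assumptions, not the project's actual code.

```python
import numpy as np

def detect_beats(spec, sample_rate, hop_length=512,
                 threshold_ratio=0.3, min_interval_sec=0.15):
    """spec: (frames, bins) magnitudes. Returns beat times in seconds."""
    # 1. Spectral flux: squared positive change between consecutive frames.
    diff = np.diff(spec, axis=0)
    flux = np.sum(np.maximum(diff, 0.0) ** 2, axis=1)
    onset = np.concatenate([[0.0], flux])  # frame 0 has no predecessor
    # 2. Moving average over 5 frames.
    onset = np.convolve(onset, np.ones(5) / 5, mode="same")
    # 3. Local maxima above the threshold, at least min_gap frames apart.
    threshold = onset.max() * threshold_ratio
    min_gap = int(round(min_interval_sec * sample_rate / hop_length))
    peaks, last = [], -min_gap
    for i in range(1, len(onset) - 1):
        is_peak = onset[i] >= onset[i - 1] and onset[i] > onset[i + 1]
        if is_peak and onset[i] > threshold and i - last >= min_gap:
            peaks.append(i)
            last = i
    # 4. Convert frame indices to timestamps.
    return [i * hop_length / sample_rate for i in peaks]
```

Because the moving average turns a single flux spike into a short plateau, the reported peak can sit a frame or two after the true onset. A centered peak picker would remove that small bias.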
| Parameter | Default | Effect |
|---|---|---|
| threshold_ratio | 0.3 | Lower = more beats detected |
| min_interval_sec | 0.15 | Prevents double-triggers |
| decay_rate | 0.85 | How fast beat_strength fades |
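One plausible reading of decay_rate is a per-frame multiplier: beat_strength jumps to 1.0 on a beat frame and is multiplied by decay_rate on every following frame. The sketch below illustrates that interpretation only; the exact update rule used by the project is not stated here, and `beat_envelope` is a hypothetical name.

```python
def beat_envelope(is_beat, decay_rate=0.85):
    """Turn per-frame beat flags into a decaying 0-1 envelope."""
    strength, out = 0.0, []
    for beat in is_beat:
        # Reset to full strength on a beat, otherwise decay exponentially.
        strength = 1.0 if beat else strength * decay_rate
        out.append(strength)
    return out

flags = [False, True, False, False, False, True, False]
envelope = beat_envelope(flags)
```

Under this assumption, a decay_rate of 0.85 drops the envelope below 0.5 about five frames after a beat, since 0.85^5 is roughly 0.44.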
Tempo is estimated with two methods; the result with the higher confidence wins.
Autocorrelation method:
1. Compute autocorrelation of the onset envelope
2. Find the peak within the lag range corresponding to 60-200 BPM
3. Convert peak lag to BPM
4. Confidence = (peak - mean) / (2 × std)
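A minimal sketch of the autocorrelation approach, assuming NumPy and an onset envelope sampled at a known frame rate. The function name is hypothetical. Taking the mean and standard deviation over the searched lag window only is also an assumption, since the text does not say which values they cover.

```python
import numpy as np

def tempo_autocorr(onset, frame_rate, bpm_min=60, bpm_max=200):
    """Estimate (BPM, confidence) from an onset envelope (frame_rate in frames/sec)."""
    x = np.asarray(onset, dtype=float)
    x = x - x.mean()
    # Keep non-negative lags only: index k is the autocorrelation at lag k.
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    # Faster tempo means shorter lag, so bpm_max sets the lower lag bound.
    lag_min = int(np.ceil(60.0 * frame_rate / bpm_max))
    lag_max = int(np.floor(60.0 * frame_rate / bpm_min))
    window = ac[lag_min:lag_max + 1]
    best_lag = lag_min + int(window.argmax())
    bpm = 60.0 * frame_rate / best_lag
    std = window.std()
    confidence = (window.max() - window.mean()) / (2.0 * std) if std > 0 else 0.0
    return bpm, confidence

frame_rate = 100.0            # hypothetical envelope rate, frames per second
onset = np.zeros(1000)
onset[::50] = 1.0             # one onset every 0.5 s
bpm, confidence = tempo_autocorr(onset, frame_rate)
```

The envelope must be longer than the largest lag searched (here 100 frames). For real audio at the default hop, `frame_rate` would be `sample_rate / hop_length`.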
Beat interval method:
1. Calculate the time between detected beats
2. Filter outliers (< 0.5× or > 2× median)
3. BPM = 60 / mean_interval
4. Confidence = 1 - coefficient_of_variation
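The interval-based estimate can be sketched with the standard library alone. The function name is hypothetical. Using the population standard deviation and clamping confidence at zero are assumptions added for this example.

```python
import statistics

def tempo_from_beats(beat_times):
    """Estimate (BPM, confidence) from beat timestamps in seconds."""
    # 1. Time between consecutive beats.
    intervals = [b - a for a, b in zip(beat_times, beat_times[1:])]
    if not intervals:
        return 0.0, 0.0
    # 2. Drop intervals shorter than 0.5x or longer than 2x the median.
    med = statistics.median(intervals)
    kept = [d for d in intervals if 0.5 * med <= d <= 2.0 * med]
    mean = statistics.fmean(kept)
    if mean <= 0:
        return 0.0, 0.0
    # 3-4. BPM from the mean interval; confidence from its relative spread.
    cv = statistics.pstdev(kept) / mean  # coefficient of variation
    return 60.0 / mean, max(0.0, 1.0 - cv)

# Steady 0.5 s beats with one missed beat (the 2.0 s gap is filtered out).
bpm, confidence = tempo_from_beats([0.0, 0.5, 1.0, 1.5, 3.5, 4.0])
```

The outlier filter is what keeps a single missed beat from dragging the estimate toward half tempo.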
For GPU shaders, the spectrum is resampled to 128 bins.
Logarithmic spacing: better for music, since it matches human perception of pitch.
Bin 0: 20-25 Hz (sub-bass)
Bin 32: ~100 Hz (bass)
Bin 64: ~500 Hz (low-mids)
Bin 96: ~2500 Hz (presence)
Bin 127: ~16000 Hz (brilliance)
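A sketch of how a linear STFT frame might be folded into 128 logarithmically spaced bins, assuming NumPy. The 20 Hz to 16 kHz range, the averaging rule, and both function names are assumptions for illustration. The project's exact mapping is not given here, so these edges will not necessarily reproduce the bin frequencies listed above.

```python
import numpy as np

def log_bin_edges(n_bins=128, f_min=20.0, f_max=16000.0):
    """Geometrically spaced band edges in Hz (n_bins + 1 values)."""
    return np.geomspace(f_min, f_max, n_bins + 1)

def resample_log(magnitudes, sample_rate, n_bins=128, f_min=20.0, f_max=16000.0):
    """Average one linear STFT frame into log-spaced bins."""
    mags = np.asarray(magnitudes, dtype=float)
    freqs = np.linspace(0.0, sample_rate / 2, len(mags))  # rfft bin frequencies
    edges = log_bin_edges(n_bins, f_min, f_max)
    out = np.empty(n_bins)
    for b in range(n_bins):
        mask = (freqs >= edges[b]) & (freqs < edges[b + 1])
        if mask.any():
            out[b] = mags[mask].mean()
        else:
            # Low bins are narrower than one FFT bin: interpolate instead.
            center = np.sqrt(edges[b] * edges[b + 1])
            out[b] = np.interp(center, freqs, mags)
    return out
```

With a 2048-sample window at 44.1 kHz, FFT bins are about 21.5 Hz apart, so the lowest log bins contain no FFT bin at all. That is why the sketch falls back to interpolation there.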
Linear spacing: equal frequency spacing; rarely used.
Default band boundaries:
| Band | Range | Typical Content |
|---|---|---|
| Bass | 20-250 Hz | Kick drum, bass guitar, sub |
| Mid | 250-4000 Hz | Vocals, guitars, snare |
| High | 4000-20000 Hz | Cymbals, hi-hats, air |
Bands are configurable via AudioConfig.
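To make the band split concrete, the sketch below computes the mean power of one STFT frame within each default band, assuming NumPy. The `BANDS` dictionary and `band_energies` function are hypothetical stand-ins rather than the AudioConfig API, and the 0-1 normalization step is omitted because its method is not described here.

```python
import numpy as np

# Default band boundaries from the table above, in Hz.
BANDS = {"bass": (20, 250), "mid": (250, 4000), "high": (4000, 20000)}

def band_energies(magnitudes, sample_rate, bands=BANDS):
    """Mean power per band for one STFT frame (unnormalized)."""
    mags = np.asarray(magnitudes, dtype=float)
    freqs = np.linspace(0.0, sample_rate / 2, len(mags))  # rfft bin frequencies
    power = mags ** 2
    result = {}
    for name, (lo, hi) in bands.items():
        mask = (freqs >= lo) & (freqs < hi)
        result[name] = float(power[mask].mean()) if mask.any() else 0.0
    return result

# A frame with energy only near 108 Hz (bin 5 of a 2048-point FFT at 44.1 kHz).
frame = np.zeros(1025)
frame[5] = 2.0
energies = band_energies(frame, 44100)
```

Passing a different `bands` mapping is enough to try other boundaries, for example a separate sub-bass band below 60 Hz.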
Scene detection uses energy-based segmentation, without machine learning.
1. Compute rolling mean/std of RMS energy
2. Identify significant energy transitions
3. Classify segments based on:
   - Position (start/end of track)
   - Energy level (low/medium/high)
   - Energy trend (rising/falling/stable)
| Scene | Energy | Position | Trend |
|---|---|---|---|
| intro | Low | Start | - |
| verse | Medium | Any | Stable |
| build | Medium→High | Any | Rising |
| drop | High | Any | Peak |
| breakdown | Low | After drop | Falling |
| outro | Low | End | Falling |
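The table can be read as a small rule set. The sketch below is one simplified way to encode it. The thresholds, argument names, and fallback choices are all assumptions for illustration and do not come from the project.

```python
def classify_segment(mean_energy, trend, position, after_drop=False,
                     low=0.33, high=0.66):
    """Label one segment using the rules in the table above.

    mean_energy: average RMS in 0-1; trend: "rising", "falling" or "stable";
    position: "start", "middle" or "end". Thresholds are illustrative.
    """
    if mean_energy < low:
        if position == "start":
            return "intro"
        if position == "end":
            return "outro"
        if after_drop:
            return "breakdown"
        return "verse"  # fallback: the table has no rule for this case
    if mean_energy >= high:
        return "drop"
    # Medium energy: rising means a build, anything else is treated as verse.
    return "build" if trend == "rising" else "verse"

labels = [
    classify_segment(0.10, "stable", "start"),
    classify_segment(0.50, "stable", "middle"),
    classify_segment(0.55, "rising", "middle"),
    classify_segment(0.90, "stable", "middle"),
    classify_segment(0.20, "falling", "middle", after_drop=True),
    classify_segment(0.10, "falling", "end"),
]
```

For this made-up six-segment track, the labels follow the table's order from intro through outro.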