music-video-creator-showcase

Audio Analysis

Deep dive into the STFT-based audio analysis pipeline.

Short-Time Fourier Transform (STFT)

The STFT is the core of Music Video Creator’s audio analysis: every feature described below is derived from its spectrogram.

Parameters

Parameter        Default       Description
window_size      2048 samples  FFT window size (~46 ms at 44.1 kHz)
hop_length       512 samples   Step between successive windows (~11.6 ms at 44.1 kHz); adjacent windows overlap by 1536 samples (75%)
window_function  Hann          Windowing function
sample_rate      Auto-detect   Read from the input WAV file
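
The framing and windowing that these parameters describe can be sketched as follows. This is a minimal illustration using only the standard library, not the project's implementation: the names hann and stft_magnitudes are hypothetical, and the naive DFT stands in for the optimized FFT a real pipeline would use.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def stft_magnitudes(samples, window_size=2048, hop_length=512):
    """Return one magnitude spectrum (window_size // 2 + 1 bins) per frame."""
    window = hann(window_size)
    n_bins = window_size // 2 + 1
    frames = []
    # Slide the window forward by hop_length samples per frame.
    for start in range(0, len(samples) - window_size + 1, hop_length):
        chunk = [samples[start + i] * window[i] for i in range(window_size)]
        spectrum = []
        for k in range(n_bins):
            # Naive O(N^2) DFT: fine for a demo, far too slow for real audio.
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / window_size)
                for n in range(window_size)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames
```

Bin k of each frame corresponds to the frequency k × sample_rate / window_size.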

Trade-offs

Larger window_size:
  ✓ Better frequency resolution
  ✗ Worse time resolution (smeared transients)

Smaller window_size:
  ✓ Better time resolution (crisp transients)
  ✗ Worse frequency resolution (blurred bass)

The default (2048) provides a good balance for music visualization.
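
The trade-off can be made concrete with a little arithmetic. The helper below is a hypothetical illustration (stft_resolution is not a project function) that turns a window size, hop length, and sample rate into the frequency spacing and time spans they imply.

```python
def stft_resolution(window_size, hop_length, sample_rate):
    """Frequency and time resolution implied by the STFT parameters."""
    return {
        "bin_spacing_hz": sample_rate / window_size,      # width of one FFT bin
        "window_ms": 1000 * window_size / sample_rate,    # time span of one window
        "hop_ms": 1000 * hop_length / sample_rate,        # step between frames
        "overlap": 1 - hop_length / window_size,          # fraction shared by neighbours
    }
```

For the defaults at 44.1 kHz this gives roughly 21.5 Hz per bin, a 46.4 ms window, an 11.6 ms hop, and 75% overlap. Doubling window_size halves the bin spacing and doubles the window duration.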

Feature Extraction

Features extracted per video frame from the STFT spectrogram.

Energy Features

Feature      Range  Description                         Visual Mapping
rms          0-1    Root mean square energy (loudness)  Size, scale, brightness
bass_energy  0-1    Energy in 20-250 Hz band            Disk rotation, terrain height
mid_energy   0-1    Energy in 250-4000 Hz band          Particle velocity, ring count
high_energy  0-1    Energy in 4000-20000 Hz band        Sparkle, edge effects
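
One way to compute these features from a single frame is sketched below. The source does not say how values are scaled to 0-1, so this sketch assumes each band is expressed as its share of the frame's total spectral power; BANDS_HZ, rms, and band_energies are illustrative names, not the project's API.

```python
import math

# Band edges taken from the table above (Hz).
BANDS_HZ = {"bass": (20, 250), "mid": (250, 4000), "high": (4000, 20000)}

def rms(samples):
    """Root mean square of time-domain samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def band_energies(magnitudes, sample_rate, window_size):
    """Share of total spectral power in each band (assumed normalization)."""
    bin_hz = sample_rate / window_size
    power = [m * m for m in magnitudes]
    total = sum(power) or 1.0  # avoid division by zero on silence
    out = {}
    for name, (lo, hi) in BANDS_HZ.items():
        in_band = sum(p for k, p in enumerate(power) if lo <= k * bin_hz < hi)
        out[name] = in_band / total
    return out
```

Because power below 20 Hz or above 20 kHz counts toward the total but belongs to no band, the three shares can sum to slightly less than 1.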

Spectral Features

Feature            Range  Description                Visual Mapping
spectral_centroid  Hz     “Brightness” of sound      Color hue shift
peak_frequency     Hz     Dominant frequency         Not typically used
onset_strength     0-1    Transient/attack strength  Glitch intensity
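
The first two features have standard definitions, sketched here for one magnitude spectrum. The function names are illustrative; the centroid is the magnitude-weighted mean frequency and the peak is the frequency of the largest bin.

```python
def spectral_centroid(magnitudes, sample_rate, window_size):
    """Magnitude-weighted mean frequency in Hz (0.0 for a silent frame)."""
    bin_hz = sample_rate / window_size
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(k * bin_hz * m for k, m in enumerate(magnitudes)) / total

def peak_frequency(magnitudes, sample_rate, window_size):
    """Frequency in Hz of the bin with the largest magnitude."""
    k = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return k * sample_rate / window_size
```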

Beat Features

Feature        Type/Range  Description           Visual Mapping
is_beat        bool        Hard beat marker      Trigger zoom/flash/shake
beat_strength  0-1         Smooth beat envelope  Continuous pulsing

Beat Detection

Beat detection uses spectral flux followed by peak picking.

Algorithm

1. Compute onset envelope from spectral flux:
   onset[i] = sum(max(spectrum[i] - spectrum[i-1], 0)²)

2. Smooth with moving average (window=5)

3. Find peaks above threshold:
   - threshold = max(onset) × threshold_ratio
   - min_interval between peaks

4. Convert peak frames to timestamps
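
The four steps above can be sketched as three small functions. This is an illustration under stated assumptions rather than the project's code: the function names are hypothetical, frame_rate means STFT frames per second (sample_rate / hop_length), the moving average is centered, and a peak is taken to be a strict rise followed by a non-rise.

```python
def spectral_flux(frames):
    """Step 1: sum of squared positive magnitude changes between frames."""
    onset = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        onset.append(sum(max(c - p, 0.0) ** 2 for c, p in zip(cur, prev)))
    return onset

def moving_average(values, window=5):
    """Step 2: centered moving average, shortened at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        seg = values[max(0, i - half): i + half + 1]
        out.append(sum(seg) / len(seg))
    return out

def pick_beats(onset, frame_rate, threshold_ratio=0.3, min_interval_sec=0.15):
    """Steps 3-4: local maxima above threshold, returned as timestamps."""
    if not onset:
        return []
    threshold = max(onset) * threshold_ratio
    min_gap = min_interval_sec * frame_rate  # minimum spacing in frames
    beats, last = [], None
    for i in range(1, len(onset) - 1):
        is_peak = onset[i] > onset[i - 1] and onset[i] >= onset[i + 1]
        far_enough = last is None or i - last >= min_gap
        if is_peak and onset[i] > threshold and far_enough:
            beats.append(i / frame_rate)  # frame index to seconds
            last = i
    return beats
```

A typical call chain would be pick_beats(moving_average(spectral_flux(frames)), frame_rate).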

Parameters

Parameter         Default  Effect
threshold_ratio   0.3      Lower = more beats detected
min_interval_sec  0.15     Prevents double-triggers
decay_rate        0.85     How fast beat_strength fades
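
The role of decay_rate can be illustrated with a simple exponential envelope. The source does not spell out the exact update rule, so this sketch assumes the envelope jumps to 1.0 on a beat frame and is multiplied by decay_rate on every other frame; beat_strength_envelope is a hypothetical name.

```python
def beat_strength_envelope(is_beat, decay_rate=0.85):
    """Turn per-frame beat flags into a smooth 0-1 envelope (assumed rule)."""
    strength = 0.0
    out = []
    for beat in is_beat:
        # Reset on a beat, otherwise fade geometrically.
        strength = 1.0 if beat else strength * decay_rate
        out.append(strength)
    return out
```

Under this assumption a value of 0.85 drops the envelope below 0.5 after five frames without a beat, since 0.85^5 ≈ 0.44.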

Tempo Estimation

Tempo is estimated by two methods; the estimate with the higher confidence wins.

Method 1: Autocorrelation

1. Compute autocorrelation of onset envelope
2. Find peak in BPM range (60-200 BPM, converted to lag frames)
3. Convert peak lag to BPM
4. Confidence = (peak - mean) / (2 × std)
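
A minimal sketch of the autocorrelation method follows. Several details are assumptions not stated in the source: the envelope is mean-centered before correlating, the mean and std in the confidence formula are taken over the searched lag range, and frame_rate is the number of onset frames per second. tempo_autocorr is a hypothetical name.

```python
import statistics

def tempo_autocorr(onset, frame_rate, bpm_min=60.0, bpm_max=200.0):
    """Return (bpm, confidence) from the onset envelope, or None if too short."""
    n = len(onset)
    if n < 2:
        return None
    mean = sum(onset) / n
    x = [v - mean for v in onset]  # assumption: remove the DC offset first
    # A tempo of B BPM corresponds to a lag of frame_rate * 60 / B frames.
    lag_min = max(1, round(frame_rate * 60 / bpm_max))
    lag_max = min(n - 1, round(frame_rate * 60 / bpm_min))
    if lag_max <= lag_min:
        return None
    ac = [
        sum(x[i] * x[i + lag] for i in range(n - lag))
        for lag in range(lag_min, lag_max + 1)
    ]
    best = max(range(len(ac)), key=ac.__getitem__)
    std = statistics.pstdev(ac)
    confidence = (ac[best] - statistics.fmean(ac)) / (2 * std) if std > 0 else 0.0
    bpm = 60 * frame_rate / (lag_min + best)
    return bpm, confidence
```

Note that this confidence is not bounded by 1; a sharply periodic envelope can score well above it.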

Method 2: Beat Intervals

1. Calculate time between detected beats
2. Filter outliers (< 0.5× or > 2× median)
3. BPM = 60 / mean_interval
4. Confidence = 1 - coefficient_of_variation
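
The beat-interval method and the final selection can be sketched as below. The names tempo_from_beats and pick_tempo are illustrative, and the coefficient of variation is computed here as population standard deviation divided by mean, which is one common reading of the step above.

```python
import statistics

def tempo_from_beats(beat_times):
    """Return (bpm, confidence) from beat timestamps in seconds, or None."""
    intervals = [b - a for a, b in zip(beat_times, beat_times[1:])]
    if not intervals:
        return None
    median = statistics.median(intervals)
    # Drop intervals shorter than 0.5x or longer than 2x the median.
    kept = [d for d in intervals if 0.5 * median <= d <= 2 * median]
    if not kept:
        return None
    mean = statistics.fmean(kept)
    if mean <= 0:
        return None
    cv = statistics.pstdev(kept) / mean  # coefficient of variation
    return 60 / mean, 1 - cv

def pick_tempo(*estimates):
    """Keep the (bpm, confidence) pair with the highest confidence."""
    valid = [e for e in estimates if e is not None]
    return max(valid, key=lambda e: e[1]) if valid else None
```

The two confidence formulas are on different scales, so a direct comparison favors whichever method happens to produce larger numbers; the source does not say whether the project rescales them before comparing.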

Spectrum Binning

For GPU shaders, the spectrum is resampled to 128 bins.

Logarithmic Binning (Default)

Better for music: it matches human perception of pitch, which is roughly logarithmic in frequency.

Bin 0:    20-25 Hz    (sub-bass)
Bin 32:   ~100 Hz     (bass)
Bin 64:   ~500 Hz     (low-mids)
Bin 96:   ~2500 Hz    (presence)
Bin 127:  ~16000 Hz   (brilliance)
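
Logarithmic binning can be sketched as follows. The endpoints (20 Hz and 20 kHz) and the choice of taking the maximum magnitude within each output bin are assumptions, so the edges produced here will not necessarily reproduce the approximate frequencies listed above; log_bin_edges and rebin_log are hypothetical names.

```python
import math

def log_bin_edges(n_bins=128, f_min=20.0, f_max=20000.0):
    """n_bins + 1 edges with a constant frequency ratio between neighbours."""
    ratio = f_max / f_min
    return [f_min * ratio ** (i / n_bins) for i in range(n_bins + 1)]

def rebin_log(magnitudes, sample_rate, window_size,
              n_bins=128, f_min=20.0, f_max=20000.0):
    """Resample a linear FFT magnitude spectrum onto log-spaced bins."""
    edges = log_bin_edges(n_bins, f_min, f_max)
    bin_hz = sample_rate / window_size
    out = []
    for lo, hi in zip(edges, edges[1:]):
        k_lo = int(lo / bin_hz)
        # Always cover at least one FFT bin: at low frequencies the log bins
        # are narrower than the FFT spacing, so neighbours share a source bin.
        k_hi = max(k_lo + 1, math.ceil(hi / bin_hz))
        seg = magnitudes[k_lo:k_hi]
        out.append(max(seg) if seg else 0.0)
    return out
```

The resulting 128 values can be uploaded to a shader as a 1D texture or uniform array.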

Linear Binning

Equal frequency spacing - rarely used.

Frequency Bands

Default band boundaries:

Band  Range          Typical Content
Bass  20-250 Hz      Kick drum, bass guitar, sub
Mid   250-4000 Hz    Vocals, guitars, snare
High  4000-20000 Hz  Cymbals, hi-hats, air

Bands are configurable via AudioConfig.

Scene Detection

Energy-based segmentation without ML.

Algorithm

1. Compute rolling mean/std of RMS energy
2. Identify significant energy transitions
3. Classify segments based on:
   - Position (start/end of track)
   - Energy level (low/medium/high)
   - Energy trend (rising/falling/stable)
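
Steps 1 and 2 can be illustrated with a trailing rolling window. The source does not define what counts as a significant transition, so this sketch assumes one: a frame whose RMS differs from the rolling mean by more than k rolling standard deviations. The function name and the window, k, and min_std parameters are all hypothetical.

```python
import statistics

def energy_transitions(rms_values, window=8, k=2.0, min_std=0.02):
    """Indices where RMS departs sharply from its recent rolling statistics."""
    hits = []
    for i in range(window, len(rms_values)):
        history = rms_values[i - window:i]  # trailing window, excludes frame i
        mean = statistics.fmean(history)
        # Floor the deviation so flat passages do not trigger on tiny changes.
        std = max(statistics.pstdev(history), min_std)
        if abs(rms_values[i] - mean) > k * std:
            hits.append(i)
    return hits
```

Consecutive hits usually belong to the same transition, so a real segmenter would merge them before classifying the resulting segments.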

Scene Types

Scene      Energy       Position    Trend
intro      Low          Start       -
verse      Medium       Any         Stable
build      Medium→High  Any         Rising
drop       High         Any         Peak
breakdown  Low          After drop  Falling
outro      Low          End         Falling
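
The table can be read as a small rule set. The sketch below is one possible encoding, not the project's classifier: the function name, the string labels, and the order in which the rules are checked are all assumptions, and low-energy segments that match no row fall back to the trend rules.

```python
def classify_scene(energy, trend, position, after_drop=False):
    """Map segment properties to a scene label.

    energy:   "low", "medium", or "high"
    trend:    "rising", "falling", "stable", or "peak"
    position: "start", "middle", or "end"
    """
    if energy == "low":
        if position == "start":
            return "intro"
        if position == "end":
            return "outro"
        if after_drop:
            return "breakdown"
    if energy == "high":
        return "drop"
    if trend == "rising":
        return "build"
    return "verse"  # fallback for stable or unmatched segments
```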