Audio Analysis

Deep dive into the STFT-based audio analysis pipeline.

Short-Time Fourier Transform (STFT)

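The STFT is the core of Music Video Creator’s audio analysis: the signal is cut into short overlapping windows, and each window is transformed into a frequency spectrum.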

Parameters

Parameter	Default	Description
`window_size`	2048 samples	FFT window size (~46 ms at 44.1 kHz)
`hop_length`	512 samples	Step between successive windows (~12 ms at 44.1 kHz, i.e. 75% overlap)
`window_function`	Hann	Windowing function
`sample_rate`	Auto-detect	From input WAV file
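To make the roles of `window_size`, `hop_length`, and the Hann window concrete, here is a minimal pure-Python sketch of an STFT. The function names `hann` and `stft_magnitudes` are illustrative and not taken from the project, and the naive DFT is used only to keep the example self-contained; a real pipeline would call an FFT routine.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def stft_magnitudes(samples, window_size=2048, hop_length=512):
    """Return one magnitude spectrum (bins 0..window_size // 2) per frame.

    Uses a naive O(N^2) DFT for clarity; this is a sketch, not the
    project's implementation.
    """
    window = hann(window_size)
    n_bins = window_size // 2 + 1
    frames = []
    # Slide the window forward by hop_length samples each step.
    for start in range(0, len(samples) - window_size + 1, hop_length):
        chunk = [samples[start + i] * window[i] for i in range(window_size)]
        spectrum = []
        for k in range(n_bins):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / window_size)
                      for n in range(window_size))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames
```

With a small window (for example `window_size=64, hop_length=16`) this runs quickly enough to experiment with; a sine wave that completes 8 cycles per window produces its largest magnitude in bin 8.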

Trade-offs

Larger window_size:
  ✓ Better frequency resolution
  ✗ Worse time resolution (smeared transients)

Smaller window_size:
  ✓ Better time resolution (crisp transients)
  ✗ Worse frequency resolution (blurred bass)

The default (2048) provides a good balance between the two for music visualization.
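The trade-off can be put in numbers. The sketch below (the helper name `stft_resolution` is illustrative) computes the window duration, hop duration, width of one frequency bin, and overlap fraction. At the defaults and 44.1 kHz this gives roughly 46.4 ms per window and 21.5 Hz per bin. The loop assumes the hop stays at one quarter of the window, matching the default 512/2048 ratio.

```python
def stft_resolution(window_size, hop_length, sample_rate):
    """Return (window_ms, hop_ms, bin_width_hz, overlap_fraction)."""
    window_ms = 1000.0 * window_size / sample_rate
    hop_ms = 1000.0 * hop_length / sample_rate
    bin_width_hz = sample_rate / window_size
    overlap = 1.0 - hop_length / window_size
    return window_ms, hop_ms, bin_width_hz, overlap

# Compare a small, the default, and a large window at 44.1 kHz.
for size in (512, 2048, 8192):
    w_ms, h_ms, df, ov = stft_resolution(size, size // 4, 44100)
    print(f"window={size}: {w_ms:.1f} ms per window, {df:.1f} Hz per bin")
```

Quadrupling the window makes each bin four times narrower but also makes each window four times longer, which is why transients smear.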

Feature Extraction

Features extracted per video frame from the STFT spectrogram.

Energy Features

Feature	Range	Description	Visual Mapping
`rms`	0-1	Root mean square energy (loudness)	Size, scale, brightness
`bass_energy`	0-1	Energy in 20-250 Hz band	Disk rotation, terrain height
`mid_energy`	0-1	Energy in 250-4000 Hz band	Particle velocity, ring count
`high_energy`	0-1	Energy in 4000-20000 Hz band	Sparkle, edge effects
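As a sketch of how these values might be derived from one frame, the code below computes RMS from time-domain samples and band energies from a magnitude spectrum. The names are illustrative. Normalising each band by the total spectral energy is an assumption made here to keep values in 0-1; the project may scale differently.

```python
import math

# Band edges in Hz, taken from the table above.
BANDS = {"bass": (20.0, 250.0), "mid": (250.0, 4000.0), "high": (4000.0, 20000.0)}

def rms(samples):
    """Root mean square of a block of time-domain samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def band_energies(magnitudes, sample_rate, window_size):
    """Share of spectral energy in each band, so each value lies in 0-1.

    Assumption: normalisation by total energy; not confirmed by the source.
    """
    bin_hz = sample_rate / window_size  # frequency width of one FFT bin
    power = [m * m for m in magnitudes]
    total = sum(power) or 1.0  # avoid dividing by zero on silence
    result = {}
    for name, (lo, hi) in BANDS.items():
        band = sum(p for k, p in enumerate(power) if lo <= k * bin_hz < hi)
        result[name] = band / total
    return result
```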

Spectral Features

Feature	Range	Description	Visual Mapping
`spectral_centroid`	Hz	“Brightness” of sound	Color hue shift
`peak_frequency`	Hz	Dominant frequency	Not typically used
`onset_strength`	0-1	Transient/attack strength	Glitch intensity
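The spectral centroid is commonly defined as the magnitude-weighted mean frequency, and the peak frequency as the frequency of the largest bin. A minimal sketch under that assumption (function names are illustrative):

```python
def spectral_centroid(magnitudes, sample_rate, window_size):
    """Magnitude-weighted mean frequency in Hz (0.0 for a silent frame)."""
    bin_hz = sample_rate / window_size
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(k * bin_hz * m for k, m in enumerate(magnitudes)) / total

def peak_frequency(magnitudes, sample_rate, window_size):
    """Frequency in Hz of the bin with the largest magnitude."""
    bin_hz = sample_rate / window_size
    peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return peak_bin * bin_hz
```

A frame dominated by cymbals pulls the centroid upward, while a frame dominated by a kick drum pulls it down, which is what makes it usable as a hue control.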

Beat Features

Feature	Type	Description	Visual Mapping
`is_beat`	bool	Hard beat marker	Trigger zoom/flash/shake
`beat_strength`	0-1	Smooth beat envelope	Continuous pulsing

Beat Detection

Beats are detected with spectral flux followed by peak picking.

Algorithm

1. Compute onset envelope from spectral flux (summed over frequency bins k):
   onset[i] = sum_k(max(spectrum[i][k] - spectrum[i-1][k], 0)²)

2. Smooth with moving average (window=5)

3. Find peaks above threshold:
   - threshold = max(onset) × threshold_ratio
   - enforce a minimum gap of min_interval_sec between accepted peaks

4. Convert peak frames to timestamps
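The four steps above can be sketched in pure Python as follows. The function names are illustrative, the edge handling in the moving average is an assumption, and the peak picker is a simple greedy one that keeps the first peak inside each minimum interval; the project may resolve such conflicts differently.

```python
def onset_envelope(frames):
    """Spectral flux: sum of squared positive magnitude differences per frame."""
    env = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        env.append(sum(max(c - p, 0.0) ** 2 for p, c in zip(prev, cur)))
    return env

def moving_average(values, window=5):
    """Centered moving average; edges average over the samples available."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def pick_beats(onset, frame_rate, threshold_ratio=0.3, min_interval_sec=0.15):
    """Return timestamps (seconds) of local maxima above the threshold."""
    if not onset:
        return []
    threshold = max(onset) * threshold_ratio
    min_gap = min_interval_sec * frame_rate  # minimum gap in frames
    peaks = []
    for i in range(1, len(onset) - 1):
        is_peak = onset[i] > onset[i - 1] and onset[i] >= onset[i + 1]
        if is_peak and onset[i] > threshold:
            if not peaks or i - peaks[-1] >= min_gap:
                peaks.append(i)
    # Step 4: convert frame indices to timestamps.
    return [i / frame_rate for i in peaks]
```

Here `frame_rate` is the rate of the onset envelope, that is `sample_rate / hop_length` (about 86 frames per second at the defaults).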

Parameters

Parameter	Default	Effect
`threshold_ratio`	0.3	Lower = more beats detected
`min_interval_sec`	0.15	Prevents double-triggers
`decay_rate`	0.85	How fast beat_strength fades
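One plausible reading of `decay_rate` is a per-frame multiplicative decay: the envelope jumps to 1.0 on each beat and is multiplied by `decay_rate` on every following frame. The sketch below assumes that interpretation, which the source does not spell out.

```python
def beat_strength_envelope(is_beat, decay_rate=0.85):
    """Jump to 1.0 on each beat, then multiply by decay_rate every frame.

    Assumption: exponential per-frame decay; the project's exact
    envelope shape is not documented here.
    """
    strength = 0.0
    out = []
    for beat in is_beat:
        strength = 1.0 if beat else strength * decay_rate
        out.append(strength)
    return out
```

Under this assumption, a value of 0.85 lets a pulse fall below 0.1 about 15 frames after the beat; values closer to 1.0 give longer, softer pulses.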

Tempo Estimation

Two methods are run, and the estimate with the higher confidence wins.

Method 1: Autocorrelation

Compute autocorrelation of onset envelope
Find peak in BPM range (60-200 BPM → lag frames)
Convert peak lag to BPM
Confidence = (peak - mean) / (2 × std)
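A minimal sketch of the autocorrelation method, assuming the mean and standard deviation in the confidence formula are taken over the autocorrelation values inside the searched lag range (the source does not say which population is used). The function name is illustrative.

```python
import math

def tempo_autocorr(onset, frame_rate, min_bpm=60.0, max_bpm=200.0):
    """Return (bpm, confidence) from the autocorrelation of the onset envelope."""
    mean = sum(onset) / len(onset)
    x = [v - mean for v in onset]  # remove DC so silence does not correlate
    # A tempo of B BPM corresponds to a lag of frame_rate * 60 / B frames.
    min_lag = max(1, int(round(frame_rate * 60.0 / max_bpm)))
    max_lag = min(len(x) - 1, int(round(frame_rate * 60.0 / min_bpm)))
    lags = list(range(min_lag, max_lag + 1))
    if not lags:
        return 0.0, 0.0
    ac = [sum(x[i] * x[i + lag] for i in range(len(x) - lag)) for lag in lags]
    best = max(range(len(ac)), key=ac.__getitem__)
    ac_mean = sum(ac) / len(ac)
    ac_std = math.sqrt(sum((a - ac_mean) ** 2 for a in ac) / len(ac))
    confidence = (ac[best] - ac_mean) / (2 * ac_std) if ac_std > 0 else 0.0
    bpm = 60.0 * frame_rate / lags[best]
    return bpm, confidence
```

Because the lag is an integer number of frames, the BPM estimate is quantised; finer hops give finer tempo resolution.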

Method 2: Beat Intervals

Calculate time between detected beats
Filter outliers (< 0.5× or > 2× median)
BPM = 60 / mean_interval
Confidence = 1 - coefficient_of_variation
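The interval method can be sketched with the standard library's `statistics` module. Clamping the confidence at zero and using the population standard deviation are assumptions; the function name is illustrative.

```python
import statistics

def tempo_from_beats(beat_times):
    """Return (bpm, confidence) from beat timestamps in seconds."""
    intervals = [b - a for a, b in zip(beat_times, beat_times[1:])]
    if not intervals:
        return 0.0, 0.0
    median = statistics.median(intervals)
    # Drop intervals shorter than 0.5x or longer than 2x the median.
    kept = [iv for iv in intervals if 0.5 * median <= iv <= 2.0 * median]
    mean = statistics.mean(kept)
    if mean <= 0:
        return 0.0, 0.0
    cv = statistics.pstdev(kept) / mean  # coefficient of variation
    confidence = max(0.0, 1.0 - cv)      # assumption: clamp at zero
    return 60.0 / mean, confidence
```

Given several `(bpm, confidence)` pairs, `max(estimates, key=lambda e: e[1])` selects the one with the highest confidence.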

Spectrum Binning

For GPU shaders, the spectrum is resampled to 128 bins.

Logarithmic Binning (Default)

Better for music because it matches the roughly logarithmic way humans perceive pitch.

Bin 0:    20-25 Hz    (sub-bass)
Bin 32:   ~100 Hz     (bass)
Bin 64:   ~500 Hz     (low-mids)
Bin 96:   ~2500 Hz    (presence)
Bin 127:  ~16000 Hz   (brilliance)
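For illustration, one way to build logarithmic bins is to space the band edges geometrically and average the FFT magnitudes that fall inside each band. The 20 Hz and 20 kHz endpoints and the function names below are assumptions, so the resulting edges will not necessarily match the approximate values listed above.

```python
import math

def log_bin_edges(n_bins=128, f_min=20.0, f_max=20000.0):
    """Return n_bins + 1 geometrically spaced band edges in Hz."""
    ratio = f_max / f_min
    return [f_min * ratio ** (i / n_bins) for i in range(n_bins + 1)]

def rebin_log(magnitudes, sample_rate, window_size, n_bins=128):
    """Average FFT magnitudes into log-spaced bins.

    Bands narrower than one FFT bin take the nearest FFT bin's value,
    so the low bins are never empty.
    """
    bin_hz = sample_rate / window_size
    edges = log_bin_edges(n_bins)
    last = len(magnitudes) - 1
    out = []
    for lo, hi in zip(edges, edges[1:]):
        k_lo = min(last, int(math.ceil(lo / bin_hz)))
        k_hi = min(last, int(math.floor(hi / bin_hz)))
        if k_hi < k_lo:
            # No FFT bin centre lies inside this band: use the nearest one.
            k = min(last, int(round((lo + hi) / 2 / bin_hz)))
            out.append(magnitudes[k])
        else:
            span = magnitudes[k_lo:k_hi + 1]
            out.append(sum(span) / len(span))
    return out
```

At the default window size one FFT bin is about 21.5 Hz wide, so the lowest output bins necessarily share FFT bins; a larger `window_size` is the only way to separate them further.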

Linear Binning

Equal frequency spacing. Rarely used, because it spends most bins on high frequencies and leaves only a few for bass.

Frequency Bands

Default band boundaries:

Band	Range	Typical Content
Bass	20-250 Hz	Kick drum, bass guitar, sub
Mid	250-4000 Hz	Vocals, guitars, snare
High	4000-20000 Hz	Cymbals, hi-hats, air

Bands are configurable via AudioConfig.

Scene Detection

Energy-based segmentation without ML.

Algorithm

1. Compute rolling mean/std of RMS energy
2. Identify significant energy transitions
3. Classify segments based on:
   - Position (start/end of track)
   - Energy level (low/medium/high)
   - Energy trend (rising/falling/stable)
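Steps 1 and 2 can be sketched as a trailing-window outlier test: a frame marks a transition when its RMS departs from the rolling mean by more than `k` standard deviations. The window length, the value of `k`, the small floor on the standard deviation, and the function name are all assumptions made for this example.

```python
import statistics

def find_transitions(rms_values, window=8, k=2.0, std_floor=1e-3):
    """Return frame indices where RMS departs sharply from the trailing mean."""
    boundaries = []
    for i in range(window, len(rms_values)):
        # Skip frames too close to the previous boundary so that one
        # jump is reported once while the window catches up.
        if boundaries and i - boundaries[-1] < window:
            continue
        history = rms_values[i - window:i]
        mean = statistics.mean(history)
        std = max(statistics.pstdev(history), std_floor)
        if abs(rms_values[i] - mean) > k * std:
            boundaries.append(i)
    return boundaries
```

The returned indices split the track into segments, each of which can then be summarised by its position, energy level, and trend.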

Scene Types

Scene	Energy	Position	Trend
intro	Low	Start	-
verse	Medium	Any	Stable
build	Medium→High	Any	Rising
drop	High	Any	Peak
breakdown	Low	After drop	Falling
outro	Low	End	Falling
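The table can be read as a small rule set. The sketch below is one possible encoding; the thresholds, the half-versus-half trend test, the fallback to "verse", and all function names are assumptions rather than the project's actual rules.

```python
def energy_level(mean_rms, low=0.33, high=0.66):
    """Quantise mean RMS (0-1) into a level; thresholds are assumptions."""
    if mean_rms < low:
        return "low"
    if mean_rms < high:
        return "medium"
    return "high"

def energy_trend(rms_values, tolerance=0.05):
    """Compare the mean of the second half of a segment against the first."""
    half = len(rms_values) // 2
    if half == 0:
        return "stable"
    first = sum(rms_values[:half]) / half
    second = sum(rms_values[half:]) / (len(rms_values) - half)
    if second - first > tolerance:
        return "rising"
    if first - second > tolerance:
        return "falling"
    return "stable"

def classify_scene(energy, trend, position, after_drop=False):
    """Map a segment summary onto a scene label from the table above.

    energy: "low" | "medium" | "high"
    trend: "rising" | "falling" | "stable"
    position: "start" | "middle" | "end"
    """
    if energy == "low":
        if position == "start":
            return "intro"
        if position == "end":
            return "outro"
        if after_drop:
            return "breakdown"
    if energy == "high":
        return "drop"
    if trend == "rising":
        return "build"
    return "verse"  # fallback for anything the table does not cover
```

For example, `classify_scene("medium", "rising", "middle")` yields "build", and a low-energy segment that follows a drop yields "breakdown".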