Web Audio API Architecture
PolyFish's audio system sits atop the Web Audio API, which offers low-latency, sample-accurate control over sound playback and mixing. The core design uses a three-channel gain node architecture: music, narration, and SFX (sound effects), each with its own volume control and a master gain node that routes everything to the system output.
The AudioContext is created lazily - not when the page first loads, but when the user first interacts with the page or audio is explicitly requested. This avoids allocating browser audio resources for sessions that never play sound. Even after creation, the context may start in the suspended state on mobile browsers - iOS Safari in particular enforces strict autoplay policies - and resuming it requires a user gesture.
Lazy Initialization - The AudioContext is not created until the first user interaction or until audio is explicitly requested. This respects browser autoplay policies and reduces overhead for users who never play audio. Once created, it remains open for the lifetime of the session.
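The lazy-creation pattern can be sketched as a small class. This is an illustrative sketch, not PolyFish's actual code: `LazyAudio` is a hypothetical name, and the injectable constructor exists only so the pattern can be exercised outside a browser - the real manager presumably uses `window.AudioContext` directly.

```javascript
// Sketch of lazy AudioContext creation: nothing is allocated until
// getContext() is first called, and the same context is reused for
// the rest of the session.
class LazyAudio {
  constructor(AudioContextCtor = globalThis.AudioContext) {
    this.AudioContextCtor = AudioContextCtor; // injectable for illustration
    this.ctx = null;                          // no context until first use
  }

  // Create the context on first use; reuse it afterwards.
  getContext() {
    if (!this.ctx) {
      this.ctx = new this.AudioContextCtor();
    }
    return this.ctx;
  }
}
```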
The three channels operate independently. Music tracks feed into the music gain node at 0.25 volume (25% of max) to sit underneath narration. Narration clips use the narration channel at 0.85 volume (85% of max) so the voice is prominent. SFX (creature vocalizations, water pops, clicks) use the SFX channel at 0.5 volume (50% of max). The master gain is set to 1.0, relying on the channel volumes to provide the overall mix balance.
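The channel layout described above can be sketched as a small routing helper. `createChannelGraph` is a hypothetical name; the volumes match the documented mix, and `ctx` stands in for the lazily created AudioContext.

```javascript
// Build the three-channel mixer: music, narration, and SFX gains all
// feed a master gain, which routes to the system output.
function createChannelGraph(ctx) {
  const master = ctx.createGain();
  master.gain.value = 1.0;          // overall balance comes from the channels
  master.connect(ctx.destination);

  const makeChannel = (volume) => {
    const g = ctx.createGain();
    g.gain.value = volume;
    g.connect(master);
    return g;
  };

  return {
    master,
    music: makeChannel(0.25),       // ambient bed under the narration
    narration: makeChannel(0.85),   // voice kept prominent
    sfx: makeChannel(0.5),          // creature calls, pops, clicks
  };
}
```

Sources then connect to the appropriate channel gain rather than to the destination directly, so each category can be rebalanced independently.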
Browser Autoplay Policy
Modern browsers restrict autoplay audio without user interaction - a policy designed to prevent intrusive sound from playing unexpectedly. iOS Safari is especially strict: the AudioContext starts in a suspended state and cannot produce sound until the user performs a gesture like a click, keydown, or touch event.
PolyFish handles this by checking the context state and resuming it if needed. The restartMusic() method, which begins playback of the ambient music loop, automatically resumes the context on the first call. If the context is already running, the resume call is a no-op; if it's suspended, it transitions to running and audio becomes audible.
Gesture Unlock - iOS requires an actual user gesture (click, tap, or key press) before audio can play. On PolyFish, the first time a user clicks the canvas or presses a key to interact with the scene, the audio context resumes automatically. Subsequent calls to audio playback methods work without additional intervention.
The fade-in ramp prevents the audible clicks and pops that occur when an audio source starts at full volume. A linear ramp from silence to the 0.25 music level over 2 seconds is gentle enough to go unnoticed while avoiding those digital artifacts.
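The resume-then-fade sequence might look like the following sketch. `startMusicWithFade` is a hypothetical name; the 0.25 target and 2-second ramp match the values described above, and a user gesture is assumed to have occurred so that `resume()` can succeed.

```javascript
// Resume a suspended context (no-op if already running), then start
// the music buffer with a 2-second fade-in to the 0.25 channel level.
async function startMusicWithFade(ctx, musicGain, buffer) {
  if (ctx.state === 'suspended') {
    await ctx.resume(); // requires a prior user gesture on iOS Safari
  }
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(musicGain);

  // Ramp from silence to 0.25 over 2 s to avoid a click at the start.
  const now = ctx.currentTime;
  musicGain.gain.setValueAtTime(0, now);
  musicGain.gain.linearRampToValueAtTime(0.25, now + 2);
  source.start(now);
  return source;
}
```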
Music Playlist System
PolyFish features three ambient music tracks that play in a continuous shuffled rotation, creating a background soundscape that never feels repetitive. The playlist uses the Fisher-Yates shuffle algorithm to randomize track order, so each cycle plays every track exactly once and predictable patterns are avoided.
When one track finishes, the system automatically advances to the next via an ended event listener. Each new track starts with a 2-second fade-in to mask the transition and create a crossfading effect. This gives the impression of continuous, seamless music rather than discrete clips starting and stopping.
The playlist reshuffles itself after every complete cycle (3 tracks), keeping the order varied. Note that a fresh shuffle can, by chance, place the just-finished track at the front of the new cycle, so preventing back-to-back repeats requires a small guard at the cycle boundary. The Fisher-Yates shuffle itself runs in O(n) time and produces a uniformly random permutation.
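A sketch of the shuffle plus the boundary guard that keeps a track from playing twice in a row. `shuffle` and `reshufflePlaylist` are hypothetical names; the real playlist code may structure this differently.

```javascript
// In-place Fisher-Yates shuffle: O(n), uniform over all permutations.
function shuffle(arr) {
  for (let i = arr.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [arr[i], arr[j]] = [arr[j], arr[i]];
  }
  return arr;
}

// Reshuffle between cycles; if the new first track is the one that just
// finished, swap it with a random later slot so nothing repeats back-to-back.
function reshufflePlaylist(tracks, lastPlayed) {
  shuffle(tracks);
  if (tracks.length > 1 && tracks[0] === lastPlayed) {
    const j = 1 + Math.floor(Math.random() * (tracks.length - 1));
    [tracks[0], tracks[j]] = [tracks[j], tracks[0]];
  }
  return tracks;
}
```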
Narration & Music Ducking
When narration plays, the music automatically "ducks" - its volume is reduced to let the voice come through clearly. This is a professional mixing technique used in podcasts, audiobooks, and films. PolyFish implements it as an event-triggered gain reduction that restores the original level when narration ends.
The system uses four timed narration cues that trigger at specific moments during the simulation:
- Welcome (4.15s) - Introduction to PolyFish and the ocean scene
- PolyFish Intro (7.15s) - Explanation of the geometric sea creatures
- Manatee Intro (74.15s) - The gentle manatee and its behavior
- Dolphin Intro (104.65s) - The energetic dolphin swimming through
Each narration clip is approximately 20-30 seconds long. Music ducking ramps down over 0.8 seconds to let the voice cut through cleanly. When narration ends, the music fades back to its original level over a 0.5-second ramp, creating a smooth transition.
const narrationCues = [
  { time: 4.15, clip: 'welcome', duration: 28 },
  { time: 7.15, clip: 'polyfish_intro', duration: 25 },
  { time: 74.15, clip: 'manatee_intro', duration: 22 },
  { time: 104.65, clip: 'dolphin_intro', duration: 18 },
];
checkNarrationTriggers(simulationTime) {
  for (const cue of narrationCues) {
    // Fire each cue once when simulation time enters its window; the
    // fired flag prevents re-triggering on consecutive frames, since
    // several frames can fall inside the ±0.05 s window.
    if (!cue.fired && Math.abs(simulationTime - cue.time) < 0.05) {
      cue.fired = true;
      this.triggerNarration(cue.clip, cue.duration);
    }
  }
}
triggerNarration(clipName, duration) {
  const buffer = this.audioBuffers.get(`narration_${clipName}`);
  const source = this.ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(this.narrationGain);

  // Duck music: ramp from the current level down to 0.075 over 0.8 s
  const now = this.ctx.currentTime;
  this.musicGain.gain.setValueAtTime(this.musicGain.gain.value, now);
  this.musicGain.gain.linearRampToValueAtTime(0.075, now + 0.8);

  // Restore after narration ends: ramp back up to 0.25 over 0.5 s
  this.musicGain.gain.setValueAtTime(0.075, now + duration);
  this.musicGain.gain.linearRampToValueAtTime(0.25, now + duration + 0.5);

  source.start(now);
}
The ducking ratio is 0.25 → 0.075, a reduction to 30% of the original music volume. This is aggressive enough to keep the narration intelligible while still maintaining some ambient presence. The 0.5-second restore ramp is smooth enough to feel natural but fast enough to re-establish the music soundscape before the next cue.
Positional vs Stereo Audio Demo
This demo lets you experience the difference between stereo panning and positional (3D spatial) audio. The sound source (shown as a circle) can be dragged around the canvas. In stereo mode, the volume remains equal in both ears and only pans left-right. In positional mode, the Web Audio API's PannerNode creates 3D spatial effects - as the source moves, the audio response mimics how your ears would perceive sound in three-dimensional space.
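For the stereo half of the demo, the dragged source's canvas position maps directly onto a StereoPannerNode pan value. `canvasXToPan` is a hypothetical helper, not code from the demo itself.

```javascript
// Map a canvas x coordinate to a StereoPannerNode pan value in [-1, 1]:
// left edge → -1 (fully left), center → 0, right edge → +1 (fully right).
function canvasXToPan(x, canvasWidth) {
  const pan = (x / canvasWidth) * 2 - 1;
  return Math.max(-1, Math.min(1, pan)); // clamp for drags past the edge
}
```

In stereo mode the result drives `stereoPanner.pan.value`; in positional mode the same drag coordinates would instead set a PannerNode's position, as in the spatial-audio code below.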
Spatial Audio for VR
PolyFish supports spatial audio positioning for creature vocalizations and ambient sound sources. The Web Audio API provides the PannerNode, which implements HRTF (Head-Related Transfer Function) panning - a technique that simulates 3D sound by using frequency-dependent phase and amplitude shifts that mimic how human ears locate sound sources in space.
Each creature sound source has a position in world space. The listener position (synced to the camera every frame) and the source position determine the pan angle and distance attenuation. PolyFish uses the inverse distance model over a 1-100 unit range: sounds close to the listener are loud, and gain falls off roughly in proportion to 1/distance as the source moves away.
HRTF Panning - The PannerNode uses head-related transfer functions to create convincing 3D spatial audio. Even on stereo headphones, HRTF makes it possible to perceive whether a sound is coming from above, below, in front, or behind. PolyFish uses this for creature calls and directional ambient effects.
// Spatial audio for creature vocalizations
playCreatureSFX(position, clipName) {
  const buffer = this.audioBuffers.get(`sfx_${clipName}`);
  const source = this.ctx.createBufferSource();
  const panner = this.ctx.createPanner();
  source.buffer = buffer;

  // Configure panner for inverse distance falloff
  panner.panningModel = 'HRTF';
  panner.distanceModel = 'inverse';
  panner.refDistance = 1;
  panner.maxDistance = 100;
  panner.rolloffFactor = 1;

  // Set source position in 3D space
  panner.positionX.value = position.x;
  panner.positionY.value = position.y;
  panner.positionZ.value = position.z;

  source.connect(panner);
  panner.connect(this.sfxGain);
  source.start(this.ctx.currentTime);
}
updateListenerPosition(cameraPosition, cameraForward) {
  const listener = this.ctx.listener;

  // Listener position synced to camera every frame
  listener.positionX.value = cameraPosition.x;
  listener.positionY.value = cameraPosition.y;
  listener.positionZ.value = cameraPosition.z;

  // Forward and up vectors for head orientation
  listener.forwardX.value = cameraForward.x;
  listener.forwardY.value = cameraForward.y;
  listener.forwardZ.value = cameraForward.z;
  listener.upX.value = 0;
  listener.upY.value = 1;
  listener.upZ.value = 0;
}
The listener's position and orientation are updated every frame in sync with the camera. This creates the illusion that sounds move around the listener's head as they move through the scene. A creature 10 meters to the left will sound like it's coming from the left speaker; as the listener turns, the pan angle shifts accordingly.
Loading & Lifecycle
Audio assets - music tracks, narration clips, and SFX - are loaded asynchronously in parallel using Promise.allSettled. This prevents a single failed fetch from blocking the entire audio system. Decoded audio buffers are cached in a Map for instant playback without re-decoding overhead.
When the audio manager is disposed (e.g., on page unload), the AudioContext is closed to release system resources. Any playing sources are stopped, and the buffer cache is cleared.
async loadAudioAssets() {
  const files = [
    { key: 'music_0', url: '/audio/ambient_track_1.mp3' },
    { key: 'music_1', url: '/audio/ambient_track_2.mp3' },
    { key: 'music_2', url: '/audio/ambient_track_3.mp3' },
    { key: 'narration_welcome', url: '/audio/narration_welcome.mp3' },
    { key: 'narration_polyfish_intro', url: '/audio/narration_polyfish.mp3' },
    { key: 'sfx_dolphin_click', url: '/audio/sfx_dolphin.mp3' },
    // ... more assets
  ];
  const results = await Promise.allSettled(
    files.map(async (f) => {
      const resp = await fetch(f.url);
      // fetch only rejects on network errors, so surface HTTP failures
      // (e.g. 404s) explicitly rather than letting decodeAudioData choke
      if (!resp.ok) {
        throw new Error(`HTTP ${resp.status} for ${f.url}`);
      }
      const arrayBuf = await resp.arrayBuffer();
      const decoded = await this.ctx.decodeAudioData(arrayBuf);
      this.audioBuffers.set(f.key, decoded);
    })
  );
  // Log any failures but continue
  for (const result of results) {
    if (result.status === 'rejected') {
      console.error('Audio asset load failed:', result.reason);
    }
  }
}
dispose() {
  if (this.currentTrack) {
    this.currentTrack.source.stop();
  }
  if (this.ctx) {
    this.ctx.close();
  }
  this.audioBuffers.clear();
}
Using Promise.allSettled rather than Promise.all ensures that a single corrupt or missing audio file doesn't prevent the rest from loading. The game can continue to run with degraded audio - music might be missing, but narration and SFX can still function. This graceful degradation is crucial for a robust web audio experience.
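Graceful degradation extends to playback: a guard along these lines skips any clip whose buffer never made it into the cache rather than throwing. `tryPlay` is a hypothetical helper, not part of the actual API.

```javascript
// Play a cached buffer if it exists; warn and skip if its load failed,
// so one missing asset never breaks the rest of the audio system.
function tryPlay(ctx, buffers, key, destination) {
  const buffer = buffers.get(key);
  if (!buffer) {
    console.warn(`Audio buffer missing, skipping: ${key}`);
    return null;
  }
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(destination);
  source.start(ctx.currentTime);
  return source;
}
```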