AudioResampler 16kHz→24kHz outputs near-silence when fed 10ms frames (realtime framing) #679

@rinarakaki

Versions

@livekit/rtc-node 0.13.29, linux x64 (also reproduced inside the @livekit/agents 1.4.4 STT pipeline, which constructs AudioResampler(16000, 24000) internally for realtime STT plugins).

Behavior

Upsampling 16 kHz mono pcm16 speech to 24 kHz with AudioResampler destroys the signal when the input is pushed in 10 ms frames (160 samples). That is the frame size AudioStream emits, so it is the framing used in the realtime case. Larger pushes progressively recover the signal:

| Push size | Output RMS (input RMS 3770) | Output sample count |
| --- | --- | --- |
| 160 samples (10 ms) | 34.6 (near-silence) | 177,600 (correct) |
| 1,600 samples (100 ms) | 2,549 (degraded) | 177,600 |
| 16,000 samples (1 s) | 4,280 (healthy) | 168,000 |

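The output sample count is correct in all cases; only the content collapses. The lower count for 1 s pushes is consistent with the repro loop pushing whole chunks only and skipping the trailing partial chunk. Memory aliasing or buffer reuse is not involved: snapshots copied at return time are identical to later reads of the same frames.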
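The sample counts in the table can be sanity-checked with a little arithmetic. The sketch below assumes an input of 118,400 samples (inferred from 177,600 × 16000 / 24000, not stated in the report) and an ideal resampler that emits exactly `pushed × outRate / inRate` samples. `expectedOutputSamples` is a hypothetical helper, not part of @livekit/rtc-node.

```javascript
// Hypothetical helper: ideal output sample count for the repro loop, which
// only pushes whole chunks (a trailing partial chunk is never pushed).
function expectedOutputSamples(inputSamples, chunk, inRate, outRate) {
  const pushed = Math.floor(inputSamples / chunk) * chunk; // samples actually pushed
  return Math.round((pushed * outRate) / inRate);
}

// Assumption: 118,400 input samples, inferred from 177,600 * 16000 / 24000.
const inputSamples = 118400;
for (const chunk of [160, 1600, 16000]) {
  console.log(`chunk=${chunk}: expected=${expectedOutputSamples(inputSamples, chunk, 16000, 24000)}`);
}
```

Under these assumptions the 10 ms and 100 ms runs push all 118,400 samples, while the 1 s run pushes seven whole chunks (112,000 samples), which gives 168,000 output samples.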

Impact

Any pipeline that feeds live AudioStream frames (10 ms) through AudioResampler(16000→24000) — e.g. @livekit/agents' base STT class resampling for OpenAI realtime STT (24 kHz) — sends near-silence to the provider. Server-side VAD never triggers, so the result is zero transcription events with no errors anywhere: a fully silent failure.
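Because the failure produces no errors, a level check around the resampler is one way to make it visible. The following is a minimal sketch, not something the library provides: `looksCollapsed` and its `minRatio` threshold of 0.8 are hypothetical, and the threshold is illustrative rather than tuned.

```javascript
// Root-mean-square level of a block of PCM samples (0 for an empty block).
function rms(samples) {
  if (samples.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < samples.length; i++) sum += samples[i] * samples[i];
  return Math.sqrt(sum / samples.length);
}

// Hypothetical guard: true when the resampled audio has lost most of its
// energy relative to the input. minRatio = 0.8 is an illustrative threshold,
// not a value from the library or the report.
function looksCollapsed(inputRms, outputRms, minRatio = 0.8) {
  if (inputRms === 0) return false; // silent input: nothing to compare against
  return outputRms / inputRms < minRatio;
}

// Levels reported in the table above (input RMS 3770).
console.log(looksCollapsed(3770, 34.6)); // 10 ms pushes  -> true
console.log(looksCollapsed(3770, 2549)); // 100 ms pushes -> true
console.log(looksCollapsed(3770, 4280)); // 1 s pushes    -> false
```

In a live pipeline the two levels would have to be measured over the same stretch of audio, for example by accumulating input and output frames for a second or two before comparing.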

Repro

import { AudioFrame, AudioResampler } from '@livekit/rtc-node';

// s16: Int16Array of 16kHz mono speech (we used ~7s of TTS audio)
const rms = (a) => { let s = 0; for (let i = 0; i < a.length; i++) s += a[i] * a[i]; return Math.sqrt(s / a.length); };
console.log('input rms:', rms(s16));

for (const chunk of [160, 1600, 16000]) {
  const r = new AudioResampler(16000, 24000);
  const out = []; let n = 0;
  for (let i = 0; i + chunk <= s16.length; i += chunk)
    for (const f of r.push(new AudioFrame(s16.subarray(i, i + chunk), 16000, 1, chunk))) {
      out.push(Int16Array.from(f.data)); n += f.samplesPerChannel;
    }
  const all = new Int16Array(n); let p = 0;
  for (const s of out) { all.set(s, p); p += s.length; }
  console.log(`chunk=${chunk}: outSamples=${n} rms=${rms(all).toFixed(1)}`);
}
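The repro leaves `s16` to the reader. For a run without a speech recording, a synthetic tone can stand in. This is only a sketch: `makeTone` is a hypothetical helper, the 7.4 s duration is inferred from the sample counts above rather than stated in the report, and a pure tone may not trigger the collapse the way the TTS speech did.

```javascript
// Hypothetical stand-in for the TTS recording: a mono sine tone as pcm16.
// Assumption: a pure tone may not reproduce the collapse that speech shows.
function makeTone(seconds, sampleRate = 16000, freqHz = 440, amplitude = 5332) {
  const n = Math.round(seconds * sampleRate);
  const out = new Int16Array(n);
  for (let i = 0; i < n; i++) {
    out[i] = Math.round(amplitude * Math.sin((2 * Math.PI * freqHz * i) / sampleRate));
  }
  return out;
}

// 7.4 s at 16 kHz is 118,400 samples; amplitude 5332 puts the RMS close to
// the report's input level of 3770 (amplitude / sqrt(2)).
const s16 = makeTone(7.4);
console.log('samples:', s16.length);
```

If the tone does not reproduce the near-silent output, real speech at 16 kHz mono is the safer input, since that is what the report measured.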

Workaround

Request the target rate from AudioStream directly (new AudioStream(track, { sampleRate: 24000 })) so WebRTC's internal resampler runs instead — output is correct and downstream transcription works.
