Streaming Read-Aloud: From Buffered Waits to Sentence-Level Pipelining
How we made Cereby speak its replies as it thinks them, instead of after it has finished thinking
The toggle that felt broken
Inside Cereby's chat input there is a small toggle: "Read AI responses aloud." When it is on, the assistant's replies are spoken back to the learner. It sounds simple. The first version was simple, and it felt slow.
The complaint was not abstract. With the toggle on, a student would ask a question, watch the answer stream onto the screen, finish reading it, and then the audio would start. The voice was always behind the eyes. For a feature whose entire reason to exist is eyes-off study (commute, kitchen, walking), being behind the eyes is the wrong failure mode.
We rebuilt the pipeline. The first sentence now plays roughly when the model finishes saying it, not when the model finishes the entire reply. For a four-paragraph answer, that is the difference between roughly 500ms and several seconds of dead air. Same toggle, same OpenAI tts-1 calls, same alloy voice, same per-character cost.
What was wrong with the first version
The old shape looked like this:
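(Reconstructed as a sketch from the description below; the helper names are illustrative, but the flow is the one we shipped.)

```ts
// Sketch of the old, buffered pipeline; helper names are illustrative.
declare function streamCompletion(prompt: string): AsyncIterable<string>;
declare function splitAtMost(text: string, maxChars: number): string[];
declare function synthesizeChunk(text: string): Promise<Blob>;
declare function playToEnd(mp3: Blob): Promise<void>;

async function speakReply(prompt: string): Promise<void> {
  // 1. Buffer the entire completion, even though tokens stream in.
  let fullText = "";
  for await (const delta of streamCompletion(prompt)) {
    fullText += delta; // nothing is spoken during generation
  }

  // 2. Chunk for the API limit (4000 characters), not for the ear.
  const chunks = splitAtMost(fullText, 4000);

  // 3. One TTS round-trip per chunk, strictly sequential: synthesize, play, repeat.
  for (const chunk of chunks) {
    const mp3 = await synthesizeChunk(chunk);
    await playToEnd(mp3);
  }
}
```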
This is the most obvious thing you would build, and it is wrong for two reasons that compound.
The model already streamed tokens. By collecting them into a single payload before doing anything, we threw away a head start we had already paid for: every second of generation was a second of silence. Then we called the speech endpoint one chunk at a time, waiting for each MP3 before requesting the next, so TTS round-trips stacked end to end instead of overlapping with playback.
Chunking made it worse. Chunks were sized for the API limit (4000 characters), not for the ear. A learner could finish reading an entire paragraph before its audio chunk had even been requested.
The result was the worst of two worlds: text that streamed at modern speeds, and audio that arrived like it was being faxed.
The new shape: stream the text, pipeline the speech
We split the problem into three layers and let each one stream independently.
The key insight is to stop treating the reply as a noun and start treating it as a stream of sentences. None of the three layers is exotic alone; the win is in composing them so the user-perceived latency is one TTS round-trip, and that round-trip happens during generation rather than after it.
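Composed, the hot path looks roughly like this (a sketch: SentenceSplitter and speakSentence are fleshed out in the next section, the other helpers are illustrative stand-ins):

```ts
declare function streamCompletion(prompt: string): AsyncIterable<string>;
declare function renderDelta(delta: string): void;
declare function speakSentence(sentence: string): void;
declare class SentenceSplitter {
  push(delta: string): string[];
  flush(): string | null;
}

// The three layers, composed: split sentences as tokens arrive, hand each one
// to the TTS queue, and let the speech route return raw audio behind it all.
async function streamAndSpeak(prompt: string): Promise<void> {
  const splitter = new SentenceSplitter();       // layer 1: incremental sentence splitter
  for await (const delta of streamCompletion(prompt)) {
    renderDelta(delta);                          // text keeps streaming to the screen
    for (const sentence of splitter.push(delta)) {
      speakSentence(sentence);                   // layer 2: playback queue + prefetch
    }
  }
  const tail = splitter.flush();                 // trailing fragment at stream end
  if (tail) speakSentence(tail);                 // layer 3 sits behind speakSentence:
}                                                // the speech route returning audio/mpeg
```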
How each layer works
The sentence splitter
Splitting on sentence boundaries while text is still arriving is fiddly: the trailing fragment may not be a complete sentence yet. The rule we settled on:
keep a "pending" buffer that grows with each delta
on every delta:
look for sentence terminators (. ? ! and language-aware variants)
for every terminator followed by whitespace (or end-of-buffer + idle):
emit the sentence up to and including the terminator
keep the rest as the new pending buffer
on stream end:
flush pending if non-empty
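A minimal TypeScript version of that rule might look like the following; the terminator set and the idle flush are simplified, and the class name and regex are ours rather than the production code:

```ts
// Incremental sentence splitter: feed streamed deltas, get back completed sentences.
class SentenceSplitter {
  private pending = "";

  // Feed one streamed delta; returns any sentences it completed.
  push(delta: string): string[] {
    this.pending += delta;
    const sentences: string[] = [];
    // A terminator (. ? ! plus CJK variants) followed by whitespace ends a sentence.
    const boundary = /([.?!。？！])\s+/g;
    let lastEnd = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      sentences.push(this.pending.slice(lastEnd, match.index + 1).trim());
      lastEnd = boundary.lastIndex;
    }
    this.pending = this.pending.slice(lastEnd); // keep the incomplete tail
    return sentences;
  }

  // Call when the LLM stream ends (or has been idle long enough).
  flush(): string | null {
    const rest = this.pending.trim();
    this.pending = "";
    return rest.length > 0 ? rest : null;
  }
}
```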
The TTS queue and prefetch cache
The audio hook keeps a playback queue, an object-URL cache keyed by sentence text, and an in-flight promise map so duplicate synthesis calls are deduplicated. The contract:
speakSentence(text):
push text onto queue
if nothing is currently playing:
start playback loop
schedule prefetch for the next two queue entries
playback loop:
take head of queue
if cache has it: play immediately
else if prefetch has it: await the in-flight promise, then play
else: synthesize, then play
on audio "ended": loop again, kick off prefetch for the new lookahead
The prefetch lookahead is two sentences deep. One was not enough to hide a TTS round-trip on short sentences; three wasted calls when users paused or cancelled. Two kept the gap below the threshold where listeners notice. Object URLs are revoked on eviction or unmount: long sessions otherwise leak tens of megabytes of audio blobs into the tab.
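A condensed TypeScript sketch of that contract; framework wiring, cancellation, eviction, and object-URL revocation are left out, and synthesize() stands in for the call to the speech route described next:

```ts
// Playback queue + two-deep prefetch (sketch; synthesize() is a stand-in for
// the POST to the speech route that returns an MP3 Blob).
declare function synthesize(text: string): Promise<Blob>;

const queue: string[] = [];
const urlCache = new Map<string, string>();          // sentence key -> object URL
const inFlight = new Map<string, Promise<string>>(); // dedup concurrent synthesis
let playing = false;                                 // synchronous flag, not React state

function getAudioUrl(text: string): Promise<string> {
  const key = text.slice(0, 100);                    // cheap, hash-free cache key
  if (urlCache.has(key)) return Promise.resolve(urlCache.get(key)!);
  if (!inFlight.has(key)) {
    inFlight.set(key, synthesize(text).then((blob) => {
      const url = URL.createObjectURL(blob);
      urlCache.set(key, url);
      inFlight.delete(key);
      return url;
    }));
  }
  return inFlight.get(key)!;
}

function prefetch(): void {
  // Warm the next two queue entries so playback never waits on a round-trip.
  queue.slice(0, 2).forEach((text) => void getAudioUrl(text));
}

export function speakSentence(text: string): void {
  queue.push(text);
  prefetch();
  if (!playing) void playLoop();
}

async function playLoop(): Promise<void> {
  playing = true;
  while (queue.length > 0) {
    const text = queue.shift()!;
    const audio = new Audio(await getAudioUrl(text)); // cached, in-flight, or fresh
    await new Promise<void>((resolve) => {
      audio.onended = () => resolve();
      audio.onerror = () => resolve();                // skip a bad sentence, keep going
      audio.play().catch(() => resolve());            // a rejected play() must not stall the loop
    });
    prefetch();                                       // keep the lookahead warm
  }
  playing = false;
}
```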
The speech endpoint
The route is a thin wrapper around OpenAI's tts-1 with the alloy voice. The meaningful change was response shape: for ephemeral chat audio it now returns raw audio/mpeg bytes, so the client turns the response blob into an object URL with a single call instead of decoding base64 on the hot path. For persisted audio (podcasts, lesson narration) it still returns the old JSON-plus-storage shape. One endpoint, two response modes, picked by the caller.
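A rough sketch of that shape, assuming a Next.js-style route handler and the official openai Node SDK; the raw flag and the persistToStorage helper are illustrative, not the actual route code:

```ts
import OpenAI from "openai";

const openai = new OpenAI();
declare function persistToStorage(audio: Buffer): Promise<{ url: string }>;

export async function POST(req: Request): Promise<Response> {
  const { text, raw } = (await req.json()) as { text: string; raw?: boolean };

  // One tts-1 call with the alloy voice for both modes.
  const speech = await openai.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: text,
  });
  const audio = Buffer.from(await speech.arrayBuffer());

  if (raw) {
    // Ephemeral chat audio: ship the bytes, let the client createObjectURL them.
    return new Response(audio, { headers: { "Content-Type": "audio/mpeg" } });
  }

  // Persisted audio (podcasts, lesson narration): the old JSON-plus-storage shape.
  const stored = await persistToStorage(audio);
  return Response.json({ url: stored.url });
}
```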
The failures that did not show up in the design doc
Users sometimes cancel mid-generation. Cancellation drops queued sentences and revokes prefetched object URLs; we do not abort the in-flight TTS request since it was usually already returning, but we throw the result away.
Very short replies (a one-word answer) hit a timing bug where the prefetch logic ran for a queue that was already empty. The fix was making the playback loop and the prefetch scheduler share a synchronous "currently playing" ref instead of a React state value, so the second speakSentence call does not see a stale snapshot before the first one has flipped the flag.
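In React terms, that fix looks roughly like this (a sketch; the hook and helper names are illustrative):

```tsx
import { useRef } from "react";

// The "currently playing" flag lives in a ref, so a second speakSentence call
// in the same tick sees the first call's write immediately. With useState, that
// second call would read a stale `false` and start a competing playback loop.
export function useReadAloud() {
  const queueRef = useRef<string[]>([]);
  const playingRef = useRef(false);

  async function playLoop(): Promise<void> {
    // ...the playback loop sketched earlier, draining queueRef.current...
    playingRef.current = false;
  }

  function speakSentence(text: string): void {
    queueRef.current.push(text);
    if (!playingRef.current) {
      playingRef.current = true; // synchronous write, no re-render needed
      void playLoop();
    }
  }

  return { speakSentence };
}
```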
Identical short sentences ("Sure!", "Got it.", "Let me think.") were being synthesized over and over. The cache key is the first 100 characters of the text. Cheap, hash-free, good enough.
Mobile Safari needs audio elements created inside a user gesture before programmatic plays are allowed. We prime an empty Audio object the first time the toggle is enabled, so subsequent plays go through without a gesture requirement.
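A minimal sketch of that priming, assuming the toggle's change handler is the user gesture (the function name is illustrative):

```ts
// Prime an Audio element inside the toggle's click handler so later
// programmatic play() calls are allowed on mobile Safari.
let primed = false;

export function onReadAloudToggled(enabled: boolean): void {
  if (!enabled || primed) return;
  const audio = new Audio();
  // play() on an empty element rejects immediately, but calling it inside the
  // gesture is what satisfies the autoplay policy; swallow the rejection.
  void audio.play().catch(() => {});
  primed = true;
}
```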
| Dimension | Before (buffered) | After (streaming + pipelined) |
|---|---|---|
| Time-to-first-audio | full LLM latency + 1 TTS call | ~1 sentence of LLM latency + 1 TTS call |
| Inter-sentence gap | 1 TTS call (sequential) | near zero (prefetched) |
| TTS calls per reply | one per <=4000-char chunk | one per sentence |
| Wire format (chat) | buffered JSON | SSE deltas (opt-in) |
| Wire format (TTS) | base64 in JSON | raw audio/mpeg |
| Client cache | none | sentence-keyed object URLs + in-flight promise dedup |
Buffer one layer up, never two. When two streaming systems sit back to back (LLM then TTS), buffering between them throws away both streams. The fix is almost never a bigger buffer; it is removing the buffer.
Next: adaptive lookahead (the right prefetch depth for lists and short bullets differs from prose) and end-to-end time-to-first-sentence-audible observability, which is the metric that actually maps to the original complaint. Voice selection and a cross-session audio cache are further out.
For the input side of this story (push-to-talk dictation and AI-generated podcasts) see Voice-Powered Learning: Cereby's New Audio Capabilities.
