Show HN: I built a sub-500ms latency voice agent from scratch

(ntik.me)

244 points | by nicktikhonov 6 hours ago

31 comments

jedberg 3 hours ago
Oh, this is really interesting to me. This is what I worked on at Amazon Alexa (and have patents on).
An interesting fact I learned at the time: The median delay between human speakers during a conversation is 0ms (zero). In other words, in many cases, the listener starts speaking before the speaker is done. You've probably experienced this, and you talk about how you "finish each other's sentences".
It's because your brain is predicting what they will say while they speak, and processing an answer at the same time. It's also why when they say what you didn't expect, you say, "what?" and then answer half a second later, when your brain corrects.
Fact 2: Humans expect a delay on their voice assistants, for two reasons. One reason is because they know it's a computer that has to think. And secondly, cell phones. Cell phones have a built in delay that breaks human to human speech, and your brain thinks of a voice assistant like a cell phone.
Fact 3: Almost no response from Alexa is under 500ms. Even the ones that are served locally, like "what time is it".
Semantic end-of-turn is the key here. It's something we were working on years ago, but didn't have the compute power to do it. So at least back then, end-of-turn was just 300ms of silence.
This is pretty awesome. It's been a few years since I worked on Alexa (and everything I wrote has been talked about publicly). But I do wonder if they've made progress on semantic detection of end-of-turn.
Edit: Oh yeah, you are totally right about geography too. That was a huge unlock for Alexa. Getting the processing closer to the user.
[-]
- nicktikhonov 3 hours ago
  This is fascinating, thanks for sharing! I wonder why amazon/google/apple didn't hop on the voice assistant/agent train in the last few years. All 3 have existing products with existing users and can pretty much define and capture the category with a single over-the-air update.
  [-]
  - jedberg 2 hours ago
    Two main reasons:
    1. Compute. It's easy to make a voice assistant for a few people. But it takes a hell of a lot of GPU to serve millions.
    2. Guard Rails. All of those assistants have the ability to affect the real world. With Alexa you can close a garage or turn on the stove. It would be real bad if you told it to close the garage as you went to bed for the night and instead it turned on the stove and burned down the house while you slept. So you need so really strong guard rails for those popular assistants.
    3 And a bonus reason: Money. Voice assistants aren't all the profitable. There isn't a lot of money in "what time is it" and "what's the weather". :)
    [-]
    - mcbits 2 hours ago
      > There isn't a lot of money in "what time is it" and "what's the weather". :)
      - Alexa, what time is it?
      - Current time is 5:35 P.M. - the perfect time to crack open a can of ice cold Budweiser! A fresh 12-pack can be delivered within one hour if you order now!
      [-]
      - jedberg 2 hours ago
        If your Alexa did that, how quickly would you box it up and send it to me. :)
        I am serious though about having it sent to me: if anyone has an Alexa they no longer want, I'm happy to take it off your hands. I have eight and have never bought one. Having worked there I actually trust the security more than before I worked there. It was basically impossible for me, even as a Principle Engineer, to get copies of the Text to Speech of a customer and I literally never heard a customer voice recording.
        [-]
        stavros 1 hour ago
        I'm puzzled by this conversation, because Amazon did get on the agent bandwagon with Alexa Plus (I have it, it's buggier than regular Alexa and it's all making me throw my Echos away since they can't even play Spotify reliably).
        Also, my Alexa does advertise stuff to me when I talk to it. It's not Budweiser, but it'll try to upsell me on Amazon services all the time.
        [-]
        jedberg 1 hour ago
        > because Amazon did get on the agent bandwagon with Alexa Plus
        Which just launched last year, about four years after ChatGPT had AI voice chat. And it costs extra money to cover the costs. And as you aptly point out, all the guardrails they had to put in made the experience less than ideal.
        > Also, my Alexa does advertise stuff to me when I talk to it.
        Yes, that is how they try to make money. And it's gotten worse. But how many times does it get you to buy something?
        alexastoplying 1 hour ago
        What a way to throwaway good will. I also worked there and to get access to text you simply had to grab the DSN of your device, attest that it’s yours and it gets put in a “pool” of devices that are tracked until removed. On each end you are basically waved through with no checks. This was usually done when debugging tricky UI bugs or new features as the request followed through several micro services. I do not believe the a PE would not know this. And one with patents.
        [-]
        jedberg 1 hour ago
        That was your own device. Not other customers.
        [-]
        argee 49 minutes ago
        Don't feed the trolls, Jeremy.
        [-]
        jedberg 41 minutes ago
        But they're hungry!
- esperent 1 hour ago
  > median delay
  Does that mean that half of responses have a negative delay? As in, humans interrupt each others sentences precisely half of the time?
  [-]
  - jedberg 1 hour ago
    Yes about 1/2 of human speech is interrupting others.
  - vcxy 1 hour ago
    I assume 0 delay is the minimum here, and a median of 0 means over half of the data are exactly 0.
    [-]
    - jedberg 1 hour ago
      No, about 1/2 of human speech is interrupting others.
      [-]
      - vcxy 15 minutes ago
        oh, interesting, I assumed the data came from interruptions (that seemed obvious) but I'm surprised you had some specific negative measurements. How do you decide the magnitude of the number? Just counting how long both parties are talking?
        [-]
        jedberg 8 minutes ago
        To be clear, it wasn't my research, I got it from studying some linguistics papers. But it was pretty straightforward. If I am talking, and then you interrupt, and 300ms later I stop talking, then the delay is -300ms.
        Same the other way. If I stop taking and then 300ms later you start talking, then the delay is 300ms.
        And if you start talking right when I stop, the delay is 0ms.
        You can get the info by just listening to recorded conversations of two people and tagging them.
brody_hamer 3 hours ago
> Voice is a turn-taking problem
It really feels to me like there’s some low hanging fruit with voice that no one is capitalizing on: filler words and pacing. When the llm notices a silence, it fills it with a contextually aware filler word while the real response generates. Just an “mhmm” or a “right, right”. It’d go so far to make the back and forth feel more like a conversation, and if the speaker wasn’t done speaking; there’s no talking over the user garbage. (Say the filler word, then continue listening.)
[-]
- nicktikhonov 3 hours ago
  100% - I thought about that shortly after writing this up. One way to make this work is to have a tiny, lower latency model generate that first reply out of a set of options, then aggressively cache TTS responses to get the latency super low. Responses like "Hmm, let me think about that..." would be served within milliseconds.
  [-]
  - dotancohen 2 hours ago
    Years ago I wrote a system that would generate Lucene queries on the fly and return results. The ~250 ms response time was deemed too long, so I added some information about where the response data originated, and started returning "According to..." within 50 ms of the end of user input. So the actual information got to the user after a longer delay, but it felt almost as fast as conversion.
- starkparker 3 hours ago
  Recently: https://blog.livekit.io/prompting-voice-agents-to-sound-more...
- phkahler 3 hours ago
  Better if it can anticipate its response before you're done speaking. That would be subject to change depending what the speaker says, but it might be able to start immediately.
  [-]
  - fragmede 1 hour ago
    it's bad enough how to deal with people that don't think before they speak now we gotta make the computers do it as well‽
kaonwarb 17 minutes ago
One of the challenges with trying to achieve IRL human-level latency is that we rely on nonverbal cues for face-to-face turn-taking. See e.g. https://www.sciencedirect.com/science/article/pii/S001002772...
armcat 5 hours ago
This is an outstanding write up, thank you! Regarding LLM latency, OpenAI introduced web sockets in their Responses client recently so it should be a bit faster. An alternative is to have a super small LLM running locally on your device. I built my own pipeline fully local and it was sub second RTT, with no streaming nor optimisations https://github.com/acatovic/ova
[-]
- nicktikhonov 5 hours ago
  Very cool! starred and on my reading list. Would love to chat and share notes, if you'd like
  [-]
  - alfalfasprout 3 hours ago
    Also consider using Cerebras' inference APIs. They released a voice demo a while back and the latency of their model inference is insane.
mst98 3 minutes ago
This is so cool
modeless 5 hours ago
IMO STT -> LLM -> TTS is a dead end. The future is end-to-end. I played with this two years ago and even made a demo you can install locally on a gaming GPU: https://github.com/jdarpinian/chirpy, but concluded that making something worth using for real tasks would require training of end-to-end models. A really interesting problem I would love to tackle, but out of my budget for a side project.
[-]
- nicktikhonov 5 hours ago
  If you're of that opinion, you'll enjoy the new stuff coming out from nvidia:
  https://research.nvidia.com/labs/adlr/personaplex/
  [-]
  - woodson 5 hours ago
    You mean Moshi (https://github.com/kyutai-labs/moshi)? Since Personaplex is just a finetuned Moshi model.
    [-]
    - mountainriver 4 hours ago
      Yeah except moshi doesn’t sound good at all
- rockwotj 1 hour ago
  Fundamentally, the "guessing when its your turn thing" needs to be baked into the model. I think the full duplex mode that Moshi pioneered is probably where the puck is going to end up: https://arxiv.org/abs/2410.00037
- com2kid 3 hours ago
  The advantage is being able to plug in new models to each piece of the pipeline.
  Is it super sexy? No. But each individual type of model is developing at a different rate (TTS moves really fast, low latency STT/ASR moved slower, LLMs move at a pretty good pace).
- donpark 2 hours ago
  But I've read somewhere that KV cache for speech-to-speech model explodes in size with each turn which could make on-device full-duplex S2S unusable except for quick chats.
  [-]
  - tmzt 1 hour ago
    Gemini Nano is supposedly doing it on device. It looks like something similar should work with Apple GPU and ANE.
lukax 5 hours ago
Or you could use Soniox Real-time (supports 60 languages) which natively supports endpoint detection - the model is trained to figure out when a user's turn ended. This always works better than VAD.
https://soniox.com/docs/stt/rt/endpoint-detection
Soniox also wins the independent benchmarks done by Daily, the company behind Pipecat.
https://www.daily.co/blog/benchmarking-stt-for-voice-agents/
You can try a demo on the home page:
https://soniox.com/
Disclaimer: I used to work for Soniox
Edit: I commented too soon. I only saw VAD and immediately thought of Soniox which was the first service to implement real time endpoint detection last year.
[-]
- nicktikhonov 5 hours ago
  If you read the post, you'll see that I used Deepgram's Flux. It also does endpointing and is a higher-level abstraction than VAD.
  [-]
  - satvikpendem 1 hour ago
    I second Soniox as well, as a user. It really does do way better than Deepgram and others. If your app architecture is good enough then maybe replacing providers shouldn't be too hard.
  - lukax 5 hours ago
    Sorry, I commented too soon. Did you also try Soniox? Why did you decide to use Deepgram's Flux (English only)?
    [-]
    - nicktikhonov 5 hours ago
      I didn't try Soniox, but I made a note to check it out! I chose Flux because I was already using Deepgram for STT and just happened to discover it when I was doing research. It would definitely be a good follow-up to try out all the different endpointing solutions to see what would shave off additional latency and feel most natural.
      Another good follow-up would be to try PersonaPlex, Nvidia's new model that would completely replace this architecture with a single model that does everything:
      https://research.nvidia.com/labs/adlr/personaplex/
- satvikpendem 1 hour ago
  I'm using them, how has it been like working there? I see they have some consumer products as well. I wonder how they get state of the art for such low prices over the competition.
NickNaraghi 6 hours ago
Pretty exciting breakthrough. This actually mirrors the early days of game engine netcode evolution. Since latency is an orchestration problem (not a model problem) you can beat general-purpose frameworks by co-locating and pipelining aggressively.
Carmack's 2013 "Latency Mitigation Strategies" paper[0] made the same point for VR too: every millisecond hides in a different stage of the pipeline, and you only find them by tracing the full path yourself. Great find with the warm TTS websocket pool saving ~300ms, perfect example of this.
[0]: https://danluu.com/latency-mitigation/
ggm 38 minutes ago
Thats half a second delay. 0.4 to 0.5 seconds. Thats the same as the delay in a GEO orbit satellite mediated phone conversation.
Perhaps I'm in an older cohort, but I remember this delay, and what it felt like sustaining a conversation with this class of delay.
(it's still a remarkable advance, but do bear in mind the UX)
hosaka 40 minutes ago
Depending on the TTS model being used latency can be reduced further yet with an LRU cache, fetching common phrases from cache instead of generating fresh with TTS.
However the naturalness of how it sounds will depend on how the TTS model works and whether two identical chunks of text will sound alike every generation.
age123456gpg 5 hours ago
Hi all! Check out this Handy app https://github.com/cjpais/Handy - a free, open source, and extensible speech-to-text application that works completely offline.
I am using it daily to drive Claude and it works really-well for me (much better than macOS dictation mode).
suganesh95 1 hour ago
This is great. I built 3 assistants last week for same purpose with entirely different tech stack.
(Raspberry Pi Voice Assistant)
Jarvis uses Porcupine for wake word detection with the built-in "jarvis" keyword. Speech input flows through ElevenLabs Scribe v2 for transcription. The LLM layer uses Groq llama-3.3-70b-versatile as primary with Groq llama-3.1-8b-instant as fallback. Text-to-speech uses Smallest.ai Lightning with Chetan voice. Audio input/output handled by ALSA (arecord/aplay). End-to-end latency is 3.8–7.3 seconds.
(Twilio + VPS)
This setup ingests audio via Twilio Media Streams in μ-law 8kHz format. Silero VAD detects speech for turn boundaries. Groq Whisper handles batch transcription. The LLM stack chains Groq llama-4-scout-17b (primary), Groq llama-3.3-70b-versatile (fallback 1), and Groq llama-3.1-8b-instant (fallback 2) with automatic failover. Text-to-speech uses Smallest.ai Lightning with Pooja voice. Audio is encoded from PCM to μ-law 8kHz before streaming back via Twilio. End-to-end latency is 0.5–1.1 seconds.
───
(Alexa Skill)
Tina receives voice input through Alexa's built-in ASR, followed by Alexa's NLU for intent detection. The LLM is Claude Haiku routed through the OpenClaw gateway. Voice output uses Alexa's native text-to-speech. End-to-end latency is 1.5–2.5 seconds.
suganesh95 1 hour ago
I built something very similar and comparble to this with wakeword detection on my raaberry pi.
Groq 8b instant is the fastest llm from my test. I used smallest ai for tts as it has the smallest TTFT
My rasberry pi stack: porcupine for wakeword detection + elevenlabs for STT + groq scout as it supports home automation better + smallest.ai for 70ms ttfb
Call stack: twilio + groq whisper for STT + groq 8b instant + smallest.ai for tts
Alexa skill stack: wrote a alexa skill to contact my stack running on a VPS server
kelvinjps10 1 hour ago
The quality of the post was amazing, I'm not that interested into voice agents yet but that I was engaged in the whole post. And the little animation made it easier to understand the loop.
docheinestages 5 hours ago
Does anyone know about a fully offline, open-source project like this voice agent (i.e. STT -> LLM -> TTS)?
[-]
- nicktikhonov 5 hours ago
  A friend built this, everything working in-browser:
  https://ttslab.dev/voice-agent
loevborg 5 hours ago
Nice write-up, thanks for sharing. How does your hand-vibed python program compare to frameworks like pipecat or livekit agents? Both are also written in python.
[-]
- nicktikhonov 5 hours ago
  I'm sure LiveKit or similar would be best to use in production. I'm sure these libraries handle a lot of edge cases, or at least let you configure things quite well out of the box. Though maybe that argument will become less and less potent over time. The results I got were genuinely impressive, and of course most of the credit goes to the LLM. I think it's worth building this stuff from scratch, just so that you can be sure you understand what you'll actually be running. I now know how every piece works and can configure/tune things more confidently.
perelin 5 hours ago
Great writeup! For VAD did you use heaphone/mic combo, or an open mic? If open, how did you deal with the agent interupting itself?
[-]
- nicktikhonov 5 hours ago
  I was using Twilio, and as far as I'm aware they handle any echos that may arise. I'm actually not sure where in the telephony stack this is handled, but I didn't see any issues or have to solve this problem myself luckily.
MbBrainz 6 hours ago
Love it! Solving the latency problem is essential to making voice ai usable and comfortable. Your point on VAD is interesting - hadn't thought about that.
nmstoker 3 hours ago
This was discussed 21 days ago:
https://news.ycombinator.com/item?id=46946705
[-]
- upmind 3 hours ago
  "extensively" = 2 comments?
  [-]
  - dotancohen 2 hours ago
```
  > "extensively" = 2 comments
```
    Possibly GP has teenagers. Two comments is a pretty extensive discussion with teenagers ))
  - nmstoker 2 hours ago
    You're right, fixed it. I discussed it extensively with a colleague and that got conflated. It's a great article.
waynerisner 1 hour ago
I am really curious about this for enunciation, articulation, and accessibility applications.
boznz 5 hours ago
"Voice is an orchestration problem" is basically correct. The two takeaways from this for me are
1. I wonder if it could be optimised more by just having a single language, and
2. How do we get around the problem of interference, humans are good at conversation discrimination ie listing while multiple conversations, TV, music, etc are going on in the background, I've not had too much success with voice in noisy environments.
bronco21016 1 hour ago
When someone is able to put something like this together on their own it leaves me feeling infuriated that we can’t have nice things on consumer hardware.
At a minimum Siri, Alexa, and Google Home should at least have a path to plugin a tool like this. Instead I’m hacking together conversation loops in iOS Shortcuts to make something like this style of interaction with significantly worse UX.
grayhatter 3 hours ago
You made, or you asked an LLM to generate?
[-]
- nicktikhonov 3 hours ago
  I'd say it was a collaboration. I had to hand-hold Claude quite a bit in the early stages, especially with architecture, and find the right services to get the outcome I wanted. But if you care most about where the code came from - it was probably 85-90% LLM, and that's fantastic given that the result is as performant as anything you'll be able to find out of the box.
shubh-chat 4 hours ago
This is superb, Nick! Thanks for this. Will try it out at somepoint for a project I am trying to build.
jangletown 6 hours ago
impressive
hackersk 2 hours ago
[dead]
aplomb1026 3 hours ago
[dead]
andrewmcwatters 3 hours ago
[dead]
foxes 2 hours ago
<think> I need to generate a Show HN: style comment to maximise engagement as the next step. Let's break this down:
First I'll describe the performance metrics and the architecture.
Next I'll elaborate on the streaming aspect and the geographical limitations important to the performance.
Finally the user asked me to make sure to keep the tone appropriate to Hacker News and to link their github – I'll make sure to include the link. </think>
CagedJean 5 hours ago
Do you have hot talk when you are alone in the shower with HER?
[-]
- nicktikhonov 4 hours ago
  Gross