How to Evaluate Voice AI (when every demo looks the same)
Dear Reader,
If you’re a CX, growth, or tech leader at a consumer brand, you’ve probably seen dozens of voice AI pitches this year. Every week brings a new “Voice AI market map.” Every day, another cold email about “the next generation voice agent.”
The demos are slick. The claims are bold.
But when it comes to evaluating which platform will actually move your numbers, most teams are flying blind. They over-index on “does it sound human?” and under-index on everything that matters in production, once the voice AI is in front of real customers.
Apurv Agrawal, Co-founder and CEO of SquadStack.ai (Blume Fund II), has written a sharp, practical framework for evaluating Voice AI properly. Blume Ventures helped put it together.
Here are the three most important takeaways (even if you don’t read the full write-up):
Start with the job, not the vendor
Before you touch a deck or a demo, ask: What exact job are we hiring Voice AI to do? Are we replacing something humans do today? Augmenting them? Or doing something we couldn’t do before — like proactive outbound at real scale? Without a clear job-to-be-done, every pitch sounds good, every pilot looks “promising,” and nothing scales.
“Sounds human” is the wrong first filter
Most teams over-index on whether the AI sounds human in the first 10 seconds. That matters, but naturalness is about much more than a first impression: can it handle interruptions, language switching, background noise, a frustrated user versus a curious one? The real test is blind listening tests where your team tries to spot the AI, alongside hard metrics like latency (under 0.8 seconds) and ADR (abruptly disconnected rate). A minimal sketch of how you might score a pilot batch of calls against those two metrics follows below.
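The sketch below is illustrative only, not from Apurv’s framework: the record fields (response_latency_ms, ended_abruptly) are hypothetical, so map them to whatever your telephony or voice platform actually exports.

```python
# Minimal sketch: scoring a batch of pilot calls against the two hard metrics
# mentioned above. Field names are hypothetical placeholders.

from dataclasses import dataclass
from statistics import quantiles


@dataclass
class CallRecord:
    response_latency_ms: list[int]  # per-turn response latencies for one call
    ended_abruptly: bool            # call dropped mid-conversation


def evaluate_pilot(calls: list[CallRecord],
                   latency_target_ms: int = 800) -> dict:
    """Return p95 turn latency and ADR (abruptly disconnected rate) for a pilot."""
    all_latencies = [ms for c in calls for ms in c.response_latency_ms]
    p95 = quantiles(all_latencies, n=20)[-1]  # 95th percentile of turn latency
    adr = sum(c.ended_abruptly for c in calls) / len(calls)
    return {
        "p95_latency_ms": p95,
        "meets_latency_target": p95 <= latency_target_ms,
        "adr": adr,
    }


# Example with two toy calls
print(evaluate_pilot([
    CallRecord(response_latency_ms=[620, 740, 810], ended_abruptly=False),
    CallRecord(response_latency_ms=[500, 690], ended_abruptly=True),
]))
```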
The one question that cuts through the noise
Ask every vendor: “Can you show me one live use case where you’re beating a human baseline on the metric that matters — with numbers?” Not a recorded demo. Not a prototype. A live, in-production funnel with before/after data, time frame, and sample size. If they can’t show this, they might be early — fine for experiments, risky for core business flows.
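If a vendor does hand over before/after numbers, it is worth checking that the lift isn’t just noise given the sample size. Here is a small, hedged sketch (the counts are invented) using a two-proportion z-test for that sanity check:

```python
# Minimal sketch: sanity-checking a "we beat the human baseline" claim.
# Given before/after conversion counts (hypothetical numbers below), a simple
# two-proportion z-test indicates whether the lift could plausibly be noise.

from math import sqrt
from statistics import NormalDist


def lift_is_significant(human_conv: int, human_calls: int,
                        ai_conv: int, ai_calls: int,
                        alpha: float = 0.05) -> tuple[float, bool]:
    """Return (p_value, significant) comparing AI vs. human conversion rates."""
    p_human = human_conv / human_calls
    p_ai = ai_conv / ai_calls
    pooled = (human_conv + ai_conv) / (human_calls + ai_calls)
    se = sqrt(pooled * (1 - pooled) * (1 / human_calls + 1 / ai_calls))
    z = (p_ai - p_human) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test
    return p_value, p_value < alpha


# Example: human agents converted 180/2000 calls, the voice AI 240/2000
p, ok = lift_is_significant(180, 2000, 240, 2000)
print(f"p-value={p:.4f}, significant={ok}")
```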
If you’re evaluating Voice AI for sales, support, or collections, this is a must-read before you start the long, messy journey of finding the right partner.
Read the full framework → https://blume.vc/commentaries/how-to-evaluate-voice-ai-platforms
We hope you enjoy reading Apurv’s take on finding the right voice AI partner for you.
Team Blume


