50 AI Voices: Which Reigns Supreme?

Text-to-speech (TTS) tech is getting crazy good, and I’m on a mission to compare every voice available to me. Let’s start with three I’ve been using lately:

  • Amazon Polly, historically a quick, cheap workhorse for informational speech.

  • ElevenLabs, which has led on quality and expressiveness.

  • OpenAI Text-to-Speech, six brand-spanking-new voices from the creators of ChatGPT!

Here’s as exhaustive a survey as I could manage of all* the voices for these three services, posted here because someone suggested I stick ‘em all in one place (this one’s for you, Ethan!). Scroll on down if you just want to hear the samples.

(*Amazon Polly neural voices, ElevenLabs’s base voices, and OpenAI’s 6 new ones.)

First: A Bit of History

Text-to-speech (TTS) tech for consumers has come a long way. I remember playing with early attempts—the TI 99/4A had an add-on hardware module that offered pretty basic, robotic speech:

With the Alien Voice Box, the Atari 800 offered an improvement:

Flash forward a handful of decades, and by the 2010s, TTS voices had improved a ton. Here's the original Apple Siri commercial:

As a game developer who’s done a lot of work with procedural/dynamic narrative, I’m excited about the prospect of characters giving performances that won’t break the player out of the moment. Here in the far future of 2023, AI/ML TTS offers more varied voices and more natural, expressive speech:

While not perfect, that’s not too darned shabby for a computer. This was ElevenLabs, fed with straight text—no information about delivery or nothin’.

Three TTS Services

I picked these services because they’re reasonably priced, robust, and have APIs for programmatic speech generation. They’re all speaking the same phrase:

Welcome, weary traveler, to the tender embrace of tonight's moonlight serenade, a soothing lullaby caressed by the whispers of distant crickets. The veil of the night is drawn, and the world is drenched in an iridescent silver hue. The moon, a radiant orb, hangs in the sapphire sky, casting shadows and illusions while weaving a tapestry of tranquility.

That’s a kinda-nonsense ASMR script, courtesy of ChatGPT-4.

First: Amazon Polly Neural

Amazon Polly is inexpensive at $16 per million characters, and it’s quick: each of the following took less than a second to return an OGG. I’m exclusively using their “Neural” voices, which Amazon claims are their best-quality ones:

The voice stumbles on the intro (“Welcome… Weary traveler…”), but is otherwise not too bad. Here are the rest of their English language Neural voices:

I like Polly. It’s quick—it took less than a half second round-trip for each of those. It’s historically been one of the cheaper options, at $0.000016/character input. And the voice isn’t super expressive, but it’s solid enough to get a point across.

I’d use Amazon Polly whenever I wanted to generate a ton of informational text quickly.
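
For the curious, here’s roughly what a Polly request like the ones above could look like in Python. This is a minimal sketch, not necessarily how the samples here were generated: it assumes the boto3 library is installed and AWS credentials and a region are already configured. The helper names, the Joanna voice, and the output filename are illustrative choices of mine.

```python
import time


def build_polly_request(text, voice_id="Joanna"):
    """Build the keyword arguments for Polly's synthesize_speech call."""
    return {
        "Engine": "neural",            # request a Neural voice
        "OutputFormat": "ogg_vorbis",  # ask for OGG audio back
        "Text": text,
        "VoiceId": voice_id,           # assumption: a neural-capable voice
    }


def synthesize_with_polly(text, voice_id="Joanna", out_path="polly.ogg"):
    """Call Polly, save the audio, and return the round-trip time in seconds."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("polly")
    start = time.perf_counter()
    response = client.synthesize_speech(**build_polly_request(text, voice_id))
    audio = response["AudioStream"].read()  # streaming body with the audio bytes
    elapsed = time.perf_counter() - start

    with open(out_path, "wb") as f:
        f.write(audio)
    return elapsed
```

Calling synthesize_with_polly("Welcome, weary traveler...") would write the audio to polly.ogg and hand back the elapsed time, which is one way to check the sub-second round-trips for yourself.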

Second: OpenAI’s Six New Voices

Hot off the presses, OpenAI, creators of ChatGPT, have their first batch of a half dozen voices:

They’ve priced this at $0.015 per 1k characters. To my ears, these sound pretty good. We're still getting a weird pause after "Welcome," but each of those has a solid cadence, and Fable even does a passable Daniel Radcliffe.

They still strike me as being most appropriate for narrative/information—here’s their Fable voice with the line that ElevenLabs Elli did, above:

Fable sounds like he’s reading a script about being furious, but is not actually very angry at all.
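
If you want to try the OpenAI voices yourself, a request can be sketched with nothing but Python’s standard library. Treat this as a rough example under assumptions: it expects your API key in an OPENAI_API_KEY environment variable, uses the tts-1 model, and the function names and output filename are made up for illustration.

```python
import json
import os
import urllib.request

OPENAI_SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_openai_request(text, voice="fable", api_key=None):
    """Build (but don't send) a POST request to OpenAI's speech endpoint."""
    # Assumption: the key lives in the OPENAI_API_KEY environment variable.
    api_key = api_key or os.environ.get("OPENAI_API_KEY", "")
    body = {"model": "tts-1", "input": text, "voice": voice}
    return urllib.request.Request(
        OPENAI_SPEECH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_with_openai(text, voice="fable", out_path="openai.mp3"):
    """Send the request and save the returned audio (MP3 by default)."""
    with urllib.request.urlopen(build_openai_request(text, voice)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

Swapping the voice argument for alloy, echo, onyx, nova, or shimmer gets you the other five voices.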

Third: ElevenLabs

ElevenLabs is my pick for the most expressive of the three, with the most natural/interesting delivery and the greatest breadth of voices:

Right off the bat, the initial pause in “Welcome, weary traveler…” is more natural.

Fin is particularly expressive. There’s movement in both pitch and volume, the pauses are at the right places, and you can even hear breaths. There’s some artifacting in the audio, but overall, this is fantastic.

As with Fin, there’s lots of expression in Giovanni’s performance.

There’s much variation in style and quality among the ElevenLabs voices—Grace is kinda reading this like a laundry list.

I love the deep resonance of the Michael voice. I’d listen to an audiobook of this.

There’s some noise here, but I like that Nicole’s doing a whispered ASMR thing. It shows ElevenLabs’s range.

Serena sounds extremely overdriven; I’d consider this one straight up unusable.

Not all of the voices are great (Josh, Rachel, and Grace are hamfistedly reading from a script, and Freya is too harsh), and that comma after "Welcome" still trips some of them up. But they all seem to have different personalities. Fin is giving it everything he's got—there's a TON of expression in there. Same with Giovanni.

The pricing structure is different than with Polly or OpenAI: for the above, I’m using their $22/mo subscription, which allows me to generate 100k characters in a month. That’s $0.00022 per character, about 14x as expensive as Polly or OpenAI. And it’s slower, too: it took around 2s to generate each of those. But to my ears, it’s the highest-quality, most expressive of the bunch.
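
To make the price comparison concrete, here’s the per-character arithmetic as a small Python sketch. The prices are the ones quoted in this post; the ElevenLabs figure assumes you use the full 100k-character monthly quota (use less and the effective rate goes up). The constant and function names are just for illustration.

```python
# Prices as quoted in this post, converted to USD per character.
POLLY_PER_CHAR = 16 / 1_000_000      # $16 per million characters
OPENAI_PER_CHAR = 0.015 / 1_000      # $0.015 per 1k characters
ELEVENLABS_PER_CHAR = 22 / 100_000   # $22/mo for 100k characters (full quota used)


def cost_ratio(price, baseline):
    """How many times more expensive `price` is than `baseline`."""
    return price / baseline


def cost_for(text, per_char):
    """Cost in USD to synthesize `text` at a given per-character rate."""
    return len(text) * per_char


if __name__ == "__main__":
    print(f"ElevenLabs vs Polly:  {cost_ratio(ELEVENLABS_PER_CHAR, POLLY_PER_CHAR):.2f}x")
    print(f"ElevenLabs vs OpenAI: {cost_ratio(ELEVENLABS_PER_CHAR, OPENAI_PER_CHAR):.2f}x")
```

The ratios work out to 13.75x against Polly and roughly 14.67x against OpenAI, which is where the "about 14x" figure comes from.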

And One More…

ElevenLabs allows you to roll the dice and create a voice at random. To me, this one sounds like Phil Hartman as a Midwestern sports announcer attempting an English accent.
