
Your AI has a voice but no character.

Text-to-speech has become remarkably good. Your AI assistant can speak with nuance, emotion, even humor. But speech is only half of how we communicate. The other half — the sounds around the words — is where character actually lives.

The missing layer

Think about a conversation with a person you trust. Now think about what makes it feel trustworthy. It's not just the words. It's the breath before the sentence. The slight pause that signals "I'm thinking about this seriously." The hum that means "I'm listening." The shift in tone that tells you the mood has changed before anyone says so explicitly.

These are non-speech sounds. They're not words, not language, not content. They're the subtext that runs alongside speech — and they carry most of the emotional information in a conversation. When they're absent, speech feels mechanical. When they're present, speech feels human.

Modern TTS can now simulate some of this — intonation, pacing, emphasis, even filler sounds like "um" and breathing pauses that make the output feel more natural. But this is mimicry, not communication. The system isn't actually breathing. It isn't actually hesitating. It's imitating human speech patterns without connecting them to any real state. The sounds are imprecise — they signal humanity without carrying meaning.

And there's a whole layer that TTS doesn't touch at all: the sounds between the utterances, the acoustic behavior around the voice, the sonic character of the system when it's not speaking.

What happens before the first word

Your AI assistant receives a query. It processes. It's about to respond. What does the user hear in that moment?

Usually: nothing. Or a loading animation with no sound. Some products bridge the gap with a thinking indicator — a generic tone that signals "processing." But it's the same tone every time, regardless of what the user said. It doesn't acknowledge the question. It doesn't signal "that's a complex one, let me think." It just means: wait. Imagine if it actually responded to the input: one tone for a casual question, another for a serious one, saying "I heard you, and I'm engaging with this specifically."

The tone before the sentence determines how the sentence is received. We learned this while developing an adaptive sound system for an automotive voice assistant. A gentle, ascending tone before the navigation instruction said "this is helpful information." A slightly more urgent tone before a warning said "pay attention." The words carried the content. The tones set the emotional frame — before a single word was spoken.
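To make the idea concrete, here is a minimal sketch of how such a pre-response tone could be selected. Everything in it is hypothetical — the classifier is a toy stand-in for a real intent or sentiment model, and the tone descriptions are placeholders for actual audio assets in a production sound system:

```python
# Hypothetical sketch: pick a pre-response tone that acknowledges the
# input, instead of playing the same generic "processing" sound every time.

def classify_query(text: str) -> str:
    """Toy classifier; a real system would use intent/sentiment models."""
    serious_markers = ("emergency", "warning", "help", "urgent")
    if any(word in text.lower() for word in serious_markers):
        return "urgent"
    # Long questions get a "let me think about this" cue.
    if text.strip().endswith("?") and len(text.split()) > 12:
        return "complex"
    return "casual"

# Tone palette: (contour, tempo) pairs standing in for real audio assets.
TONES = {
    "casual":  ("gentle ascending", "relaxed"),
    "complex": ("low sustained hum", "slow"),
    "urgent":  ("bright attention cue", "fast"),
}

def pre_response_tone(query: str) -> tuple[str, str]:
    """Return the tone to play before the spoken response begins."""
    return TONES[classify_query(query)]
```

The design point is the mapping itself: the sound played before the first word is chosen from the input, so it frames the response emotionally rather than merely filling the wait.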

What speech can't do

Even the best TTS in the world has fundamental limits that non-speech sound doesn't:

Speech demands attention. When someone talks, you have to listen. You have to process language. Non-speech sound works peripherally — you register it without effort, without turning your head, without interrupting what you're doing. For a product that needs to communicate without interrupting, this is essential.

Speech is sequential. A sentence says one thing at a time. A sound can carry multiple dimensions simultaneously — function, state, urgency, brand identity — all in a single moment. The ear processes these in parallel. Speech asks you to follow a narrative. Sound gives you a snapshot.

Speech creates a social contract. The moment your product speaks, it enters a conversation. The user feels implicitly invited — or obligated — to respond. Non-speech sound communicates without creating that contract. The product can signal "I'm here, I noticed you, I'm ready" without opening a dialogue.

Speech can't be ambient. A voice assistant that talks when there's nothing to say is unbearable. But a product's acoustic presence shouldn't switch off between utterances. Non-speech sound can maintain a subtle, continuous presence without demanding a response.

Character is voice — and everything around it

The voice itself carries enormous character. Pitch, timbre, pacing, warmth — choosing and shaping the right voice for a product is a design decision as consequential as choosing its visual identity. This is where theatre experience and vocal direction matter: understanding how a voice creates trust, authority, or intimacy is a craft with centuries of history.

But the voice only speaks some of the time. The non-speech layer speaks all of the time. The way the product acknowledges your presence before it says a word. The way it signals that it's processing. The way it transitions between states. The way it fills the silence when there's nothing to say. Voice and non-speech sound together — designed as one coherent system — are what give an AI product a character that people can actually feel.

Your AI has a voice. Does it have a character?

We design the non-speech layer of intelligent products — the sounds around the words that give AI systems character, brand identity, and emotional presence. From concept to embedded system.

