Sesame AI: Advancing Conversational Voice Technology

The quest for truly human-like AI voice interaction has long been a challenge in the field of artificial intelligence. Traditional voice assistants often sound robotic, lack emotional range, and fail to maintain conversational context. Sesame AI is one of the companies working to overcome these limitations through their innovative approach to voice technology.

The Voice Presence Challenge

Sesame AI, founded by Brendan Iribe and a team of AI researchers, has defined their mission as achieving what they call "voice presence": the quality that makes spoken interactions feel real, understood, and valued. They argue that voice is our most intimate medium as humans, carrying layers of meaning through countless variations in tone, pitch, rhythm, and emotion. According to the company, today's digital voice assistants lack essential qualities for true usefulness. A personal assistant that speaks only in a neutral tone struggles to find a permanent place in users' daily lives after the initial novelty wears off, and this emotional flatness ultimately becomes exhausting rather than engaging.

The Conversational Speech Model Approach

At the heart of Sesame AI's technology is their Conversational Speech Model (CSM), which frames voice generation as an end-to-end multimodal learning task using transformers. Unlike traditional text-to-speech models that generate spoken output directly from text without contextual awareness, CSM leverages the history of conversations to produce more natural and coherent speech. The technical architecture of CSM includes:

A multimodal design that processes both text and speech
Two autoregressive transformers based on the Llama architecture
A split approach with a multimodal backbone and audio decoder
Processing of interleaved text and audio data

This architecture allows CSM to maintain awareness of conversational context and generate speech that appropriately reflects the flow of dialogue.
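
To make the idea of interleaved text and audio concrete, here is a minimal sketch of how a conversation history might be flattened into a single sequence for an autoregressive model. It is an illustration only, not Sesame AI's code: the Turn class, the token ids, and the interleave and build_prompt helpers are all hypothetical, and the real tokenizers and sequence layout may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    """One utterance in a conversation (hypothetical structure)."""
    speaker: int             # illustrative speaker id
    text_tokens: List[int]   # ids from some text tokenizer (assumed)
    audio_tokens: List[int]  # ids from some audio tokenizer (assumed)


# Each element records its modality, who spoke, and the token id.
Token = Tuple[str, int, int]


def interleave(history: List[Turn]) -> List[Token]:
    """Flatten past turns into one sequence: text, then audio, per turn."""
    sequence: List[Token] = []
    for turn in history:
        sequence.extend(("text", turn.speaker, t) for t in turn.text_tokens)
        sequence.extend(("audio", turn.speaker, a) for a in turn.audio_tokens)
    return sequence


def build_prompt(history: List[Turn], speaker: int,
                 next_text_tokens: List[int]) -> List[Token]:
    """Context for the next utterance: prior turns plus the new text.

    The audio for the new text is what a model would generate next, so the
    earlier turns' audio can influence how the new line is spoken.
    """
    prompt = interleave(history)
    prompt.extend(("text", speaker, t) for t in next_text_tokens)
    return prompt


# Toy conversation with made-up token ids.
history = [
    Turn(speaker=0, text_tokens=[1, 2], audio_tokens=[10, 11, 12]),
    Turn(speaker=1, text_tokens=[3], audio_tokens=[13]),
]
prompt = build_prompt(history, speaker=0, next_text_tokens=[4, 5])
```

The point of the sketch is the contrast with a context-free text-to-speech system, which would see only the final text tokens and none of the earlier turns.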

Current Capabilities and Limitations

Sesame AI offers a demonstration of their technology through voice assistants named Maya and Miles, available on their website. These demos showcase their progress in creating more expressive and contextually appropriate voice interactions. However, the company acknowledges several current limitations:

CSM is primarily trained on English data, with limited multilingual capabilities
It doesn't yet leverage pre-trained language models to their full potential
While it generates high-quality conversational prosody, it cannot fully model the structure of conversations (turn-taking, pacing, etc.)

In subjective testing using Comparative Mean Opinion Score (CMOS) studies, CSM-generated speech achieved parity with human recordings in terms of perceived naturalness when evaluated without context. However, when conversational context was included, human evaluators still preferred the original recordings, indicating room for improvement in context-sensitive speech generation.
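
For readers unfamiliar with CMOS, the sketch below shows one common way such a score can be summarized. It rests on assumptions not stated in the source: a rating scale from -3 to +3 where positive values mean the generated clip was preferred, a normal-approximation confidence interval, and "parity" read as an interval that contains zero. The ratings are made-up toy values, not Sesame AI's data.

```python
import math
import statistics
from typing import List, Tuple


def cmos(ratings: List[int]) -> Tuple[float, float]:
    """Return (mean rating, 95% confidence half-width).

    Each rating compares a generated clip with a human recording on an
    assumed -3..+3 scale; positive means the generated clip was preferred.
    """
    mean = statistics.mean(ratings)
    # Normal-approximation interval; a real study might use another method.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def at_parity(ratings: List[int]) -> bool:
    """Treat the two systems as indistinguishable if the interval covers 0."""
    mean, half_width = cmos(ratings)
    return abs(mean) <= half_width


# Made-up toy ratings for illustration only.
no_context = [1, -1, 0, 0, 1, -1, 0, 0]
with_context = [-1, -2, -1, 0, -1, -2, -1, -1]

parity_without_context = at_parity(no_context)
parity_with_context = at_parity(with_context)
```

With these toy numbers the first set centers on zero while the second leans clearly negative, which mirrors the qualitative pattern described above: no clear preference without context, and a preference for the human recording once context is supplied.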

Open Source Commitment

Sesame AI has stated a commitment to open-sourcing key components of their research under the Apache 2.0 license. This approach aims to enable the wider community to experiment with, build upon, and improve their technology. This collaborative stance reflects an understanding that advancing conversational AI requires coordinated effort across the field.

Future Directions

According to their research publications, Sesame AI's future work will focus on:

Scaling up model size and dataset volume
Expanding language support to over 20 languages
Exploring integration with pre-trained language models
Moving toward fully duplex models that can implicitly learn conversational dynamics

The company believes that truly natural voice interactions will require fundamental changes across the entire AI stack, from data curation to post-training methodologies.

Conclusion

Sesame AI represents an interesting approach to solving the limitations of current voice technology by focusing specifically on the conversational aspects of speech generation. While their technology is still evolving, their emphasis on achieving "voice presence" highlights an important direction for the future of voice AI. As voice interfaces become increasingly important in our interactions with technology, approaches like Sesame AI's that prioritize natural, contextually appropriate speech will be crucial in creating systems that feel genuinely helpful rather than merely functional.