Together AI is making text-to-speech (TTS) models available on its AI Native Cloud platform, providing customers with exclusive access to top voice models on dedicated infrastructure.
The newest model available on Together AI is MiniMax Speech 2.6 Turbo.
Building a real-time voice agent usually forces an ugly choice: ship a voice that sounds convincingly human, or ship a voice that responds instantly and holds up in production. Most teams split the difference with a patchwork of providers: one for showcase experiences, another for low-latency turns, and others for cloning or global language coverage. Over time that patchwork becomes the product. Behavior diverges by market, latency and quality drift, and “upgrade the voice” turns into a cross-vendor infrastructure project instead of a product decision.
Starting today, Together AI, the AI Native Cloud, is the only platform where you can run MiniMax Speech 2.6 Turbo alongside your LLM and STT workloads on secure global infrastructure with SOC 2 Type II and HIPAA compliance support.
Naturalness and speed live on one platform instead of being traded off across vendors. MiniMax Speech 2.6 Turbo is benchmarked at the top of public TTS leaderboards, built by the team behind Talkie (150 million users averaging 90-plus minute sessions), and trained for real conversational interaction. Conversational latency stops being an integration tax—you get a single production surface for streaming delivery, capacity, and debugging with one API, one auth, and unified metrics, so the voice quality that wins users is the same voice stack you can reliably ship.
MiniMax ranks at the top of the Artificial Analysis Arena in blind human evaluation. The model is trained on Talkie’s conversation data, where 150 million users chose extended engagement with AI voice, with sessions averaging more than 90 minutes. Most TTS models train on audiobook readers and podcast hosts. MiniMax learned from real conversation, which produces different prosody, pacing, and emotional range.
Teams building AI native voice products choose models where voice quality directly drives completion rates. A customer service agent can have correct intent recognition and LLM reasoning, but synthetic delivery causes disengagement. MiniMax is now available on Together AI with guaranteed performance isolation and infrastructure reliability that supports production workloads at scale.
MiniMax Speech 2.6 Turbo achieves sub-250ms latency on Together AI dedicated endpoints. When TTS runs alongside LLM and STT workloads on the same infrastructure, you eliminate cross-vendor network overhead. The complete pipeline from speech recognition through reasoning to synthesis stays fast enough for real-time conversation.
The model delivers native-quality speech across more than 40 major global languages with streaming inline language switching. It can move among English, Japanese, Spanish, Mandarin, French, and German mid-sentence with authentic accents. It detects language boundaries and switches with native pronunciation in real time.
Voice quality determines completion rates for audiobooks, e-learning courses, and podcast narration. Talkie’s 90-plus minute average sessions demonstrate that MiniMax voices hold attention. Ten-second voice cloning means one narrator voice scales across 40-plus languages with native pronunciation. Deploy content generation workloads on Together AI infrastructure with the same reliability and observability as your other AI workloads.
The model also suits character voices for games, interactive fiction, and virtual companions. MiniMax Speech 2.6 Turbo delivered the expressiveness that made Talkie successful with 150 million users. Automatic emotional intelligence means characters respond naturally to conversation context. Ten-second cloning enables rapid prototyping of character voices. Deploy gaming voice infrastructure on Together AI dedicated endpoints for guaranteed performance during traffic spikes.
MiniMax Speech 2.6 Turbo expands the elite proprietary TTS catalog on Together AI alongside Cartesia and Rime models, giving developers the most extensive selection on a single cloud, co-located with their LLM and STT workloads.
Using Together AI gives developers access to extensive multimodal models on a single platform, to maximize ease, speed-to-production, simplicity and scalability, the company said.
These models are delivered from MiniMax and Rime, offering the most natural-sounding, low-latency voice experiences on production-grade, compliant infrastructure to optimize performance, reliability, and control. They optimize a variety of use cases, including gaming (e.g. avatars), entertainment (e.g. voice dubbing, character voices), customer service and support, and content creation (e.g. ebooks and elearning), to name a few.
Rishabh Bhargava, director of machine learning at Together AI, said in an interview with GamesBeat, “At Together AI, we’re building the AI native cloud, which is purpose-built for all these high-growth AI native companies. These are companies that are early in their journey, but are already category-defining. Some of our customers are building the future of coding and software engineering, or a company like Decagon, which is building AI customer service agents at scale.”
He added, “The reason why Together AI exists is to serve these folks. Frankly, many of these companies are incredibly infrastructure-bottlenecked. You’re seeing them grow super rapidly. They have soaring compute costs, and they’re trying to do all of this scaling while building low-latency, high-performance AI applications.”
He noted that Together AI is built on a lot of deep systems, which are research-optimized for Nvidia GPUs.
“We have hardware optimization, a software stack that delivers very high speed inference, model customization that these customers want,” Bhargava said. “Voice is an incredibly exciting new frontier. There are billions of phone calls that happen in the U.S. every single day. But historically, it’s been tough for AI models. And the reason is the naturalness of voice that you need — the emotional indicators on how you talk to somebody when they’re happy or sad — that has been typically lacking for a lot of AI models.”
“Our customers, these AI natives, are building these AI applications where voice is an important part, and they come to us with the same needs, which is for us to build these AI applications and agents that are low latency, that are reliable, and where cost doesn’t break the bank,” Bhargava said.
What’s new

· MiniMax Speech 2.6 Turbo: Top ranked on Artificial Analysis Arena leaderboard. Multilingual TTS with 40+ languages, zero-shot cloning, automatic emotional awareness.
· Two enterprise-grade Rime models on Together AI: Arcana v2 for expressivity, Mist v2 for pronunciation control.
· Available now on Together AI with enterprise reliability for production-grade deployments.
High-quality text-to-speech has moved from a user-interface flourish to core infrastructure in modern AI products, so teams judge it on latency, reliability, and control rather than novelty, the company said.
The strongest models are still scattered across separate providers, each with its own API, auth, quotas, and pricing, so evaluating or combining them turns into integration work instead of configuration.
Latency and quality issues end up debugged across multiple vendor dashboards with no single view of the pipeline. Most teams either live with brittle multi-vendor voice stacks or lock into a single acceptable model because the cost of switching is too high, Together AI said.
Together AI said it is the only platform that gives developers this breadth of TTS models on a single production cloud, co-located with their LLM and STT workloads and exposed through a single API.
The company said it already provides serverless open source TTS through Orpheus and Kokoro. Today, the company is adding MiniMax Speech 2.6 Turbo and Rime, so teams can choose from cost-efficient open models to best-in-class proprietary systems without adding vendors.
Developers can build complete voice applications from speech recognition through LLM reasoning to TTS output on dedicated production capacity, with streaming over HTTP or WebSocket and autoscaling, metrics, and tracing that match the rest of their stack. All usage flows through a single control plane for auth, billing, and rate limits across LLM, STT, and TTS, so production voice becomes another knob on the same platform instead of a separate system to run.
MiniMax Speech 2.6 Turbo: Multilingual, Expressive, Real Time
MiniMax Speech 2.6 Turbo ranks at the top of Artificial Analysis Arena’s public TTS leaderboard in blind human evaluation. It delivers native-quality speech in more than 40 languages with automatic inline language switching and a time to first byte under 250 milliseconds, making it suitable for real-time conversational use. The model comes from the team behind Talkie, a conversation app with over 150 million users whose sessions average more than 90 minutes, and it has been adapted for enterprise-grade reliability.
Cross lingual voice cloning
Given a short audio sample, MiniMax can clone a voice that then speaks more than 40 languages with native accents. There is no per-language training and no manual transcription step. You can switch languages mid-sentence and maintain consistent timbre and prosody while keeping latency under 200 milliseconds. Enterprise cross-lingual cloning is available through sales.
Automatic emotional awareness
MiniMax reads semantic context and adjusts tone automatically. When your LLM outputs phrases like “I am sorry to hear that,” the model renders an empathetic delivery without prompt engineering or SSML tags. The same contextual reading powers inline language switching, so “Hello こんにちは hola” is spoken with appropriate pronunciation for each language in a single stream.
Production-ready performance:
- More than 40 languages with inline code switching
- Sub 250 millisecond time to first byte
- WebSocket streaming for real time applications
- Zero shot cloning, no fine tuning required
Rime: Enterprise Control at Scale
Arcana v2: Expressivity for enterprise conversations
Arcana v2 is deployed today from high-growth startups to Fortune 500s as part of their production infrastructure. Across these environments, customers report measurable gains including a 15% lift in sales at a national restaurant chain, a 75% reduction in call abandonment at a telecom provider, and a 10% increase in call success rates.
Trained on the largest proprietary dataset of full-duplex conversational speech data
Arcana v2 is trained on real conversations with everyday people — not audiobooks, podcasts, or voiceover announcers. The model learns natural breathing, fillers, backchannel cues, and conversational pacing from production conversations. Callers recognize these patterns and stay in the automated flow longer, improving completion and containment rates.
40+ voices and regional dialects
Arcana v2 ships with more than 40 voices across English, Spanish, French, and German. English includes 18 voices spanning U.K., Australian, and Southern US accents. Spanish includes four primary and three bilingual voices. Everyday words match local usage automatically. For example, “schedule” is pronounced “SHED-ule” in U.K. English and “SKED-ule” in U.S. English.
Mist v2: Deterministic pronunciation at production scale
Mist v2 is designed for high-volume production environments where pronunciation accuracy must be guaranteed and already powers tens of millions of production calls each month. It delivers conversational quality while prioritizing latency and throughput for real-time systems.
Production grade latency
Mist v2 reaches about X ms p50 time to first audio on Together AI dedicated endpoints. Voice agents need total end-to-end latency under 700 ms to feel conversational, which means TTS must be fast enough to leave headroom for STT and LLM processing. When you co-locate Mist v2 with LLM and STT on Together AI, the entire pipeline from speech recognition through reasoning to synthesis stays within that budget, directly improving completion rates and user satisfaction.
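The budget arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the 700 ms target comes from the text, while the stage names and the per-stage numbers are hypothetical placeholders, not measurements from Together AI, Rime, or any other vendor.

```python
from dataclasses import dataclass

# Target from the text: under 700 ms end to end feels conversational.
CONVERSATIONAL_BUDGET_MS = 700


@dataclass
class StageLatency:
    name: str
    ms: float


def tts_headroom_ms(other_stages, budget_ms=CONVERSATIONAL_BUDGET_MS):
    """Return the milliseconds left for TTS after the other stages."""
    return budget_ms - sum(stage.ms for stage in other_stages)


def fits_budget(stages, budget_ms=CONVERSATIONAL_BUDGET_MS):
    """True if the summed stage latencies stay strictly under the budget."""
    return sum(stage.ms for stage in stages) < budget_ms


# Hypothetical numbers for illustration only; measure your own pipeline.
stt = StageLatency("stt", 150)
llm = StageLatency("llm_first_token", 300)
network = StageLatency("network", 50)

headroom = tts_headroom_ms([stt, llm, network])
print(f"TTS must deliver first audio within {headroom} ms")
```

With these assumed figures, the non-TTS stages consume 500 ms, so any TTS stage slower than the remaining headroom pushes the turn past the conversational threshold.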
Conversational realism
Like Arcana v2, Mist v2 is trained on real customer service calls. It preserves natural filler words, backchanneling, breathing patterns, and pacing while maintaining production throughput. This makes it suitable for high-volume scenarios where both realism and responsiveness are required.
Deterministic pronunciation control
Most TTS models guess pronunciation on each generation. Mist v2 is deterministic. You define how a word should sound once through the API, and that pronunciation holds across more than 40 voices, flows, and channels. No retraining and no per-vendor hacks. When your agent mispronounces a product name, drug, or acronym, you correct it once and the fix applies everywhere. Deterministic pronunciation configuration for Mist v2 is available today through our Sales team for production deployments; contact Sales to enable it for your environment.
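The "define once, apply everywhere" idea can be illustrated with a generic pronunciation lexicon applied to text before synthesis. This is not Rime's API and does not describe how Mist v2 works internally; the lexicon entries, respellings, and function names below are hypothetical and only show why a fixed mapping yields the same result on every call.

```python
import re

# Hypothetical lexicon: term -> respelling. Entries are illustrative only.
LEXICON = {
    "acetaminophen": "uh-SEE-tuh-MIN-uh-fen",
    "IVR": "I V R",
}


def build_pattern(lexicon):
    """Compile one regex that matches any lexicon term as a whole word."""
    # Longest terms first so overlapping entries resolve predictably.
    terms = sorted(lexicon, key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")


def apply_lexicon(text, lexicon):
    """Replace every lexicon term with its respelling, deterministically."""
    if not lexicon:
        return text
    pattern = build_pattern(lexicon)
    return pattern.sub(lambda m: lexicon[m.group(1)], text)


prepared = apply_lexicon("Our IVR can refill acetaminophen.", LEXICON)
print(prepared)
```

Because the mapping is a plain lookup rather than a model prediction, correcting one entry changes every later request that contains the term.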
English and Spanish with advanced pronunciation control
Mist v2 supports English and Spanish with deterministic pronunciation control. You specify how brand names, medication names, or technical terms should sound through the API, and Mist renders them consistently at conversational latency. If you need deterministic pronunciation at scale in Mist v2, contact Sales to enable it for your environment.
Proven at scale
Mist v2 serves tens of millions of calls monthly in production customer service and IVR environments. These are full-scale deployments where downtime or quality regression has direct revenue and compliance impact, not limited pilots.
Broadest TTS Selection: Open Source to Elite Proprietary
Together AI is the only platform that runs open source and elite proprietary TTS alongside LLM and STT on a single production cloud, with one API and control plane. Developers get natural sounding, low latency voices for customer service, content generation, entertainment, and gaming on the same infrastructure they already use for their generative stack.
- Open source serverless (Orpheus, Kokoro): Cost efficient, fast to adopt, no dedicated infrastructure to manage.
- Proprietary elite (MiniMax, Rime): Highest quality, enterprise controls, proven at scale.
- Single auth and billing: One API key, one invoice, one set of rate limits across all models.
- Unified observability: Debug latency and errors across the entire voice pipeline from a single dashboard.
- Model flexibility: Swap TTS models with a single parameter change rather than a new integration.
- Low latency at scale: Sub 200 millisecond latency with dedicated GPU deployments and no shared inference contention.
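The model flexibility point in the list above can be pictured as a request builder in which the model is just one field. The sketch below is a minimal illustration under stated assumptions: the field names (`model`, `input`, `voice`) and both model identifiers are hypothetical, so check Together AI's documentation for the actual request schema and catalog names.

```python
import json

# Hypothetical model identifiers; real catalog names may differ.
MINIMAX = "example/minimax-speech-2.6-turbo"
MIST = "example/rime-mist-v2"


def build_tts_request(text, model, voice="default"):
    """Build a JSON-serializable TTS request body (hypothetical schema)."""
    if not text:
        raise ValueError("text must be non-empty")
    return {"model": model, "input": text, "voice": voice}


a = build_tts_request("Your order has shipped.", MINIMAX)
b = build_tts_request("Your order has shipped.", MIST)

# Collect the fields whose values differ between the two requests.
changed = {key for key in a if a[key] != b[key]}
print(json.dumps(a))
print(changed)
```

In this sketch, switching providers changes only the `model` value, which is the property the article attributes to running every TTS model behind a single API.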
The company started in 2022 and it has been growing fast. The MiniMax partnership is important as it will cover a lot of different languages.
“We’re on track to bring more and more models to our platform to give developers flexibility,” Bhargava said.
The company has over 250 people and it has raised more than $500 million to date.
“Our goal is to build the best platform to serve these AI natives. And these customers grow extremely fast. Their needs are ever increasing, and we want to be the best platform for serving these models, for inference, for customization of models,” he said.
The company said it is making headway in gaming as more and more games are becoming immersive, with deeper narratives. That requires a lot of high-quality game characters and the ability to generate speech for specific voice actors in real time.
“It’s a very interesting use case that we end up seeing,” Bhargava said. “Our belief is this is just the very beginning. These models are starting to get there. And as production workloads start to ramp up, these AI native companies start to ramp up, they will need this high-quality, reliable infrastructure that we’ve been investing in. We’re still in the early innings and these models are just starting to get ready.”
Bhargava said the company has more than a million developers who are live on the platform today, and they span a fairly wide gamut. One example is Decagon, which uses it for customer service applications.