Minimax Speech 02 HD
High-fidelity Text-to-Audio synthesis with emotional expression and multilingual support.
Model Overview
A powerful Text-to-Audio (T2A) model that excels at generating natural-sounding speech with a wide range of emotional expressions and multilingual capabilities. It's optimized for high-quality applications such as voiceovers, audiobooks, and virtual assistants.
Best At
Creating studio-quality voiceovers and audiobooks, producing natural dialogue for characters, generating multilingual content, and enabling dynamic voiceovers with emotional nuances.
Limitations / Not Good At
This model is not designed for real-time applications where extremely low latency is critical (consider the Speech-02-Turbo model for that). While it supports many languages, extremely specialized dialects or nuanced poetic readings might require fine-tuning or further testing.
Ideal Use Cases
- Professional voiceovers for videos and advertisements 🎬
- Generating audio for audiobooks and podcasts 🎧
- Creating natural-sounding dialogue for games and animations 🎮
- Building multilingual customer support bots 🌍
- Developing accessibility features for content 🔊
- Voice cloning for personalized audio experiences 👤
Input & Output Format
- Input: Text prompt, voice ID, speed, volume, pitch, emotion, language settings, and normalization options.
- Output: An audio file (URI).
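As a rough illustration of how these inputs fit together, the sketch below collects them into a plain Python dictionary using the default values listed for this node. The snake_case key names and the `build_inputs` helper are assumptions made for illustration, not a documented schema; adapt them to however your workflow passes values to the node.

```python
# Defaults copied from this node's input list. The snake_case key names
# are an assumption for illustration, not a documented schema.
DEFAULT_INPUTS = {
    "pitch": 0,
    "speed": 1,
    "volume": 1,
    "bitrate": 128000,
    "channel": "mono",
    "emotion": "auto",
    "voice_id": "Wise_Woman",
    "sample_rate": 32000,
    "language_boost": "None",
    "english_normalization": False,
}

MAX_TEXT_CHARS = 5000  # maximum text length stated for the Text input


def build_inputs(text, **overrides):
    """Hypothetical helper: merge overrides onto the defaults and attach the text."""
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(
            f"text has {len(text)} characters; the limit is {MAX_TEXT_CHARS}"
        )
    unknown = sorted(set(overrides) - set(DEFAULT_INPUTS))
    if unknown:
        raise KeyError(f"unknown input(s): {', '.join(unknown)}")
    # Later keys win, so overrides replace the defaults.
    return {"text": text, **DEFAULT_INPUTS, **overrides}


if __name__ == "__main__":
    inputs = build_inputs("Welcome back.", voice_id="Calm_Woman", speed=1.1)
    print(inputs["voice_id"], inputs["speed"], inputs["sample_rate"])
```

Rejecting unknown keys is a design choice in this sketch: a misspelled setting fails loudly instead of being silently ignored.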
Performance Notes
Optimized for high fidelity: the model prioritizes audio quality over speed, so generation may be slightly slower than with models designed specifically for low latency.
Input
- Text (String): Text to convert to speech. Every character is 1 token. Maximum 5000 characters. Use <#x#> between words to control pause duration, where x is the pause in seconds (0.01-99.99).
- Pitch (Number, default 0): Speech pitch.
- Speed (Number, default 1): Speech speed.
- Volume (Number, default 1): Speech volume.
- Bitrate (Number, default 128000): Bitrate for the generated speech.
- Channel (String, default mono): Number of audio channels.
- Emotion (String, default auto): Speech emotion.
- Voice Id (String, default Wise_Woman): Desired voice ID. Use a voice ID you have trained (https://replicate.com/minimax/voice-cloning), or one of the following system voice IDs: Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman, Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl, Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl.
- Sample Rate (Number, default 32000): Sample rate for the generated speech.
- Language Boost (String, default None): Enhance recognition of specific languages and dialects.
- English Normalization (Boolean, default false): Enable English text normalization for better number reading (slightly increases latency).
Output
- Output (Inferred)
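The pause syntax described for the Text input can be sketched with a small helper that builds `<#x#>` markers and checks the stated range and length limit. The function names are hypothetical, and two details are assumptions rather than documented behavior: that a two-decimal value such as `0.50` is accepted, and that marker characters count toward the 5000-character limit.

```python
MIN_PAUSE_S = 0.01     # shortest pause stated for the Text input
MAX_PAUSE_S = 99.99    # longest pause stated for the Text input
MAX_TEXT_CHARS = 5000  # maximum text length stated for the Text input


def pause(seconds):
    """Return a pause marker such as <#0.50#> (hypothetical helper)."""
    if not MIN_PAUSE_S <= seconds <= MAX_PAUSE_S:
        raise ValueError(
            f"pause must be between {MIN_PAUSE_S} and {MAX_PAUSE_S} seconds"
        )
    # Two decimals matches the precision of the documented range (assumption).
    return f"<#{seconds:.2f}#>"


def join_with_pause(parts, seconds):
    """Join text fragments with the same pause between each pair."""
    text = pause(seconds).join(parts)
    # Assumption: marker characters count toward the limit, so this check
    # is conservative.
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(
            f"text has {len(text)} characters; the limit is {MAX_TEXT_CHARS}"
        )
    return text


if __name__ == "__main__":
    print(join_with_pause(["Chapter one.", "It was a quiet morning."], 1.5))
```

The resulting string would then be passed as the Text input.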
Type: Node
Status: Official
Package: Nodespell AI
Category: AI / Audio / Minimax