Minimax Speech 02 HD

Official

High-fidelity Text-to-Audio synthesis with emotional expression and multilingual support.

Nodespell AI
AI / Audio / Minimax

Model Overview

A powerful Text-to-Audio (T2A) model that excels at generating natural-sounding speech with a wide range of emotional expressions and multilingual capabilities. It's optimized for high-quality applications such as voiceovers, audiobooks, and virtual assistants.

Best At

Creating studio-quality voiceovers and audiobooks, producing natural dialogue for characters, generating multilingual content, and enabling dynamic voiceovers with emotional nuances.

Limitations / Not Good At

This model is not designed for real-time applications where extremely low latency is critical (consider the Speech-02-Turbo model for that). While it supports many languages, extremely specialized dialects or nuanced poetic readings might require fine-tuning or further testing.

Ideal Use Cases

  • Professional voiceovers for videos and advertisements 🎬
  • Generating audio for audiobooks and podcasts 🎧
  • Creating natural-sounding dialogue for games and animations 🎮
  • Building multilingual customer support bots 🌍
  • Developing accessibility features for content 🔊
  • Voice cloning for personalized audio experiences 👤

Input & Output Format

  • Input: Text prompt, voice ID, speed, volume, pitch, emotion, language settings, and normalization options.
  • Output: An audio file (URI).

Performance Notes

Optimized for high fidelity: the model prioritizes audio quality over speed, so generation may be slower than with models designed specifically for low latency.

Model Examples (4)

Example 01 / 04

Prestige-series teaser

Trailer-style narration for a dramatic series promo.

Source Inputs (1)
Text

At first they called it an accident. Then the dailies came back. Every frame showed the same door, open three inches wider than before. This autumn, the footage tells its own story.

Parameters (9)

  • Text: At first they called it an accident. Then the dailies came back. Every frame showed the same door, open three inches wider than before. This autumn, the footage tells its own story.
  • Voice Id: Deep_Voice_Man
  • Emotion: neutral
  • Speed: 0.95
  • Pitch: 0
  • Volume: 1
  • Channel: mono
  • Sample Rate: 44100
  • Bitrate: 128000

tts, high-fidelity

Inputs (1)

Text

String

Text to convert to speech

Multi Input, Min: 0, Max: 100

Parameters (11)

Text

String

Text to convert to speech. Every character is 1 token. Maximum 5000 characters. Use <#x#> between words to control pause duration (0.01-99.99s).

Default: (empty)
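
The pause syntax and character limit described for the Text parameter can be sketched as a small helper. This is a minimal illustration, not part of the node itself: the function names `pause` and `join_with_pauses` are hypothetical, and the sketch assumes that pause markers count toward the 5000-character limit and that a two-decimal duration is an accepted format.

```python
MAX_CHARS = 5000                    # limit stated in the Text parameter description
MIN_PAUSE, MAX_PAUSE = 0.01, 99.99  # allowed pause range in seconds


def pause(seconds: float) -> str:
    """Return a pause marker such as <#0.50#> for the given duration."""
    if not MIN_PAUSE <= seconds <= MAX_PAUSE:
        raise ValueError(
            f"pause must be between {MIN_PAUSE} and {MAX_PAUSE} seconds"
        )
    return f"<#{seconds:.2f}#>"


def join_with_pauses(segments: list[str], seconds: float) -> str:
    """Join text segments with the same pause marker between each pair."""
    text = pause(seconds).join(segments)
    # Assumption: the marker characters count toward the limit.
    if len(text) > MAX_CHARS:
        raise ValueError(f"text is {len(text)} characters; the limit is {MAX_CHARS}")
    return text


script = join_with_pauses(
    ["At first they called it an accident.", "Then the dailies came back."],
    0.8,
)
print(script)
```

Checking the length before submitting avoids a rejected request; whether whitespace is needed around the marker is worth verifying with a short test clip.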

Pitch

Number

Speech pitch

Default: 0

Speed

Number

Speech speed

Default: 1

Volume

Number

Speech volume

Default: 1

Bitrate

Number

Bitrate for the generated speech

Default: 128000
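
As a rough guide to what the bitrate setting means for output size, the arithmetic below estimates file size from bitrate and duration. It assumes the value is in bits per second and that the audio is encoded at a constant bitrate; the page does not state either, so treat the result as an approximation. The function name is hypothetical.

```python
def estimated_size_bytes(bitrate_bps: int, duration_seconds: float) -> int:
    """Approximate size of a constant-bitrate audio stream, ignoring container overhead."""
    return int(bitrate_bps * duration_seconds / 8)  # 8 bits per byte


# One minute of audio at the default bitrate.
print(estimated_size_bytes(128_000, 60))  # 960000
```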

Channel

String

Number of audio channels

Default: mono

Emotion

String

Speech emotion

Default: auto

Voice Id

String

Desired voice ID. Use a voice ID you have trained (https://replicate.com/minimax/voice-cloning), or one of the following system voice IDs: Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman, Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl, Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl

Default: Wise_Woman
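
To tell a system voice apart from a custom trained one before running the node, the listed IDs can be kept in a set. This is a minimal sketch: the helper name `is_system_voice` is hypothetical, and exact, case-sensitive matching is an assumption.

```python
# System voice IDs copied from the Voice Id parameter description.
SYSTEM_VOICE_IDS = frozenset({
    "Wise_Woman", "Friendly_Person", "Inspirational_girl", "Deep_Voice_Man",
    "Calm_Woman", "Casual_Guy", "Lively_Girl", "Patient_Man", "Young_Knight",
    "Determined_Man", "Lovely_Girl", "Decent_Boy", "Imposing_Manner",
    "Elegant_Man", "Abbess", "Sweet_Girl_2", "Exuberant_Girl",
})


def is_system_voice(voice_id: str) -> bool:
    """Return True if voice_id is one of the listed system voices."""
    return voice_id in SYSTEM_VOICE_IDS


print(is_system_voice("Deep_Voice_Man"))   # True
print(is_system_voice("my_cloned_voice"))  # False: treated as a custom trained ID
```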

Sample Rate

Number

Sample rate for the generated speech

Default: 32000

Language Boost

String

Enhance recognition of specific languages and dialects

Default: None

English Normalization

Boolean

Enable English text normalization for better number reading (slightly increases latency)

Default: false
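
The eleven parameters and their defaults can be collected in one structure, which makes it easy to see what a given run overrides. The sketch below is illustrative only: the class `SpeechParams` and the snake_case field names are assumptions rather than the node's confirmed schema, and the Language Boost default is kept as the literal string "None" because the page lists the type as String.

```python
from dataclasses import asdict, dataclass


@dataclass
class SpeechParams:
    """Defaults as listed in the Parameters section (field names are assumed)."""
    text: str = ""
    pitch: float = 0
    speed: float = 1
    volume: float = 1
    bitrate: int = 128000
    channel: str = "mono"
    emotion: str = "auto"
    voice_id: str = "Wise_Woman"
    sample_rate: int = 32000
    language_boost: str = "None"  # assumption: a literal string, not a null value
    english_normalization: bool = False


# Settings from Example 01, with the text shortened for brevity.
example_01 = SpeechParams(
    text="At first they called it an accident.",
    voice_id="Deep_Voice_Man",
    emotion="neutral",
    speed=0.95,
    sample_rate=44100,
)

# Keep only the fields that differ from the defaults.
defaults = asdict(SpeechParams())
overrides = {k: v for k, v in asdict(example_01).items() if defaults[k] != v}
print(overrides)
```

Listing only the overrides shows that Example 01 changes the voice, emotion, speed, and sample rate, while pitch, volume, channel, and bitrate stay at their defaults.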
Outputs (1)

Output

Inferred

Output

Nodespell

London

Type

Node

Status

Official

Package

Nodespell AI

Category

AI / Audio / Minimax

Input

Text

Output

Audio

Keywords

Text To Speech, Voice Cloning, Multimodal Generation