Minimax Speech 02 HD

Official

High-fidelity Text-to-Audio synthesis with emotional expression and multilingual support.

Nodespell AI
AI / Audio / Minimax

Model Overview

A powerful Text-to-Audio (T2A) model that excels at generating natural-sounding speech with a wide range of emotional expressions and multilingual capabilities. It's optimized for high-quality applications such as voiceovers, audiobooks, and virtual assistants.

Best At

Creating studio-quality voiceovers and audiobooks, producing natural dialogue for characters, generating multilingual content, and enabling dynamic voiceovers with emotional nuances.

Limitations / Not Good At

This model is not designed for real-time applications where very low latency is critical (consider the Speech-02-Turbo model for that). It supports many languages, but highly specialized dialects or nuanced poetic readings might require fine-tuning or further testing.

Ideal Use Cases

  • Professional voiceovers for videos and advertisements 🎬
  • Generating audio for audiobooks and podcasts 🎧
  • Creating natural-sounding dialogue for games and animations 🎮
  • Building multilingual customer support bots 🌍
  • Developing accessibility features for content 🔊
  • Voice cloning for personalized audio experiences 👤

Input & Output Format

  • Input: Text prompt, voice ID, speed, volume, pitch, emotion, language settings, and normalization options.
  • Output: An audio file (URI).

Performance Notes

Optimized for high fidelity: the model prioritizes audio quality over speed, so generation may be somewhat slower than with models designed specifically for low latency.

Inputs (1)

Text

String

Text to convert to speech

Multi Input
Min: 0
Max: 100

Parameters (11)

Text

String

Text to convert to speech. Each character counts as 1 token, with a maximum of 5000 characters. Insert <#x#> between words to add a pause, where x is the pause duration in seconds (0.01-99.99).

Default:
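
The pause syntax and the character limit can be checked before the text reaches the node. The following is a minimal sketch, not part of the node itself: the helper names `pause` and `join_with_pauses` are hypothetical, and it assumes that spaces around a marker are accepted and that markers count toward the 5000-character limit.

```python
MAX_CHARS = 5000                     # limit stated in the Text parameter description
MIN_PAUSE, MAX_PAUSE = 0.01, 99.99   # pause range in seconds, per the description


def pause(seconds: float) -> str:
    """Return a pause marker such as '<#0.50#>' for the given duration."""
    if not MIN_PAUSE <= seconds <= MAX_PAUSE:
        raise ValueError(
            f"pause must be between {MIN_PAUSE} and {MAX_PAUSE} seconds"
        )
    # Two decimals matches the precision of the documented range.
    return f"<#{seconds:.2f}#>"


def join_with_pauses(segments: list[str], seconds: float) -> str:
    """Join text segments with the same pause marker between each pair."""
    text = f" {pause(seconds)} ".join(segments)
    # Assumption: the whole string, markers included, counts toward the limit.
    if len(text) > MAX_CHARS:
        raise ValueError(f"text is {len(text)} characters; the limit is {MAX_CHARS}")
    return text


if __name__ == "__main__":
    print(join_with_pauses(["Hello.", "Welcome back."], 0.5))
```

Counting the markers toward the limit is the conservative choice; if the service ignores them, the check only rejects slightly earlier than necessary.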

Pitch

Number

Speech pitch

Default: 0

Speed

Number

Speech speed

Default: 1

Volume

Number

Speech volume

Default: 1

Bitrate

Number

Bitrate for the generated speech

Default: 128000
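
As a rough guide to what the bitrate means for file size, the arithmetic below converts bits per second to bytes. It is a sketch under two assumptions that this page does not confirm: the output is encoded at a constant bitrate, and container overhead is negligible. The function name `estimated_size_bytes` is hypothetical.

```python
def estimated_size_bytes(duration_s: float, bitrate_bps: int = 128000) -> int:
    """Approximate encoded size for audio of the given duration.

    Assumes constant-bitrate encoding and ignores container overhead.
    """
    # bits per second * seconds = bits; divide by 8 to get bytes.
    return int(duration_s * bitrate_bps / 8)


if __name__ == "__main__":
    # One minute at the default bitrate: 60 * 128000 / 8 = 960000 bytes.
    print(estimated_size_bytes(60))
```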

Channel

String

Number of audio channels

Default: mono

Emotion

String

Speech emotion

Default: auto

Voice Id

String

Desired voice ID. Use a voice ID you have trained (https://replicate.com/minimax/voice-cloning), or one of the following system voice IDs: Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman, Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl, Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl

Default: Wise_Woman
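
When voice IDs come from user input or configuration, it can help to tell the listed system voices apart from custom trained ones. A minimal sketch, assuming IDs must match the list above exactly (including case); the names `SYSTEM_VOICE_IDS` and `is_system_voice` are hypothetical.

```python
# System voice IDs copied from the Voice Id parameter description.
SYSTEM_VOICE_IDS = frozenset({
    "Wise_Woman", "Friendly_Person", "Inspirational_girl", "Deep_Voice_Man",
    "Calm_Woman", "Casual_Guy", "Lively_Girl", "Patient_Man", "Young_Knight",
    "Determined_Man", "Lovely_Girl", "Decent_Boy", "Imposing_Manner",
    "Elegant_Man", "Abbess", "Sweet_Girl_2", "Exuberant_Girl",
})


def is_system_voice(voice_id: str) -> bool:
    """Return True if voice_id is one of the listed system voices.

    Assumption: matching is exact and case-sensitive. An ID that is not in
    the list may still be valid if it refers to a voice you have trained.
    """
    return voice_id in SYSTEM_VOICE_IDS


if __name__ == "__main__":
    print(is_system_voice("Wise_Woman"))
    print(is_system_voice("My_Custom_Voice"))
```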

Sample Rate

Number

Sample rate for the generated speech

Default: 32000

Language Boost

String

Enhance recognition of specific languages and dialects

Default: None

English Normalization

Boolean

Enable English text normalization for better number reading (slightly increases latency)

Default: false
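
The eleven parameters and their defaults can be collected in one place before being passed to the node. The sketch below is illustrative only: the class name `SpeechParams` and the snake_case field names are assumptions rather than the node's confirmed identifiers, and the Language Boost default is represented as the string "None" on the assumption that it is a string option rather than an absent value.

```python
from dataclasses import asdict, dataclass


@dataclass
class SpeechParams:
    """Parameter set mirroring the defaults listed above (field names assumed)."""

    text: str
    pitch: float = 0
    speed: float = 1
    volume: float = 1
    bitrate: int = 128000
    channel: str = "mono"
    emotion: str = "auto"
    voice_id: str = "Wise_Woman"
    sample_rate: int = 32000
    language_boost: str = "None"   # assumed to be a string option
    english_normalization: bool = False

    def to_payload(self) -> dict:
        """Validate the text and return the parameters as a plain dict."""
        if not self.text:
            raise ValueError("text must not be empty")
        if len(self.text) > 5000:
            raise ValueError("text exceeds the 5000-character limit")
        return asdict(self)


if __name__ == "__main__":
    params = SpeechParams(text="Your order has shipped.", emotion="auto")
    print(params.to_payload())
```

Keeping the defaults in a single dataclass means a caller only has to specify the values that differ from the documented ones.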
Outputs (1)

Output

Inferred

Output

Type: Node

Status: Official

Package: Nodespell AI

Category: AI / Audio / Minimax

Input: Text

Output: Audio

Keywords: Text To Speech, Voice Cloning, Multimodal Generation