Omni Human

Generate professional‑quality animated human videos from a single image and audio clip

Model Overview

OmniHuman turns a still photograph of a person and an accompanying audio clip into a realistic animated video. By conditioning on the pose in the image and the rhythm of the audio, it creates a high‑fidelity animation that captures facial expressions, lip sync and body movement.

Best At

  • Quickly produces polished avatar videos for social media, branded content or education.
  • Works well with short, high‑quality audio (≤15 s) – the model keeps sync and motion realistic.
  • Handles reference images of any aspect ratio and framing (portrait, half‑body, full‑body) and adapts the video output accordingly.
  • Supports a wide range of styles, from realistic portraits to cartoon‑like characters.

Limitations / Not Good At

  • Audio longer than 15 s begins to degrade video quality, and the model is not designed for long‑form content.
  • Requires a clear, high‑resolution reference image – low‑quality or heavily occluded images produce blurry or mismatched results.
  • Complex lighting or extreme poses may not translate perfectly, as the model relies heavily on learned motion patterns.
  • Not suited for video‑driven animation: it accepts only audio, not a reference video, as the driving signal.

Ideal Use Cases

  • Short TikTok or Reels animations with a brand avatar or influencer.
  • Product showcase videos where a spokesperson appears animated.
  • Educational or training clips featuring an animated presenter.
  • Marketing promos that need a quick, polished video without a lot of editing.

Input & Output Format

  • Inputs:
    • image – URL or file path to a human image (any aspect ratio).
    • audio – URL or file path to an MP3/WAV clip (best quality under 15 s).
  • Output: a URI pointing to the generated MP4 video.
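
For illustration, here is a minimal Python sketch of how a workflow step might submit these inputs and read back the video URI, assuming the node is reachable over HTTP. The endpoint URL and JSON field names are placeholders, not a documented Nodespell API; only the two inputs (image, audio) and the MP4 URI output come from this page.

    # Hypothetical sketch: endpoint URL and field names are placeholders.
    import requests

    NODE_ENDPOINT = "https://example.invalid/api/nodes/omni-human"  # placeholder, not a real URL

    def run_omni_human(image: str, audio: str) -> str:
        """Submit one image + audio pair and return the URI of the generated MP4."""
        payload = {"image": image, "audio": audio}   # inputs as described above
        resp = requests.post(NODE_ENDPOINT, json=payload, timeout=600)
        resp.raise_for_status()
        return resp.json()["output"]                 # URI pointing to the MP4 video

    # Example usage with placeholder assets:
    # video_uri = run_omni_human("https://example.com/avatar.png",
    #                            "https://example.com/voice.mp3")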

Performance Notes

  • Generates a single clip relatively quickly (minutes for a 15 s video on typical GPU nodes).
  • Video quality degrades as audio length increases; shorter clips produce cleaner results.
  • The process is GPU‑intensive due to rendering; single requests are fast, while batch processing is considerably more resource‑heavy.
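
If several clips do need to be generated, a conservative approach consistent with these notes is to submit them one at a time rather than in parallel. The sketch below is an assumption about workflow structure, not platform guidance, and reuses the hypothetical run_omni_human helper from the Input & Output Format section.

    # Sequential batching sketch (assumption): reuses the hypothetical
    # run_omni_human() helper above; one request at a time keeps load predictable.
    def run_batch(image: str, audio_clips: list[str]) -> list[str]:
        video_uris = []
        for clip in audio_clips:                 # each clip ideally under 15 s
            video_uris.append(run_omni_human(image, clip))
        return video_uris
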
Inputs (2)

Image

String

Input image containing a human subject, face or character.

Min: 0, Max: 100

Audio

String

Input audio file (MP3, WAV, etc.). For the best quality, keep the audio no longer than 15 seconds; beyond that, video quality begins to degrade. If you have longer audio to process, we recommend splitting it into 15‑second chunks (see the sketch below).

Min: 0, Max: 100
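
A minimal sketch of the recommended 15‑second chunking, assuming the pydub library is available and the source file is local. The helper name and output file names are illustrative; only the 15‑second window comes from the guidance above.

    # Chunking sketch (assumes pydub is installed): splits audio into <=15 s segments.
    from pydub import AudioSegment

    CHUNK_MS = 15 * 1000  # 15-second window recommended above

    def split_audio(path: str, out_prefix: str = "chunk") -> list[str]:
        """Split an audio file into 15-second pieces and export each as MP3."""
        audio = AudioSegment.from_file(path)
        out_paths = []
        for i, start in enumerate(range(0, len(audio), CHUNK_MS)):
            piece = audio[start:start + CHUNK_MS]          # slicing is in milliseconds
            out_path = f"{out_prefix}_{i:03d}.mp3"
            piece.export(out_path, format="mp3")
            out_paths.append(out_path)
        return out_paths
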
Outputs (1)

Output

Inferred

URI pointing to the generated MP4 video.

Type: Node
Status: Official
Package: Nodespell AI
Category: AI / Video / Bytedance
Input: Image, Audio
Output: Video
Keywords: Video Generation, Aspect Control, Length Control