Omni Human
Generate professional‑quality animated human videos from a single image and audio clip
Model Overview
OmniHuman turns a still photograph of a person and an accompanying audio clip into a realistic, motion‑based video. By conditioning on the image’s pose and the audio’s rhythm, it creates a high‑fidelity animation that captures facial expressions, lip‑sync and body movement.
Best At
- Quickly producing polished avatar videos for social media, branded content, or educational material.
- Short, high‑quality audio clips (≤15 s), where lip‑sync and motion stay realistic.
- Images of any aspect ratio and framing (portrait, half‑body, full‑body), with the video output adapted accordingly.
- A wide range of styles, from realistic portraits to cartoon‑like characters.
Limitations / Not Good At
- Audio longer than 15 s begins to degrade video quality, and the model is not designed for long‑form content.
- Requires a clear, high‑resolution reference image – low‑quality or heavily occluded images produce blurry or mismatched results.
- Complex lighting or extreme poses may not translate perfectly, as the model relies heavily on learned motion patterns.
- Cannot be driven by an external video: the model does not accept a video as the driving (motion reference) sequence.
Ideal Use Cases
- Short TikTok or Reels animations with a brand avatar or influencer.
- Product showcase videos where a spokesperson appears animated.
- Educational or training clips featuring an animated presenter.
- Marketing promos that need a quick, polished video without a lot of editing.
Input & Output Format
- Inputs:
  - image – URL or file path to a human image (any aspect ratio).
  - audio – URL or file path to an MP3/WAV clip (best quality under 15 s).
- Output: a URI pointing to the generated MP4 video.
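As a rough illustration of the two inputs listed above, the sketch below collects them into a dict and rejects audio that is not MP3 or WAV. The helper name `build_inputs` and the dict shape are hypothetical; this section does not specify how the values are actually passed to the node.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: only the two formats named in the inputs list are accepted here.
SUPPORTED_AUDIO_EXTENSIONS = {".mp3", ".wav"}


def build_inputs(image: str, audio: str) -> dict:
    """Return a dict holding the two documented inputs after a basic format check.

    Hypothetical helper: the keys mirror the input names above, but the real
    submission mechanism is not described in this document.
    """
    # urlparse accepts both URLs and plain paths; .path drops any query string.
    audio_path = urlparse(audio).path
    extension = PurePosixPath(audio_path).suffix.lower()
    if extension not in SUPPORTED_AUDIO_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {extension or 'none'}")
    return {"image": image, "audio": audio}
```

The check looks only at the file extension, so it will not catch a mislabeled file or an overlong clip.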
Performance Notes
- Generates a single clip relatively quickly (minutes for a 15 s video on typical GPU nodes).
- Video quality degrades as audio length grows; shorter clips produce cleaner results.
- Generation is GPU‑intensive; a single job is fast, while batch processing is more resource‑heavy.
Image
String
Input image containing a human subject, face, or character.
Audio
String
Input audio file (MP3, WAV, etc.). For the best output quality, audio should be no longer than 15 seconds; beyond that, video quality begins to degrade. If you have a lot of audio to process, we recommend splitting it into 15‑second chunks.
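The chunking recommended above can be sketched with Python's standard `wave` module. This is a minimal example under stated assumptions: it handles uncompressed WAV only (MP3 would need a separate decoder), and the function name `split_wav` and the output file naming are illustrative rather than part of the node.

```python
import wave
from pathlib import Path

MAX_CHUNK_SECONDS = 15  # the limit recommended above


def split_wav(source: str, out_dir: str, chunk_seconds: int = MAX_CHUNK_SECONDS) -> list:
    """Split a WAV file into consecutive chunks of at most chunk_seconds each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with wave.open(source, "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of the source audio
            chunk_path = out / f"{Path(source).stem}_{index:03d}.wav"
            with wave.open(str(chunk_path), "wb") as dst:
                # Copy channels, sample width, and rate; the frame count in the
                # header is corrected automatically when the file is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            chunk_paths.append(str(chunk_path))
            index += 1
    return chunk_paths
```

Each resulting file can then be used as a separate audio input. The split points are fixed time boundaries, so a chunk may end mid-word; cutting at pauses instead would need additional silence detection.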
Output
Inferred
Output
Type
Node
Status
Official
Package
Nodespell AI
Category
AI / Video / Bytedance