AI-powered video object removal using Netflix VOID: removing people, objects, shadows, and reflections from footage.
Netflix VOID: Video Object & Interaction Deletion
VOID uses a fine-tuned CogVideoX-Fun diffusion model to remove objects from video in a physically consistent way. Unlike simple inpainting, it reasons about the shadows, reflections, and physical interactions caused by the removed object, producing plausible clean plates.
VLM Mask Reasoner: Automatic Quadmask Generation
The reasoner combines SAM2 segmentation with Gemini VLM analysis to automatically identify objects and their interaction regions (shadows, reflections). It produces a four-value quadmask encoding: primary object (0), overlap (63), affected regions (127), and background (255).
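The four-level encoding can be illustrated with a minimal NumPy sketch. It is not the project's implementation: the function name `build_quadmask` is hypothetical, and it assumes that "overlap" means pixels covered by both the object mask and the affected-region mask.

```python
import numpy as np

# Quadmask levels as listed in the text above.
PRIMARY, OVERLAP, AFFECTED, BACKGROUND = 0, 63, 127, 255


def build_quadmask(object_mask: np.ndarray, affected_mask: np.ndarray) -> np.ndarray:
    """Combine two boolean (H, W) masks into one uint8 quadmask.

    Assumption: "overlap" means pixels covered by both masks.
    """
    object_mask = object_mask.astype(bool)
    affected_mask = affected_mask.astype(bool)
    quad = np.full(object_mask.shape, BACKGROUND, dtype=np.uint8)
    quad[affected_mask] = AFFECTED                 # shadows, reflections
    quad[object_mask] = PRIMARY                    # the object to remove
    quad[object_mask & affected_mask] = OVERLAP    # covered by both masks
    return quad


# Tiny 1x4 example: background, affected only, both, object only.
obj = np.array([[False, False, True, True]])
aff = np.array([[False, True, True, False]])
quadmask = build_quadmask(obj, aff)
print(quadmask)
```

In a real pipeline the two input masks would come from the segmentation and VLM analysis stages, one pair per frame.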
Pipeline Stages
- VLM Mask Reasoner: auto-segments people and objects and identifies shadow/reflection regions
- VOID Pass 1: diffusion-based inpainting at the native 384×672 resolution (30 denoising steps)
- FHD Upscale: Lanczos upscale to 1920×1080 for delivery
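The final stage can be sketched for a single frame using Pillow's Lanczos filter. This is an illustration only: the pipeline's actual upscaling code is not shown here, `upscale_frame` is a hypothetical helper, and the sketch assumes that 384×672 means height × width (a landscape frame).

```python
import numpy as np
from PIL import Image

NATIVE_SIZE = (672, 384)       # (width, height); assumes 384x672 is height x width
DELIVERY_SIZE = (1920, 1080)   # (width, height)


def upscale_frame(frame: np.ndarray) -> np.ndarray:
    """Lanczos-resample one RGB uint8 frame of shape (H, W, 3) to FHD."""
    image = Image.fromarray(frame)
    return np.asarray(image.resize(DELIVERY_SIZE, resample=Image.LANCZOS))


# A blank frame at the assumed native resolution stands in for model output.
native = np.zeros((NATIVE_SIZE[1], NATIVE_SIZE[0], 3), dtype=np.uint8)
fhd = upscale_frame(native)
print(fhd.shape)
```

Lanczos resampling interpolates between existing pixels and cannot restore detail that the native-resolution pass never produced.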
⚠️ Resolution Limitation
VOID's native inference resolution is 384×672. Output is upscaled to 1920×1080 for delivery, but quality is visibly degraded compared to the original footage. This is inherent to the current model architecture. Results are best on clips where the subject occupies a moderate portion of the frame.
Output Files
- Clean Plate: the inpainted video with objects removed, upscaled to FHD
- Comparison: 2×2 grid of input | output | masked input | error map
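One way such a comparison grid could be assembled for a single frame is sketched below. The tile order (input and output on top, masked input and error map below), the definition of the masked input, and the definition of the error map are all assumptions, and `comparison_grid` is a hypothetical name.

```python
import numpy as np


def comparison_grid(inp: np.ndarray, out: np.ndarray, quadmask: np.ndarray) -> np.ndarray:
    """Tile input | output over masked input | error map for one frame.

    inp, out: uint8 RGB arrays of shape (H, W, 3) at the same resolution.
    quadmask: uint8 array of shape (H, W), where 255 marks background.
    """
    # Assumption: "masked input" blacks out every non-background pixel.
    masked = np.where((quadmask == 255)[..., None], inp, 0).astype(np.uint8)
    # Assumption: the error map is the per-pixel absolute difference.
    error = np.abs(inp.astype(np.int16) - out.astype(np.int16)).astype(np.uint8)
    top = np.concatenate([inp, out], axis=1)
    bottom = np.concatenate([masked, error], axis=1)
    return np.concatenate([top, bottom], axis=0)


# Synthetic 4x6 frames stand in for real footage.
h, w = 4, 6
inp = np.full((h, w, 3), 200, dtype=np.uint8)
out = np.full((h, w, 3), 180, dtype=np.uint8)
quadmask = np.full((h, w), 255, dtype=np.uint8)
quadmask[1:3, 2:4] = 0   # a small "primary object" region
grid = comparison_grid(inp, out, quadmask)
print(grid.shape)
```

The difference is computed in `int16` so that subtracting two `uint8` frames cannot wrap around.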