WorldReel

Abstract

Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly produces RGB frames and 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement. We train WorldReel on a careful combination of synthetic and real data: synthetic data provides precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments demonstrate that WorldReel sets a new state of the art for consistent video generation with dynamic scenes and moving cameras, improving geometric consistency and motion coherence and reducing view-time artifacts relative to competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact with, and reason about scenes through a single, stable spatio-temporal representation.

Overview

Overview of WorldReel. We augment a video diffusion transformer with a geo-motion latent, derived from RGB and 2.5D cues such as depth and optical flow, to inject a 4D inductive bias for spatio-temporal consistency. A temporal DPT decoder is trained with direct supervision and regularization to predict unified 4D outputs (depth/point cloud, calibrated camera, 3D scene flow, and masks).
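To make the relationship between these 4D outputs concrete, the following is a minimal NumPy sketch, not WorldReel's implementation, of how depth, camera parameters, and optical flow relate to pointmaps and 3D scene flow under a standard pinhole camera model. The function names, the 3x3 intrinsics matrix `K`, the 4x4 camera-to-world matrices, and the nearest-pixel lookup are all assumptions made for illustration.

```python
import numpy as np

def unproject_depth(depth, K):
    """Lift a depth map (H, W) to camera-space points (H, W, 3).

    Assumes a pinhole camera with 3x3 intrinsics K (hypothetical input).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    return np.stack([x, y, depth], axis=-1)

def to_world(points_cam, cam_to_world):
    """Apply a 4x4 camera-to-world transform to (H, W, 3) points."""
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return points_cam @ R.T + t

def scene_flow_from_2d(depth0, depth1, flow, K, c2w0, c2w1):
    """World-space 3D displacement for each pixel of frame 0.

    flow[..., 0] is the horizontal and flow[..., 1] the vertical pixel
    displacement from frame 0 to frame 1. Returns (scene_flow, valid),
    where valid marks pixels whose flow target lies inside frame 1.
    """
    h, w = depth0.shape
    p0 = to_world(unproject_depth(depth0, K), c2w0)  # pointmap, frame 0
    p1 = to_world(unproject_depth(depth1, K), c2w1)  # pointmap, frame 1

    # Follow the optical flow to the corresponding pixel in frame 1
    # (nearest-pixel lookup; a real pipeline would likely interpolate).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    u1 = np.rint(u + flow[..., 0]).astype(int)
    v1 = np.rint(v + flow[..., 1]).astype(int)
    valid = (u1 >= 0) & (u1 < w) & (v1 >= 0) & (v1 < h)
    u1c, v1c = np.clip(u1, 0, w - 1), np.clip(v1, 0, h - 1)

    scene_flow = p1[v1c, u1c] - p0
    scene_flow[~valid] = 0.0  # no correspondence outside the frame
    return scene_flow, valid
```

In this sketch, a static scene viewed by a moving camera yields zero scene flow wherever the flow target stays in frame, so nonzero values correspond to object motion rather than camera motion.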

Unified 4D Videos

4D outputs jointly generated by WorldReel.

Panels, per example: Input Image, Generated Video, Generated Depth, Generated Optical Flow, Generated Scene Flow, Generated Mask.

4D Scene Rendering

Rendering results of the generated 4D scene.

Panels, per example: Input Image, Generated Video, Rendered Scene.

In-the-wild Video Comparison

Comparison with other video generation methods.

Panels, per example: Input Image, CogVideoX, 4DNeX, DimensionX, GeoVideo, WorldReel (ours).

Generated In-the-wild Video

Qualitative samples generated by WorldReel on in-the-wild images.

Panels: Input Image and Generated Videos, shown separately for static scenes and dynamic scenes.

Ablation

Ablation study results.

We validate the effectiveness of our core designs: the geo-motion augmented latents and the joint optimization strategy.

As demonstrated in the visual comparison:

Base Finetuned: a finetuned version of the CogVideoX model. Its results show unnatural motion and inconsistent appearance across frames.
w/o Geo-Motion: using an RGB-only model (without geo-motion latents) leads to noticeable visual artifacts and ghosting, degrading scene stability.
w/o Joint: skipping the joint training and regularization stage reduces subject consistency, causing incoherent motion, particularly for dynamic objects and humans.
WorldReel (Ours): by effectively aligning appearance and geometry, our full model produces the smoothest non-rigid motion and maintains natural, high-quality complex dynamics.