WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling

Shaoheng Fang1, Hanwen Jiang2, Yunpeng Bai1, Niloy J. Mitra2,3, Qixing Huang1
1The University of Texas at Austin 2Adobe Research 3University College London
ArXiv Paper (PDF) Code (coming soon)
Abstract

Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly produces RGB frames and 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent modeling of geometry and appearance over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement. We train WorldReel on a careful combination of synthetic and real data: synthetic data provides precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments demonstrate that WorldReel sets a new state of the art for consistent video generation with dynamic scenes and moving cameras, improving geometric consistency and motion coherence and reducing view-time artifacts relative to competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact with, and reason about scenes through a single, stable spatio-temporal representation.

Overview

Overview of WorldReel. We augment a video diffusion transformer with a geo-motion latent (from RGB and 2.5D cues such as depth/optical flow) to inject a 4D inductive bias for spatio-temporal consistency. A temporal DPT decoder is trained with direct supervision and regularization to predict unified 4D outputs (depth/point cloud, calibrated camera, 3D scene flow, and masks).
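
The caption lists depth/point cloud and a calibrated camera among the unified 4D outputs. As a rough illustration of how these quantities relate under a standard pinhole model, the sketch below unprojects a depth map into a world-space pointmap. This is not the authors' code: the function name `depth_to_pointmap`, the intrinsics matrix `K`, and the camera-to-world pose convention are assumptions made for this example.

```python
import numpy as np

def depth_to_pointmap(depth, K, cam_to_world):
    """Unproject an (H, W) depth map into an (H, W, 3) world-space pointmap.

    Assumptions (not taken from the paper): pinhole intrinsics K (3x3),
    z-depth along the optical axis, and a 4x4 camera-to-world pose.
    """
    H, W = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project pixels through the inverse intrinsics, then scale by depth.
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth[..., None]
    # Move camera-frame points into the world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t

# Example: a flat wall 2 units in front of an identity-pose camera.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.5], [0.0, 0.0, 1.0]])
pointmap = depth_to_pointmap(np.full((3, 4), 2.0), K, np.eye(4))
```

Stacking such pointmaps over all frames, each with its own pose, gives one shared world frame for the whole video, which is the sense in which a pointmap plus camera trajectory describes a single underlying scene.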

Unified 4D Videos

4D outputs jointly generated by WorldReel.

Five examples are shown (Input Images 4, 5, 2, 3, and 6). Each example presents the input image together with the generated video, depth, optical flow, scene flow, and mask.
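
The depth, scene flow, and optical flow outputs in this gallery are geometrically linked: given depth, 3D scene flow, and relative camera motion, the 2D optical flow follows by reprojection. The sketch below shows that relationship under a pinhole model. It is an illustration only, not WorldReel's implementation; the function name `induced_optical_flow`, the choice to express scene flow in the frame-t camera coordinates, and the `T_rel` convention (mapping camera-t coordinates to camera-(t+1) coordinates) are all assumptions.

```python
import numpy as np

def induced_optical_flow(depth, scene_flow, K, T_rel):
    """Optical flow implied by depth, 3D scene flow, and camera motion.

    Assumed conventions (hypothetical, for illustration):
      depth:      (H, W) z-depth in frame t
      scene_flow: (H, W, 3) per-pixel 3D motion in camera-t coordinates
      K:          3x3 pinhole intrinsics shared by both frames
      T_rel:      4x4 transform from camera-t to camera-(t+1) coordinates
    Returns an (H, W, 2) flow field in pixels.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Unproject pixels to 3D points in the frame-t camera.
    pts = (pix @ np.linalg.inv(K).T) * depth[..., None]
    # Apply object motion, then express the result in the next camera.
    moved = pts + scene_flow
    moved = moved @ T_rel[:3, :3].T + T_rel[:3, 3]
    # Project into frame t+1 and subtract the original pixel positions.
    proj = moved @ K.T
    uv_next = proj[..., :2] / proj[..., 2:3]
    return uv_next - pix[..., :2]

K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.5], [0.0, 0.0, 1.0]])
depth = np.full((3, 4), 2.0)
motion = np.zeros((3, 4, 3))
motion[..., 0] = 0.1  # every point moves 0.1 units along +x
flow = induced_optical_flow(depth, motion, K, np.eye(4))
```

With a static camera and a point at depth z moving by dx along x, the horizontal flow is fx * dx / z, so the example above yields 5 pixels everywhere. Comparing a predicted optical flow against this reprojection is one simple way to check whether a set of 4D outputs is mutually consistent.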

4D Scene Rendering

Rendering results of the generated 4D scene.

Four examples are shown (Input Images 1 to 4). Each example presents the input image, the generated video, and the rendered scene.

In-the-wild Video Comparison

Comparison with other video generation methods.

Seven examples are shown (Input Images 1 to 7). Each example presents the input image alongside results from CogVideoX, 4DNeX, DimensionX, GeoVideo, and WorldReel (ours).

Generated In-the-wild Video

Qualitative samples generated by WorldReel on in-the-wild images.

 

Static scenes: Inputs 1 to 5, each shown with its generated video.

Dynamic scenes: Inputs 6 to 10, each shown with its generated video.

Ablation

Ablation study results.

We validate the effectiveness of our core designs: the geo-motion augmented latents and the joint optimization strategy.

The visual comparison covers three examples (Input Images 1 to 3). Each example presents the input image alongside results from four variants: Base Finetuned, w/o Geo-Motion, w/o Joint, and WorldReel (ours).

BibTeX
@article{fang2025worldreel,
  title={WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling},
  author={Fang, Shaoheng and Jiang, Hanwen and Bai, Yunpeng and Mitra, Niloy J. and Huang, Qixing},
  journal={arXiv preprint arXiv:2512.07821},
  year={2025}
}