Video Diffusion Alignment via Reward Gradient

Mihir Prabhudesai*         Zheyang Qin*         Russell Mendonca*         Katerina Fragkiadaki         Deepak Pathak
Carnegie Mellon University

Abstract

We have made significant progress towards building foundational video diffusion models. As these models are trained using large-scale unsupervised data, it has become crucial to adapt them to specific downstream tasks, such as video-text alignment or ethical video generation. Adapting these models via supervised fine-tuning requires collecting target datasets of videos, which is challenging and tedious. In this work, we instead utilize pre-trained reward models that are learned via preferences on top of powerful discriminative models. These models provide dense gradient information with respect to the generated RGB pixels, which is critical for efficient learning in complex search spaces such as videos. We show that our approach enables alignment of video diffusion models for aesthetic generation, text-video alignment, and long-horizon video generation 3X longer than the training sequence length. We also show that our approach learns far more efficiently, in terms of reward queries and compute, than prior gradient-free approaches for video generation.
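A minimal sketch of the reward-gradient idea, assuming a PyTorch-style latent video diffusion model and a differentiable reward model; the names `denoiser`, `decoder`, and `reward_model` are placeholder interfaces, not the released implementation:

```python
import torch

def vader_style_update(latents, text_emb, denoiser, decoder, reward_model,
                       optimizer, num_steps=25, backprop_steps=10):
    """One reward-gradient update: denoise, decode, score, backpropagate.

    Assumed placeholder interfaces:
      denoiser(latents, t, text_emb) -> less-noisy latents
      decoder(latents)               -> RGB frames, shape (B, T, 3, H, W)
      reward_model(frames, text_emb) -> scalar reward per video, shape (B,)
    Only the last `backprop_steps` denoising steps keep gradients
    (truncated backpropagation), which bounds activation memory.
    """
    for i, t in enumerate(torch.linspace(1.0, 0.0, num_steps)):
        keep_grad = i >= num_steps - backprop_steps
        with torch.set_grad_enabled(keep_grad):
            latents = denoiser(latents, t, text_emb)

    frames = decoder(latents)                      # differentiable decode to RGB
    loss = -reward_model(frames, text_emb).mean()  # ascend the reward
    optimizer.zero_grad()
    loss.backward()                                # dense pixel-level gradients
    optimizer.step()
    return loss.detach()
```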

Aesthetic and HPS Reward

PickScore Reward

HPS Reward

Object Removal Reward

Removing books with the YOLOS object detection model.

V-JEPA Reward

Improve temporal consistency for Stable Video Diffusion, an image-to-video model.

Aesthetic and ViCLIP Reward

Improve text-video alignment for VideoCrafter2.

Aesthetic and ViCLIP Reward Curve

Training curve for VideoCrafter-based VADER fine-tuned using the Aesthetic and ViCLIP rewards.

Reward Curve

Training Efficiency Comparison

Reward Curve

Training efficiency comparison against various baselines when trained for longer. The base model here is ModelScope. We compare VADER against DPO, DDPO, on-policy DPO, and on-policy DDPO. To implement the on-policy versions of the baselines, we reduce the UTD (update-to-data) ratio to 1, i.e., we perform only a single gradient update for each sampled datapoint. We observe that VADER significantly outperforms all of them in terms of compute efficiency.
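A rough illustration of the UTD knob (the helpers `sample_batch` and `update_on` are hypothetical): an off-policy baseline reuses each sampled batch for several gradient updates, while the on-policy variants set the ratio to 1:

```python
def train_with_utd(sample_batch, update_on, num_iters=1000, utd_ratio=1):
    """Generic loop performing `utd_ratio` gradient updates per sampled batch.

    `sample_batch()` generates videos with the current model and queries the
    reward; `update_on(batch)` performs one gradient step. Setting
    utd_ratio=1 corresponds to the on-policy DPO/DDPO baselines in the plot.
    """
    for _ in range(num_iters):
        batch = sample_batch()      # expensive: sample videos, query reward
        for _ in range(utd_ratio):  # cheaper: reuse the same batch
            update_on(batch)
```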

Diversity Test

Metric VideoCrafter2 VADER-PickScore VADER-Aesthetic and HPS VADER-Aesthetic and ViCLIP
Average Variance 0.0037 0.0026 0.0023 0.0031

Diversity of generated videos for VADER. The base model is VideoCrafter2. We generate 500 videos for each model and prompt combination, using 5 prompts for a total of 2500 videos per model. Diversity is measured as the variance of VideoMAE latent-space embeddings across the 500 videos for each prompt, averaged over all prompts. We find that the VADER variants exhibit reduced diversity compared to the base model VideoCrafter2. Prior works (Robert Kirk et al.; Sonia K. Murthy et al.) report similar findings, where aligning a model for a specific use case often reduces diversity. Some visualizations are exhibited in the Diversity Gallery.
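A sketch of how this diversity number can be computed, assuming per-prompt VideoMAE embeddings have already been extracted (the exact reduction over embedding dimensions is an assumption, not taken from the released code):

```python
import torch

def average_embedding_variance(embeddings_per_prompt):
    """Diversity as the variance of video embeddings, averaged over prompts.

    `embeddings_per_prompt`: list of tensors, one per prompt, each of shape
    (num_videos, embed_dim), e.g. 5 prompts x (500, D) VideoMAE embeddings.
    """
    per_prompt_variance = [
        emb.var(dim=0, unbiased=True).mean()  # variance across videos, averaged over dims
        for emb in embeddings_per_prompt
    ]
    return torch.stack(per_prompt_variance).mean().item()
```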

Memory Usage Comparison

Method VRAM System RAM Total RAM
LoRA + Mixed Precision 12.1 GB 264.2 GB 276.3 GB
+ Subsampling Frames 12.1 GB 216.8 GB 228.9 GB
+ Truncated Backpropagation 12.1 GB 57.3 GB 69.4 GB
+ Gradient Checkpointing 12.1 GB 20.4 GB 32.5 GB

Ablation of memory usage for different components in ModelScope-based VADER. For this experiment, we offload memory to CPU main memory to prevent GPU out-of-memory errors. Starting from standard LoRA + Mixed Precision, each row adds one component (Subsampling Frames, Truncated Backpropagation, Gradient Checkpointing) on top of the previous row. Total RAM usage drops by roughly 244 GB (from 276.3 GB to 32.5 GB) once all components are applied.
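A minimal sketch of the frame-subsampling component, assuming video latents of shape (B, T, C, H, W); scoring only a random subset of frames shrinks the activations kept for the reward backward pass (the function name and sampling scheme are illustrative, not the released implementation):

```python
import torch

def subsample_frames(latents, num_keep=4):
    """Randomly keep `num_keep` of the T frames before decoding and scoring.

    `latents` has shape (B, T, C, H, W). Only the kept frames are decoded to
    RGB and passed to the reward model, so the backward pass stores far fewer
    activations. Truncated backpropagation and gradient checkpointing
    (torch.utils.checkpoint) reduce memory further, as the table shows.
    """
    T = latents.shape[1]
    idx = torch.randperm(T)[:num_keep].sort().values  # keep temporal order
    return latents[:, idx]
```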

Reward Correlation

Model HPS Score PickScore Aesthetic Score ViCLIP Score
VideoCrafter2 0.2564 20.9231 5.2219 0.2643
VADER-Aesthetic and HPS 0.2651 21.1345 5.7965 0.2622
VADER-PickScore 0.2669 21.4911 5.5757 0.2640
VADER-Aesthetic and ViCLIP 0.2511 20.8927 5.6241 0.2628
Reward Curve

The base model is VideoCrafter2. In this table, we study how optimizing for a specific reward function via VADER affects scores on the other reward functions. We observe that the HPS score increases significantly after fine-tuning the base model with the PickScore reward, indicating a strong positive correlation between PickScore and HPS. In contrast, we find a strong negative correlation between the ViCLIP and Aesthetic reward functions.

EvalCrafter Evaluation

Evaluated on VideoCrafter-based VADER.

Model Temporal Coherence Motion Quality
VideoCrafter2 55.90 52.89
T2V Turbo (4 Steps) 57.10 54.93
T2V Turbo (8 Steps) 57.05 55.34
VADER-Aesthetic and HPS 59.65 55.46
VADER-PickScore 60.75 54.65
VADER-Aesthetic and ViCLIP 57.08 54.25

EvalCrafter evaluation results for VADER. EvalCrafter computes Temporal Coherence from Warping Error, Semantic Consistency (cosine similarity of the embeddings of consecutive frames), and Face Consistency, which assess frame-wise pixel and semantic consistency. Motion Quality is evaluated through Action-Score (action classification accuracy), Flow-Score (average optical flow between frames obtained from RAFT), and Motion AC-Score (amplitude classification consistency with the text prompt). We generate 700 videos from each model for this comparison. The results show that all VADER variants outperform the base model (VideoCrafter2).
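As an illustration of the semantic-consistency component, the score can be approximated as the mean cosine similarity between embeddings of consecutive frames (a sketch assuming per-frame embeddings are already available; this is not the exact EvalCrafter code):

```python
import torch
import torch.nn.functional as F

def semantic_consistency(frame_embeddings):
    """Mean cosine similarity between consecutive frame embeddings.

    `frame_embeddings`: tensor of shape (T, D), one embedding per frame,
    e.g. CLIP image embeddings. Higher values indicate smoother semantics.
    """
    sims = F.cosine_similarity(frame_embeddings[:-1], frame_embeddings[1:], dim=-1)
    return sims.mean().item()
```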

VBench Evaluation (EvalCrafter Prompts)

Model Subject Consistency Background Consistency Motion Smoothness Dynamic Degree Aesthetic Quality Imaging Quality Weighted Average
VideoCrafter2 0.9544 0.9652 0.9688 0.5346 0.5752 0.6677 0.7997
T2V Turbo (4 Steps) 0.9639 0.9656 0.9562 0.4771 0.6183 0.7266 0.8126
T2V Turbo (8 Steps) 0.9735 0.9736 0.9572 0.3686 0.6265 0.7168 0.8058
VADER-Aesthetic and HPS 0.9659 0.9713 0.9734 0.4741 0.6295 0.7145 0.8167
VADER-PickScore 0.9668 0.9727 0.9726 0.3732 0.6094 0.6762 0.7971
VADER-Aesthetic and ViCLIP 0.9564 0.9662 0.9714 0.5519 0.6008 0.6566 0.8050

VBench evaluation results for VADER using EvalCrafter prompts. The base model is VideoCrafter2. The metrics used in VBench include: Subject Consistency (consistency of the main subject across frames, evaluated using DINO feature similarity), Background Consistency (using CLIP feature similarity), Motion Smoothness (fluidity of motion, based on motion priors from a frame interpolation model), Dynamic Degree (extent of motion in the video, estimated with RAFT), Aesthetic Quality (assessed via the LAION aesthetic predictor), and Imaging Quality (using MUSIQ). The weighted average assigns a weight of 1 to every metric except Dynamic Degree, which is weighted 0.5. We generate 700 videos for each model using EvalCrafter prompts that are not seen during training. Among the VADER variants, VADER-PickScore achieves the best consistency scores, while VADER-Aesthetic and HPS achieves the best aesthetic and imaging quality. Overall, VADER-Aesthetic and HPS performs best.
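The weighted average in the last column can be reproduced directly from the row entries; for example, the VideoCrafter2 row works out as follows (a small sanity check, not part of VBench itself):

```python
# VideoCrafter2 row from the table above
metrics = {
    "subject_consistency": 0.9544,
    "background_consistency": 0.9652,
    "motion_smoothness": 0.9688,
    "dynamic_degree": 0.5346,
    "aesthetic_quality": 0.5752,
    "imaging_quality": 0.6677,
}
weights = {name: 1.0 for name in metrics}
weights["dynamic_degree"] = 0.5  # Dynamic Degree is down-weighted

weighted_avg = sum(weights[n] * v for n, v in metrics.items()) / sum(weights.values())
print(round(weighted_avg, 4))  # 0.7997, matching the table
```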

VBench Evaluation (Standard Prompt Suite)

Model Subject Consistency Background Consistency Motion Smoothness Dynamic Degree Aesthetic Quality Imaging Quality Temporal Flickering Quality Score
VideoCrafter2 96.85 98.22 97.73 42.50 63.13 67.22 98.41 82.20
Pika 96.76 98.95 99.51 37.22 63.15 62.33 99.77 82.68
Gen-2 97.61 97.61 99.58 18.89 66.96 67.42 99.56 82.47
T2V Turbo (VC2) 96.28 97.02 97.34 49.17 63.04 72.49 97.48 82.57
VADER-Aesthetic and HPS 95.79 96.71 97.06 66.94 67.04 69.93 98.19 84.15

VBench evaluation results for VADER using the standard prompt suite. The base model is VideoCrafter2; we compare against Pika (2023-09), Gen-2 (2023-12), and T2V Turbo (VC2). The metrics used in VBench include: Subject Consistency (consistency of the main subject across frames, evaluated using DINO feature similarity), Background Consistency (using CLIP feature similarity), Motion Smoothness (fluidity of motion, based on motion priors from a frame interpolation model), Dynamic Degree (extent of motion in the video, estimated with RAFT), Aesthetic Quality (assessed via the LAION aesthetic predictor), Imaging Quality (using MUSIQ), and Temporal Flickering (mean absolute difference across frames). Following VBench, the Quality Score assigns a weight of 1 to every normalized metric except Dynamic Degree, which is weighted 0.5. We find that VADER-HPS surpasses all baselines in terms of Quality Score, Aesthetic Quality, and Dynamic Degree.

VADER-V-JEPA Evaluation

Model Subject Consistency Background Consistency Motion Smoothness Dynamic Degree Aesthetic Quality Imaging Quality
Stable Video Diffusion 0.9042 0.9469 0.9634 0.8333 0.6782 0.6228
VADER-V-JEPA 0.9401 0.9551 0.9669 0.8333 0.6807 0.6384

VBench evaluation results for image-to-video diffusion models. The base model is Stable Video Diffusion. We compare Stable Video Diffusion against VADER-V-JEPA. VADER-V-JEPA improves across most metrics, particularly subject and background consistency and imaging quality.

Truncated Backpropagation Ablation

Training Step Reward Value (K=1) Reward Value (K=10)
1 5.047 5.0946
100 5.3342 5.2523
200 5.4977 5.2072
300 5.6479 5.1906

From left to right, we show videos at training steps 1, 100, 200, and 300.

K=1
K=10

We ablate the number of truncated backpropagation steps (K) in VADER. For this experiment, we use VADER trained with the Aesthetic and HPS rewards; the base model is VideoCrafter2. We find that higher values of K produce more semantic-level changes, while K=1 produces more fine-grained changes, especially in the earlier steps of training. As training progresses, both models begin to exhibit semantic-level changes. We also find that it is easier to optimize with a smaller value of K, as seen in the reward values above.
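In terms of the sketch given after the abstract, this ablation corresponds only to changing the (hypothetical) `backprop_steps` argument, i.e., how many of the final denoising steps receive reward gradients:

```python
# K = 1: gradients flow only through the final denoising step
loss_k1 = vader_style_update(latents, text_emb, denoiser, decoder,
                             reward_model, optimizer, backprop_steps=1)

# K = 10: gradients flow through the last 10 denoising steps
loss_k10 = vader_style_update(latents, text_emb, denoiser, decoder,
                              reward_model, optimizer, backprop_steps=10)
```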

DOODL vs. VADER

We use the Aesthetic and HPS reward functions to optimize the models. The base model for VADER is VideoCrafter2.

DOODL (2 GPU minutes per sample) DOODL (20 GPU minutes per sample) VADER (12 GPU hours of training)
Reward 4.9583 4.9687 5.2810

VBench Distilled Reward Model

Model Subject Consistency Background Consistency Motion Smoothness Dynamic Degree Aesthetic Quality Imaging Quality Weighted Average
VideoCrafter2 0.9544 0.9652 0.9688 0.5346 0.5752 0.6677 0.7997
T2V Turbo (4 Steps) 0.9639 0.9656 0.9562 0.4771 0.6183 0.7266 0.8126
T2V Turbo (8 Steps) 0.9735 0.9736 0.9572 0.3686 0.6265 0.7168 0.8058
VADER-Aesthetic and HPS 0.9659 0.9713 0.9734 0.4741 0.6295 0.7145 0.8167
VADER-PickScore 0.9668 0.9727 0.9726 0.3732 0.6094 0.6762 0.7971
VADER-Aesthetic and ViCLIP 0.9564 0.9662 0.9714 0.5519 0.6008 0.6566 0.8050
VADER-VBench 0.9638 0.9678 0.9691 0.5361 0.6393 0.7231 0.8238

Training Plots of VBench Distilled Reward Model

Training Curve of Reward Model
Training Curve of Reward Model

Training curves for the VBench Distilled Reward Model. The plot on the left shows the average training and validation loss over epochs, showcasing the model's convergence behavior. The plot on the right shows the average validation accuracy per epoch, i.e., how well the model ranks the preferred video in each pair for every metric, including background consistency, dynamic degree, imaging quality, motion smoothness, aesthetic quality, PickScore, HPS, and subject consistency.
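The pairwise ranking objective described here can be written as a standard Bradley-Terry style loss (a generic sketch assuming the distilled reward model outputs a scalar score per video; not necessarily the exact training code):

```python
import torch.nn.functional as F

def pairwise_ranking_loss(score_preferred, score_rejected):
    """Bradley-Terry loss: push the preferred video's score above the rejected one's.

    Both inputs are tensors of shape (B,), the reward model's scalar scores for
    the preferred and rejected video of each pair under a given VBench metric.
    """
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```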

Reward Training Curve of VADER-VBench

Training Curve of Reward Model

The reward curve when training VADER using the VBench Distilled reward model.

More Videos from VADER are exhibited in Video Gallery.