Removing books with the YOLOS object detection model.
Improve temporal consistency for Stable Video Diffusion, an image-to-video model.
Improve text-video alignment for VideoCrafter2.
Training curve for VideoCrafter-based VADER fine-tuned using Aesthetic and ViCLIP Reward.
Training efficiency comparison against various baselines over longer training. The base model used here is ModelScope. We compare VADER against DPO, DDPO, on-policy DPO, and on-policy DDPO. To implement the on-policy versions of the baselines, we simply reduce the UTD (update-to-data) ratio to 1, performing only a single gradient update for each sampled datapoint. We observe that VADER significantly outperforms all of them in terms of compute efficiency.
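The on-policy construction described above can be sketched as follows; `sample_batch` and `update_on` are hypothetical stand-ins for the actual video-sampling and gradient-update routines, not functions from the paper's codebase:

```python
# Sketch of the update-to-data (UTD) ratio used to build the on-policy
# baselines: UTD = number of gradient updates performed per sampled batch.
# With utd_ratio=1 each batch is used exactly once (on-policy); a larger
# ratio reuses stale samples for extra updates.

def train(sample_batch, update_on, num_iterations, utd_ratio=1):
    history = []
    for _ in range(num_iterations):
        batch = sample_batch()           # draw fresh samples from the model
        for _ in range(utd_ratio):       # gradient updates per batch
            history.append(update_on(batch))
    return history

# Toy usage: integers stand in for batches so reuse is visible.
updates = train(sample_batch=iter(range(100)).__next__,
                update_on=lambda b: b,
                num_iterations=3, utd_ratio=2)
# With utd_ratio=2, each of the 3 batches is updated on twice.
```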
Metric | VideoCrafter2 | VADER-PickScore Reward | VADER-Aesthetic and HPS Reward | VADER-Aesthetic and ViCLIP Reward |
---|---|---|---|---|
Average Variance | 0.0037 | 0.0026 | 0.0023 | 0.0031 |
Diversity of generated videos for VADER. The base model is VideoCrafter2. We generate 500 videos for each model and prompt combination, using 5 prompts for a total of 2500 videos per model. Diversity is calculated as the variance of VideoMAE latent-space embeddings across the 500 videos for each prompt, averaged over all prompts. We find that VADER variants exhibit reduced diversity compared to the baseline model VideoCrafter2. Prior works (Robert Kirk et al.; Sonia K. Murthy et al.) report similar findings: aligning a model for a specific use case often reduces diversity. Some visualizations are exhibited in the Diversity Gallery.
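The diversity metric described above can be sketched as follows, with random arrays standing in for the VideoMAE embeddings (the embedding dimension and prompt count here are illustrative):

```python
import numpy as np

def diversity_score(embeddings_by_prompt):
    """embeddings_by_prompt: list of (n_videos, dim) arrays, one per prompt.
    For each prompt, take the variance of the embeddings across videos
    (averaged over embedding dimensions), then average over prompts."""
    per_prompt = [np.var(e, axis=0).mean() for e in embeddings_by_prompt]
    return float(np.mean(per_prompt))

# Toy usage: 5 prompts, 500 stand-in embeddings each.
rng = np.random.default_rng(0)
embeddings = [rng.normal(size=(500, 16)) for _ in range(5)]
score = diversity_score(embeddings)  # near 1.0 for unit-variance noise
```

Identical videos (identical embeddings) would give a score of 0, so lower values indicate less diverse generations, matching the table above.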
Method | VRAM | System RAM | Total RAM |
---|---|---|---|
LoRA + Mixed Precision | 12.1 GB | 264.2 GB | 276.3 GB |
+ Subsampling Frames | 12.1 GB | 216.8 GB | 228.9 GB |
+ Truncated Backpropagation | 12.1 GB | 57.3 GB | 69.4 GB |
+ Gradient Checkpointing | 12.1 GB | 20.4 GB | 32.5 GB |
Ablation of memory usage for different components in ModelScope-based VADER. For this experiment, we offload memory to CPU main memory to prevent GPU out-of-memory errors. Starting with standard LoRA + Mixed Precision, each row represents an additional component (Subsampling Frames, Truncated Backpropagation, and Gradient Checkpointing) applied incrementally to the previous row. After applying all the steps, the total RAM saved is about 240 GB.
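As a minimal sketch of the frame-subsampling component, one can keep only a small random subset of frames before the reward forward pass, so far fewer activations must be stored for backpropagation. The frame count `k=4` and tensor shapes below are illustrative, not the paper's settings:

```python
import numpy as np

def subsample_frames(video, k, rng):
    """Keep k randomly chosen frames (in temporal order) before the
    reward model's forward pass. video: (num_frames, H, W, C) array.
    Memory for stored activations then scales with k, not num_frames."""
    idx = rng.choice(video.shape[0], size=k, replace=False)
    return video[np.sort(idx)]

rng = np.random.default_rng(0)
video = np.zeros((24, 8, 8, 3))          # toy 24-frame video
clip = subsample_frames(video, k=4, rng=rng)
```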
Model | HPS Score | PickScore Score | Aesthetic Score | ViCLIP Score |
---|---|---|---|---|
VideoCrafter2 | 0.2564 | 20.9231 | 5.2219 | 0.2643 |
VADER-Aesthetic and HPS | 0.2651 | 21.1345 | 5.7965 | 0.2622 |
VADER-PickScore | 0.2669 | 21.4911 | 5.5757 | 0.2640 |
VADER-Aesthetic and ViCLIP | 0.2511 | 20.8927 | 5.6241 | 0.2628 |
The base model is VideoCrafter2. In this table, we study how optimizing for specific reward functions via VADER affects scores on other reward functions, i.e., the correlation across reward models. We observe that the HPS score increases significantly after fine-tuning the base model with the PickScore reward, indicating a strong positive correlation between PickScore and HPS. In contrast, we find a strong negative correlation between the ViCLIP and Aesthetic reward functions.
Evaluated on VideoCrafter-based VADER.
Model | Temporal Coherence | Motion Quality |
---|---|---|
VideoCrafter2 | 55.90 | 52.89 |
T2V Turbo (4 Steps) | 57.10 | 54.93 |
T2V Turbo (8 Steps) | 57.05 | 55.34 |
VADER-Aesthetic and HPS | 59.65 | 55.46 |
VADER-PickScore | 60.75 | 54.65 |
VADER-Aesthetic and ViCLIP | 57.08 | 54.25 |
EvalCrafter evaluation results for VADER. EvalCrafter calculates Temporal Coherence using Warping Error, Semantic Consistency (cosine similarity of the embeddings of consecutive frames), and Face Consistency, which assess frame-wise pixel and semantic consistency. Motion Quality is evaluated through Action-Score (action classification accuracy), Flow-Score (average optical flow between frames obtained from RAFT), and Motion AC-Score (amplitude classification consistency with the text prompt). We generate 700 videos from each model for this comparison. The results demonstrate that all VADER variants outperform the base model (VideoCrafter2).
Model | Subject Consistency | Background Consistency | Motion Smoothness | Dynamic Degree | Aesthetic Quality | Imaging Quality | Weighted Average |
---|---|---|---|---|---|---|---|
VideoCrafter2 | 0.9544 | 0.9652 | 0.9688 | 0.5346 | 0.5752 | 0.6677 | 0.7997 |
T2V Turbo (4 Steps) | 0.9639 | 0.9656 | 0.9562 | 0.4771 | 0.6183 | 0.7266 | 0.8126 |
T2V Turbo (8 Steps) | 0.9735 | 0.9736 | 0.9572 | 0.3686 | 0.6265 | 0.7168 | 0.8058 |
VADER-Aesthetic and HPS | 0.9659 | 0.9713 | 0.9734 | 0.4741 | 0.6295 | 0.7145 | 0.8167 |
VADER-PickScore | 0.9668 | 0.9727 | 0.9726 | 0.3732 | 0.6094 | 0.6762 | 0.7971 |
VADER-Aesthetic and ViCLIP | 0.9564 | 0.9662 | 0.9714 | 0.5519 | 0.6008 | 0.6566 | 0.8050 |
VBench evaluation results for VADER using EvalCrafter prompts. The base model is VideoCrafter2. The metrics used in VBench include: Subject Consistency (consistency of the main subject across frames, evaluated using DINO feature similarity), Background Consistency (using CLIP feature similarity), Motion Smoothness (fluidity of motion, based on motion priors from a frame interpolation model), Dynamic Degree (extent of motion in the video, estimated with RAFT), Aesthetic Quality (assessed via the LAION aesthetic predictor), and Imaging Quality (using MUSIQ). The weighted average is calculated by assigning weights to each metric: all metrics receive a weight of 1, except the Dynamic Degree metric, which is assigned a weight of 0.5. We generate 700 videos for each model using EvalCrafter prompts that are not seen during training. We find that VADER-PickScore has the best consistency scores among the VADER variants, while VADER-Aesthetic and HPS shows the best aesthetic and imaging quality and performs the best overall.
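The weighted average in this caption can be reproduced with a short helper; the 0.5 weight on Dynamic Degree follows the caption, while the function itself is an illustrative reconstruction:

```python
def vbench_weighted_average(metrics, dynamic_degree_weight=0.5):
    """metrics: dict mapping each VBench metric name to a score in [0, 1].
    All metrics get weight 1 except Dynamic Degree (weight 0.5)."""
    weights = {name: 1.0 for name in metrics}
    weights["Dynamic Degree"] = dynamic_degree_weight
    total = sum(weights[name] * value for name, value in metrics.items())
    return total / sum(weights.values())

# Plugging in the VideoCrafter2 row from the table reproduces its
# weighted average of roughly 0.7997.
vc2 = {
    "Subject Consistency": 0.9544,
    "Background Consistency": 0.9652,
    "Motion Smoothness": 0.9688,
    "Dynamic Degree": 0.5346,
    "Aesthetic Quality": 0.5752,
    "Imaging Quality": 0.6677,
}
avg = vbench_weighted_average(vc2)  # ≈ 0.7997
```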
Model | Subject Consistency | Background Consistency | Motion Smoothness | Dynamic Degree | Aesthetic Quality | Imaging Quality | Temporal Flickering | Quality Score |
---|---|---|---|---|---|---|---|---|
VideoCrafter2 | 96.85 | 98.22 | 97.73 | 42.50 | 63.13 | 67.22 | 98.41 | 82.20 |
Pika | 96.76 | 98.95 | 99.51 | 37.22 | 63.15 | 62.33 | 99.77 | 82.68 |
Gen-2 | 97.61 | 97.61 | 99.58 | 18.89 | 66.96 | 67.42 | 99.56 | 82.47 |
T2V Turbo (VC2) | 96.28 | 97.02 | 97.34 | 49.17 | 63.04 | 72.49 | 97.48 | 82.57 |
VADER-Aesthetic and HPS | 95.79 | 96.71 | 97.06 | 66.94 | 67.04 | 69.93 | 98.19 | 84.15 |
VBench evaluation results for VADER using the standard prompt suite. The base model is VideoCrafter2; the baselines are Pika (2023-9), Gen-2 (2023-12), and T2V Turbo (VC2). The metrics used in VBench include: Subject Consistency (consistency of the main subject across frames, evaluated using DINO feature similarity), Background Consistency (using CLIP feature similarity), Motion Smoothness (fluidity of motion, based on motion priors from a frame interpolation model), Dynamic Degree (extent of motion in the video, estimated with RAFT), Aesthetic Quality (assessed via the LAION aesthetic predictor), Imaging Quality (using MUSIQ), and Temporal Flickering (mean absolute difference across frames). The Quality Score is calculated by assigning weights to each normalized metric: all metrics receive a weight of 1, except the Dynamic Degree metric, which is assigned a weight of 0.5, following the VBench metrics. We find that VADER-Aesthetic and HPS surpasses all baselines in terms of Quality Score, Aesthetic Quality, and Dynamic Degree.
Model | Subject Consistency | Background Consistency | Motion Smoothness | Dynamic Degree | Aesthetic Quality | Imaging Quality |
---|---|---|---|---|---|---|
Stable Video Diffusion | 0.9042 | 0.9469 | 0.9634 | 0.8333 | 0.6782 | 0.6228 |
VADER-V-JEPA | 0.9401 | 0.9551 | 0.9669 | 0.8333 | 0.6807 | 0.6384 |
VBench evaluation results for image-to-video diffusion models. The base model is Stable Video Diffusion. We compare Stable Video Diffusion against VADER-V-JEPA, which demonstrates improvements across most metrics, particularly in consistency and aesthetic quality.
Training Step | Reward Value (k=1) | Reward Value (k=10) |
---|---|---|
1 | 5.047 | 5.0946 |
100 | 5.3342 | 5.2523 |
200 | 5.4977 | 5.2072 |
300 | 5.6479 | 5.1906 |
From left to right, we show videos at training steps 1, 100, 200, and 300.
We ablate the number of truncated backpropagation steps (k) in VADER. For this experiment, we use VADER trained with the Aesthetic and HPS rewards; the base model is VideoCrafter2. We find that higher values of k result in more semantic-level changes, while k=1 yields more fine-grained changes, especially in the earlier steps of training. As training progresses, both models begin to exhibit semantic-level changes. We also find it is easier to optimize with a smaller value of k, as the reward values in the table show.
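A toy sketch of truncated backpropagation: gradients flow only through the last k of T sampling steps, so activation memory scales with k rather than T. The scalar "denoising" below is a stand-in; a real implementation would wrap the first T−k steps in `torch.no_grad()`:

```python
def sample(T, k):
    """Run T denoising steps; only the last k are inside the
    differentiable region whose activations must be stored."""
    x, grad_steps = 0, 0
    for t in range(T):
        in_grad_region = t >= T - k   # only the final k steps
        x += 1                        # stand-in for one denoising step
        grad_steps += int(in_grad_region)
    return x, grad_steps

# 25 sampling steps, backprop truncated to the last 10:
x, grad_steps = sample(T=25, k=10)
# grad_steps == 10, independent of T
```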
We use the Aesthetic and HPS reward functions to optimize the models. The base model of VADER is VideoCrafter2.
Method | DOODL (2 GPU minutes per sample) | DOODL (20 GPU minutes per sample) | VADER (12 GPU hours of training) |
---|---|---|---|
Reward | 4.9583 | 4.9687 | 5.2810 |
Model | Subject Consistency | Background Consistency | Motion Smoothness | Dynamic Degree | Aesthetic Quality | Imaging Quality | Weighted Average |
---|---|---|---|---|---|---|---|
VideoCrafter2 | 0.9544 | 0.9652 | 0.9688 | 0.5346 | 0.5752 | 0.6677 | 0.7997 |
T2V Turbo (4 Steps) | 0.9639 | 0.9656 | 0.9562 | 0.4771 | 0.6183 | 0.7266 | 0.8126 |
T2V Turbo (8 Steps) | 0.9735 | 0.9736 | 0.9572 | 0.3686 | 0.6265 | 0.7168 | 0.8058 |
VADER-Aesthetic and HPS | 0.9659 | 0.9713 | 0.9734 | 0.4741 | 0.6295 | 0.7145 | 0.8167 |
VADER-PickScore | 0.9668 | 0.9727 | 0.9726 | 0.3732 | 0.6094 | 0.6762 | 0.7971 |
VADER-Aesthetic and ViCLIP | 0.9564 | 0.9662 | 0.9714 | 0.5519 | 0.6008 | 0.6566 | 0.8050 |
VADER-VBench | 0.9638 | 0.9678 | 0.9691 | 0.5361 | 0.6393 | 0.7231 | 0.8238 |
Training Curves for the VBench Distilled Reward Model. The plot on the left shows the average training and validation loss curves over epochs, showcasing the model's convergence behavior. The figure on the right depicts the average validation accuracy per epoch, highlighting the model's performance in ranking the preference of video pairs for each metric, including background consistency, dynamic degree, image quality, motion smoothness, aesthetic quality, PickScore value, HPS score, and subject consistency.
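The pairwise preference ranking described above could be trained with a standard logistic (Bradley-Terry style) loss; the exact objective used for the distilled model is not specified here, so this is one common choice rather than the paper's implementation:

```python
import math

def pairwise_ranking_loss(r_preferred, r_other):
    """Logistic pairwise ranking loss: -log sigmoid(r_preferred - r_other).
    Low when the reward model scores the preferred video higher."""
    return math.log(1.0 + math.exp(-(r_preferred - r_other)))

loss_correct = pairwise_ranking_loss(2.0, 1.0)  # preferred scored higher
loss_wrong = pairwise_ranking_loss(1.0, 2.0)    # preferred scored lower
```

Validation accuracy in the curves would then correspond to the fraction of pairs where the preferred video receives the higher score.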
The reward curve when training VADER using the VBench Distilled reward model.
More videos from VADER are exhibited in the Video Gallery.