This repository hosts the official implementation of ReImagine, a framework for controllable high-quality human video generation via image-first synthesis. For more context, see the paper on arXiv and the project website.
- April 23 2026: Updated the Image-First Synthesis demo.
- April 22 2026: Initial repository launch.
Stay tuned for further updates!
We develop and test with Python 3.10, PyTorch 2.4.1, and CUDA 12.4. Install the CUDA 12.4 PyTorch wheels, then install this package in editable mode:
```bash
conda create -n reimagine python=3.10
conda activate reimagine
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
pip install -e .
```
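A quick sanity check that the installed wheels match your driver (plain PyTorch, nothing repo-specific):

```python
import torch

# Expect 2.4.1, 12.4, and True on a correctly configured machine.
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())
```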
Download checkpoints with the Hugging Face Hub CLI (`hf download`, or `huggingface-cli download` on older installs). For FLUX.1-Kontext-dev, you can skip the monolithic `flux1-kontext-dev.safetensors` and the `vae/` tree:

```bash
hf download black-forest-labs/FLUX.1-Kontext-dev \
  --local-dir ./models/FLUX.1-Kontext-dev \
  --exclude "flux1-kontext-dev.safetensors" \
  --exclude "vae/**"
```

For ControlNet:
```bash
hf download jasperai/Flux.1-dev-Controlnet-Surface-Normals \
  --local-dir ./models/Flux.1-dev-Controlnet-Surface-Normals
```
ReImagine LoRA weight files are hosted on Hugging Face at taited/ReImagine-Pretrained.

| SMPL-X Params | Input Type | File | Status |
|---|---|---|---|
| w/o | Canonical human (front & back views) | kontext-wo_smplx-lora.safetensors | Available |
| w/o | Disentangled assets (face, clothes, shoes) | TBA | Planned |
Download: Use the same Hugging Face CLI as for the base models:
```bash
hf download taited/ReImagine-Pretrained --local-dir ./models/ReImagine-Pretrained
```

Once you have prepared the pretrained weights, use `inference_img.py` to infer each frame. The script requires two image inputs: a wide reference image (left = front, right = back) and a normal map. The normal map is rendered from the SMPL-X body in its global coordinate system, using the camera parameters.
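For reference, the wide image is simply the front and back views placed side by side. A minimal sketch with Pillow (the file names here are placeholders, not paths the repo ships):

```python
from PIL import Image

# Build the wide reference image expected by inference_img.py:
# front view on the left, back view on the right.
front = Image.open("front.png").convert("RGB")
back = Image.open("back.png").convert("RGB")

# Match heights before concatenating.
h = min(front.height, back.height)
front = front.resize((front.width * h // front.height, h))
back = back.resize((back.width * h // back.height, h))

wide = Image.new("RGB", (front.width + back.width, h))
wide.paste(front, (0, 0))
wide.paste(back, (front.width, 0))
wide.save("reference_wide.png")
```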
For more details on the usage of `inference_img.py`, check the full guide and example.
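As orientation, a purely hypothetical invocation showing the shape of a single-frame call (all flag names below are assumptions, not the script's actual interface; see the guide or `python inference_img.py --help` for the real one):

```bash
# Hypothetical flags, for illustration only -- consult the script's --help.
python inference_img.py \
  --reference ./inputs/reference_wide.png \
  --normal ./inputs/normal_0001.png \
  --lora ./models/ReImagine-Pretrained/kontext-wo_smplx-lora.safetensors \
  --output ./outputs/frame_0001.png
```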
The code for Temporal-Refinement Video Synthesis is currently being organized for open-source release. Once available, it will allow inference on video data with temporal refinement.
Stay tuned for updates!
- Code for Image-First Synthesis inference (`inference_img.py`)
- Pretrained LoRA weights (available for download)
- Documentation and usage instructions for basic inference
- Code for Temporal-Refinement Video Synthesis
- Pretrained model weights for Disentangled assets (face, clothes, shoes)
- Full dataset release
We are actively organizing and updating the repository. Updates will be added here as each item becomes available.
This repository’s implementation is based on DiffSynth Studio (ModelScope). We thank the authors and maintainers for releasing their work. The upstream project is licensed under the Apache License 2.0.
We also thank the teams behind FLUX.1-Kontext-dev and Flux.1-dev-Controlnet-Surface-Normals for the open-source releases this project builds on.
If you find this project useful, please consider citing our paper:
```bibtex
@article{sun2025rethinking,
  title={ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis},
  author={Sun, Zhengwentai and Zheng, Keru and Li, Chenghong and Liao, Hongjie and Yang, Xihe and Li, Heyuan and Zhi, Yihao and Ning, Shuliang and Cui, Shuguang and Han, Xiaoguang},
  journal={arXiv preprint arXiv:2604.19720},
  year={2026},
  url={https://arxiv.org/abs/2604.19720v1}
}
```