Semantic Frame Interpolation

1 Shanghai Jiao Tong University
2 Shanghai Innovation Institute
3 Zhejiang University
4 Tencent YouTu Lab

Highlights

  • We formally define Semantic Frame Interpolation (SFI) as a novel generative in-betweening paradigm that enables customized intermediate content generation under more challenging control conditions, significantly expanding the diversity of producible outputs.
  • We propose SemFi, a novel framework for SFI, introducing Mixture-of-LoRA for adaptive generation. It achieves better performance on arbitrary frame counts while maintaining precise control.
  • We introduce SFI-300K, the first large-scale dataset for SFI, featuring 300k diverse clips with rich annotations. The accompanying SFIBench provides standardized evaluation across fidelity, coherence, and instruction adherence at varying generation lengths.
    Abstract

    Generating intermediate video content of varying lengths from given first and last frames and a text prompt offers significant research and application potential. However, traditional frame interpolation primarily targets scenarios with few frames, no text control, and minimal differences between the first and last frames. Community developers have recently adapted large video models, represented by Wan, for generation conditioned on the first and last frames. However, these models can only generate a fixed number of frames and often produce unsatisfactory results at certain frame lengths, and the setting itself lacks a formal definition and a well-established benchmark. In this paper, we first formally define a new, practical task, Semantic Frame Interpolation (SFI), which covers both of the above settings and supports inference at multiple frame rates. To this end, we propose SemFi, a novel model built on Wan2.1 that incorporates a Mixture-of-LoRA module to generate highly consistent content that follows the control conditions across different frame-length constraints. Furthermore, we propose SFI-300K, the first general-purpose dataset and benchmark designed specifically for SFI. We collect and process data with the SFI task in mind and carefully design evaluation metrics and methods that assess model performance at both the image and video levels, across aspects including consistency and diversity. Extensive experiments on SFI-300K demonstrate that our method is particularly well suited to the requirements of the SFI task.

    Overview of SemFi

    SemFi employs a Mixture-of-LoRA module to dynamically activate the most suitable LoRA parameters for the target frame count, enabling high-quality semantic frame interpolation across diverse frame generation requirements.
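
To make the idea of activating LoRA parameters by frame count concrete, here is a minimal, illustrative sketch in plain Python. It is not the SemFi implementation: the routing rule (a softmax over the distance between the target frame count and a per-expert anchor frame count), the anchor values, the temperature, and every class and function name are assumptions introduced only for this example. It shows a frozen linear layer whose output is adjusted by a gated sum of low-rank updates, with the gates depending on the requested number of frames.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def frame_count_gates(num_frames, anchors, temperature=8.0):
    """Hypothetical routing rule: softmax over the negative distance between
    the target frame count and each expert's anchor frame count."""
    logits = [-abs(num_frames - a) / temperature for a in anchors]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class MixtureOfLoRALinear:
    """Frozen linear layer plus a gated sum of low-rank (LoRA) updates."""

    def __init__(self, d_in, d_out, rank, anchors, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.anchors = list(anchors)  # assumed: one anchor frame count per expert
        self.scale = alpha / rank     # usual LoRA scaling factor
        # Frozen base weight of shape (d_out, d_in).
        self.weight = [[rng.gauss(0, 0.1) for _ in range(d_in)]
                       for _ in range(d_out)]
        # One (A, B) pair per expert, A: (rank, d_in), B: (d_out, rank).
        # B starts at zero, as in standard LoRA, so the layer initially
        # reproduces the frozen base layer exactly.
        self.experts = [
            (
                [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)],
                [[0.0] * rank for _ in range(d_out)],
            )
            for _ in self.anchors
        ]

    def forward(self, x, num_frames):
        """Compute W x + sum_k gate_k * scale * B_k (A_k x)."""
        out = matvec(self.weight, x)
        gates = frame_count_gates(num_frames, self.anchors)
        for gate, (a, b) in zip(gates, self.experts):
            delta = matvec(b, matvec(a, x))
            out = [o + gate * self.scale * d for o, d in zip(out, delta)]
        return out
```

With anchors such as `[17, 33, 81]` (illustrative values only), calling `forward(x, num_frames=33)` weights the second expert most heavily, while `num_frames=81` shifts almost all of the weight to the third. A soft gate is used here so that frame counts between anchors blend neighboring experts; a hard top-1 selection would be an equally plausible reading of "activate the most suitable LoRA parameters".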

    Intermediate Frame Generation Across Variable Frame Counts

    [Animated examples: GIF 1 to GIF 6]

    Semantic-Controlled Frame Interpolation

    [Animated examples: GIF 1 and GIF 2]

    Acknowledgments

    Our project benefits from several amazing open-source projects. We would like to thank the authors of these projects for their contributions to the community.

    BibTeX

    @misc{hong2025semanticframeinterpolation,
          title={Semantic Frame Interpolation}, 
          author={Yijia Hong and Jiangning Zhang and Ran Yi and Yuji Wang and Weijian Cao and Xiaobin Hu and Zhucun Xue and Yabiao Wang and Chengjie Wang and Lizhuang Ma},
          year={2025},
          eprint={2507.05173},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2507.05173}, 
    }