
Alibaba releases VACE, an open-source AI video editor
Alibaba has introduced Wan2.1-VACE (Video All-in-one Creation and Editing), an open-source artificial intelligence model that integrates video generation and editing capabilities within a single multimodal framework.
Wan2.1-VACE is part of Alibaba's Wan2.1 series and is reported to be the first open-source model to unify video generation and editing across a variety of content creation tasks. The system accepts text, image and video inputs, allowing creators to turn different forms of media into video content quickly.
The model's editing tools include referencing still images or selected video frames, repainting video sequences, modifying chosen regions within a video, and extending videos in space and time, which together enable more flexible editing workflows. These functions apply across multiple sectors, including short-form social media content, advertising and marketing production, post-production effects for film and television, and educational training resources.
According to Alibaba, Wan2.1-VACE enables users to generate videos featuring specific interactions between subjects based on image samples. Static images can be converted into moving video sequences with realistic motion effects, providing creators with options for pose transfer, motion control, depth simulation, and recolouring.
The system's selective editing functions allow content to be added, altered or removed in designated regions of a video without affecting the rest of the frame, while the boundary extension feature can expand a video's spatial dimensions and automatically generate complementary content to fill the new area.
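For readers unfamiliar with mask-conditioned editing, the short Python sketch below illustrates the general idea: a binary mask marks which pixels a model may repaint, and boundary extension amounts to padding each frame and masking only the newly added border. The shapes and conventions here are illustrative assumptions, not VACE's documented input format.

```python
# Illustrative sketch only: how region masks and outpainting masks are
# commonly expressed for mask-conditioned video models. Shapes and
# conventions are assumptions, not VACE's documented format.
import numpy as np

T, H, W = 16, 480, 832  # frames, height, width (example values)

# Selective edit: mark a rectangular region as editable in every frame;
# pixels outside the mask are meant to be left untouched by the model.
edit_mask = np.zeros((T, H, W), dtype=np.uint8)
edit_mask[:, 100:300, 200:500] = 1

# Boundary extension (outpainting): pad each frame horizontally, then mark
# only the newly added borders as the region the model must fill in.
pad = 128
frames = np.zeros((T, H, W, 3), dtype=np.uint8)  # stand-in for real video
padded = np.pad(frames, ((0, 0), (0, 0), (pad, pad), (0, 0)))
outpaint_mask = np.ones((T, H, W + 2 * pad), dtype=np.uint8)
outpaint_mask[:, :, pad:pad + W] = 0  # original content is preserved
```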
"As an all-in-one AI model, Wan2.1-VACE delivers unparalleled versatility, enabling users to seamlessly combine multiple functions and unlock innovative potential. Users can turn a static image into video while controlling the movement of objects by specifying the motion trajectory. They can seamlessly replace characters or objects with specified references, animate referenced characters, control poses, and expand a vertical image horizontally to create a horizontal video while adding new elements through referencing," the company stated.
The technical architecture of Wan2.1-VACE is designed around several new concepts, including the Video Condition Unit (VCU), which serves as a unified interface supporting the processing of different input modalities—text, images, video footage, and masks. The model also incorporates a Context Adapter structure, using representations of time and space to inject task-specific information across a range of video editing and synthesis applications.
"Wan2.1-VACE leverages several innovative technologies, to take into account the needs of different video editing tasks during construction and design. Its unified interface, called Video Condition Unit (VCU), supports unified processing of multimodal inputs such as text, images, video, and masks," said the company. "The model employs a Context Adapter structure that injects various task concepts using formalised representations of temporal and spatial dimensions. This innovative design enables it to flexibly manage a wide range of video synthesis tasks."
Alibaba stated that advances in the model architecture support quick and efficient content creation for social media, marketing, and entertainment, as well as training and education. The company also highlighted the resource intensity of training video foundation models, noting, "Training video foundation models requires immense computing resources and vast amounts of high-quality training data. Open access helps lower the barrier for more businesses to leverage AI, enabling them to create high-quality visual content tailored to their needs, quickly and cost-effectively."
The Wan2.1-VACE model is being released in two sizes: a 14-billion-parameter version and a 1.3-billion-parameter version. Both will be available for free download on platforms such as Hugging Face, GitHub, and Alibaba Cloud's ModelScope open-source community.
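For those who want to experiment, weights published on Hugging Face can be fetched with the standard huggingface_hub client, as in the minimal sketch below. The repository identifier shown is an assumption based on the release naming; the Wan-AI organisation page on Hugging Face lists the exact names.

```python
# Minimal download sketch using the Hugging Face Hub client.
# The repo_id is an assumption based on the release naming; verify it
# against the Wan-AI organisation page on Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Wan-AI/Wan2.1-VACE-1.3B")
print(f"Model weights downloaded to: {local_dir}")
```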
Earlier this year, Alibaba released four other Wan2.1 models, followed in April by a model that generates video from user-supplied start and end frames. Collectively, these models have accumulated more than 3.3 million downloads on platforms including Hugging Face and ModelScope, highlighting growing interest in AI-driven video production tools.