A text-to-video model is a form of generative artificial intelligence that uses a natural language description as input to produce a video relevant to the input text.[1] Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video diffusion models.[2]
Several text-to-video models exist, including open-source models. CogVideo, which accepts Chinese-language input,[3] is the earliest text-to-video model "of 9.4 billion parameters" to be developed, with a demo version of its open-source code first presented on GitHub in 2022.[4] That year, Meta Platforms released a partial text-to-video model called "Make-A-Video",[5][6][7] and Google Brain (later Google DeepMind) introduced Imagen Video, a text-to-video model with a 3D U-Net.[8][6][9][10][11]
Beginning in February 2023, Runway released Gen-1 and Gen-2, among the first commercially available text-to-video and video-to-video models accessible to the public through a web interface. Gen-1, initially released as a video-to-video model, allowed users to transform existing video footage using text or image prompts.[12] Gen-2, introduced in March 2023 and made publicly available in June 2023, added text-to-video capabilities, enabling users to generate videos from text prompts alone.[13]
In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation.[14] The VideoFusion model decomposes the per-frame noise of the diffusion process into two components: a base noise that is shared across frames to ensure temporal coherence, and a residual noise that varies from frame to frame. By using a pre-trained image diffusion model as a base generator, the model efficiently generated high-quality and coherent videos. Fine-tuning the pre-trained model on video data addressed the domain gap between image and video data, enhancing the model's ability to produce realistic and consistent video sequences.[15] In the same month, Adobe introduced Firefly AI as part of its features.[16]
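The idea of splitting noise into a shared part and a per-frame part can be illustrated with a toy sketch. It assumes that each frame's noise is a variance-preserving mix of one base vector common to all frames and one residual vector drawn independently per frame. The function names, the `share` weight, and the use of plain Python lists are illustrative assumptions, not the paper's code.

```python
import math
import random


def decomposed_noise(num_frames, dim, share=0.5, seed=0):
    """Return (base, frames) where each frame's noise mixes a shared base
    with an independent residual.

    Assumed form: noise_i = sqrt(share) * base + sqrt(1 - share) * residual_i,
    which keeps unit variance when base and residual are standard normal.
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must lie in [0, 1]")
    rng = random.Random(seed)
    base = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # shared by every frame
    w_base, w_res = math.sqrt(share), math.sqrt(1.0 - share)
    frames = []
    for _ in range(num_frames):
        residual = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # frame-specific
        frames.append([w_base * b + w_res * r for b, r in zip(base, residual)])
    return base, frames


def correlation(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    _, frames = decomposed_noise(num_frames=3, dim=5000, share=0.5, seed=2)
    # Under these assumptions the correlation between two frames' noise
    # is roughly equal to `share`.
    print(round(correlation(frames[0], frames[1]), 2))
```

In this sketch a larger `share` makes the frames' noise more alike, which is one simple way to picture how a shared component can encourage frame-to-frame coherence.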
In January 2024, Google announced development of a text-to-video model named Lumiere, which is anticipated to integrate advanced video editing capabilities.[17] Matthias Niessner and Lourdes Agapito at AI company Synthesia work on developing 3D neural rendering techniques that can synthesise realistic video by using 2D and 3D neural representations of shape, appearance, and motion for controllable video synthesis of avatars.[18] In June 2024, Luma Labs launched its Dream Machine video tool.[19][20] That same month,[21] Kuaishou extended its Kling AI text-to-video model to international users. In July 2024, TikTok owner ByteDance released Jimeng AI in China through its subsidiary, Faceu Technology.[22] In September 2024, the Chinese AI company MiniMax debuted its video-01 model, joining other established AI model companies like Zhipu AI, Baichuan, and Moonshot AI, which contribute to China's involvement in AI technology.[23] In December 2024, Lightricks launched LTX Video as an open-source model.[24]
Alternative approaches to text-to-video models include[25] Google's Phenaki, Hour One, Colossyan,[3] Runway's Gen-3 Alpha,[26][27] and OpenAI's Sora.[28][29] Several additional text-to-video models, such as Plug-and-Play, Text2LIVE, and TuneAVideo, have emerged.[30] FLUX.1 developer Black Forest Labs has announced its text-to-video model SOTA.[31] Google was preparing to launch a video generation tool named Veo for YouTube Shorts in 2025.[32] In May 2025, Google launched the Veo 3 iteration of the model. It was noted for its audio generation capabilities; the lack of audio had been a limitation of earlier text-to-video models.[33] In July 2025, Lightricks released an update to LTX Video capable of generating clips of up to 60 seconds,[34][35] and in October 2025 it released LTX-2, with audio capabilities built in.[36]
Several architectures have been used to create text-to-video models. Similar to text-to-image models, these models can be trained using recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks, which have been used for Pixel Transformation Models and Stochastic Video Generation Models, which aid in consistency and realism respectively.[37] Transformer models are an alternative. Generative adversarial networks (GANs), variational autoencoders (VAEs), which can aid in the prediction of human motion,[38] and diffusion models have also been used to develop the image generation aspects of the model.[39]
Text-video datasets used to train models include, but are not limited to, WebVid-10M, HDVILA-100M, CCV, ActivityNet, and Panda-70M.[40][41] These datasets contain millions of original videos of interest, generated videos, captioned videos, and textual information that help train models for accuracy. Text datasets used to train models include, but are not limited to, PromptSource, DiffusionDB, and VidProM.[40][41] These datasets provide the range of text inputs needed to teach models how to interpret a variety of textual prompts.
The video generation process involves synchronizing the text inputs with video frames, ensuring alignment and consistency throughout the sequence. This predictive process is subject to a decline in quality as the length of the video increases, due to resource limitations.[41] The Will Smith Eating Spaghetti test is a benchmark for models.[42]
Despite the rapid improvement in the performance of text-to-video models, a primary limitation is that they are computationally heavy, which limits their capacity to produce high-quality and lengthy outputs.[43][44] Additionally, these models require a large amount of specific training data to generate high-quality and coherent outputs, which raises the issue of accessibility.[44][43]
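One rough way to see why longer clips are costly is to count how work grows with frame count. The sketch below assumes a transformer that splits each frame into square patches and applies full self-attention over all space-time patches; this architecture, the patch size of 16, and the function name are illustrative assumptions, and real systems often use compression or factorized attention to reduce the cost.

```python
def attention_cost(num_frames, height, width, patch=16):
    """Return (tokens, pairwise_interactions) for full space-time attention.

    Assumption: every patch token attends to every other token, so the
    number of interactions is the square of the token count.
    """
    tokens_per_frame = (height // patch) * (width // patch)
    tokens = num_frames * tokens_per_frame
    return tokens, tokens * tokens


if __name__ == "__main__":
    for frames in (1, 16, 64):
        tokens, pairs = attention_cost(frames, height=256, width=256)
        print(f"{frames:>3} frames: {tokens} tokens, {pairs} attention pairs")
```

Under these assumptions, multiplying the number of frames by 16 multiplies the attention work by 256, so doubling clip length roughly quadruples this part of the computation.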
Moreover, models may misinterpret textual prompts, resulting in video outputs that deviate from the intended meaning. This can occur due to limitations in capturing semantic context embedded in text, which affects the model's ability to align generated video with the user's intended message.[44][41] Various models, including Make-A-Video, Imagen Video, Phenaki, CogVideo, GODIVA, and NUWA, are currently being tested and refined to enhance their alignment capabilities and overall performance in text-to-video generation.[44]
Another issue with the outputs is that text or fine details in AI-generated videos often appear garbled, a problem that Stable Diffusion models also struggle with. Examples include distorted hands and unreadable text.
The deployment of text-to-video models raises ethical considerations related to content generation. These models have the potential to create inappropriate or unauthorized content, including explicit material, graphic violence, misinformation, and likenesses of real individuals without consent.[40] Ensuring that AI-generated content complies with established standards for safe and ethical usage is essential, as content generated by these models may not always be easily identified as harmful or misleading. The ability of AI to recognize and filter out NSFW or copyrighted content remains an ongoing challenge, with implications for both creators and audiences.[40]
Text-to-video models offer a broad range of applications that may benefit various fields, from educational and promotional to creative industries. These models can streamline content creation for training videos, movie previews, gaming assets, and visualizations, making it easier to generate content.[45]
During the Russo-Ukrainian war, fake videos made with artificial intelligence were created as part of a propaganda war against Ukraine and shared on social media. These included depictions of children in the Ukrainian Armed Forces, fake ads targeting children and encouraging them to denounce critics of the Ukrainian government, and fictitious statements by Ukrainian President Volodymyr Zelenskyy about the country's surrender, among others.[46][47][48][49][50][51]
Kaur vs Kore is the first Indian feature film made using generative AI; it features a dual role for the AI character of Sunny Leone and is set for release in 2026.[52][53][54]
Chiranjeevi Hanuman – The Eternal is an Indian movie made entirely using generative AI, created by Vijay Subramaniam and set for theatrical release in 2026. The movie was widely criticised by filmmakers in the Bollywood industry for relying entirely on AI, and the use of AI was seen as an existential threat to their careers.[55][56][57]
Mahabharat: Ek Dharmayudh is an Indian mythological OTT series released in October 2025 and streamed on JioHotstar. It is recognized as the first series created entirely using artificial intelligence to generate visuals and character animations, and consists of 100 episodes.[58][59][60]
| Model/Product | Company | Year released | Status | Key features | Capabilities | Pricing | Video length | Supported languages |
|---|---|---|---|---|---|---|---|---|
| Synthesia | Synthesia | 2019 | Released | AI avatars, multilingual support for 60+ languages, customization options[61] | Specialized in realistic AI avatars for corporate training and marketing[61] | Subscription-based, starting around $30/month | Varies based on subscription | 60+ |
| Vexub | Vexub | 2023 | Released | Text-to-video from prompt, focus on TikTok and YouTube storytelling formats for social media[62] | Generates AI videos (1–15 mins) from text prompts; includes editing and voice features[62] | Subscription-based, with various plans | Up to ~15 minutes | 70+ |
| InVideo AI | InVideo | 2021 | Released | AI-powered video creation, large stock library, AI talking avatars[61] | Tailored for social media content with platform-specific templates[61] | Free plan available, Paid plans starting at $16/month | Varies depending on content type | Multiple (not specified) |
| Fliki | Fliki AI | 2022 | Released | Text-to-video with AI avatars and voices, extensive language and voice support[61] | Supports 65+ AI avatars and 2,000+ voices in 70 languages[61] | Free plan available, Paid plans starting at $30/month | Varies based on subscription | 70+ |
| Runway Gen-2 | Runway AI | 2023 | Released | Multimodal video generation from text, images, or videos[63] | High-quality visuals, various modes like stylization and storyboard[63] | Free trial, Paid plans (details not specified) | Up to 16 seconds | Multiple (not specified) |
| Pika Labs | Pika Labs | 2024 | Beta | Dynamic video generation, camera and motion customization[64] | User-friendly, focused on natural dynamic generation[64] | Currently free during beta | Flexible, supports longer videos with frame continuation | Multiple (not specified) |
| Runway Gen-3 Alpha | Runway AI | 2024 | Alpha | Enhanced visual fidelity, photorealistic humans, fine-grained temporal control[65] | Ultra-realistic video generation with precise key-framing and industry-level customization[65] | Free trial available, custom pricing for enterprises | Up to 10 seconds per clip, extendable | Multiple (not specified) |
| Google Veo | Google | 2024 | Released | Google Gemini prompting, voice acting, sound effects, background music. Cinema style realistic videos.[66] | Can generate very realistic and detailed character models/scenes/clips, with accommodating and matching voice acting, ambient sounds, and background music. Ability to extend clips with continuity.[67] | Varies ($250 Google Pro/Ultra AI subscription, and additional AI credit Top-Ups) | Eight seconds for individual clips (however clips can be continued/extended as separate clips) | 50+ |
| OpenAI Sora | OpenAI | 2024 | Alpha | Deep language understanding, high-quality cinematic visuals, multi-shot videos[68] | Capable of creating detailed, dynamic, and emotionally expressive videos; still under development with safety measures[68] | Pricing not yet disclosed | Expected to generate longer videos; duration specifics TBD | Multiple (not specified) |
| Runway Gen-4 | Runway | 2025 | Released | Consistent characters across scenes,[69] world consistency,[70] camera control, physics simulation | Generates 5-10 second clips with consistent characters, objects, and environments across multiple shots[71] | Credit-based subscription, part of paid plans | 5-10 seconds | Multiple (not specified) |
Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.