Text-to-image model

From Wikipedia, the free encyclopedia

An image conditioned on the prompt "an astronaut riding a horse, by Hiroshige", generated by Stable Diffusion 3.5, a large-scale text-to-image model first released in 2022

A text-to-image model is a machine learning model which takes an input natural language prompt and produces an image matching that description.

Text-to-image models began to be developed in the mid-2010s during the beginnings of the AI boom, as a result of advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models—such as OpenAI's DALL-E 2, Google Brain's Imagen, Stability AI's Stable Diffusion, and Midjourney—began to be considered to approach the quality of real photographs and human-drawn art.

Text-to-image models are generally latent diffusion models, which combine a language model, which transforms the input text into a latent representation, and a generative image model, which produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web.[1]
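
As an illustrative sketch of this prompt-to-image pipeline (not the procedure of any particular model), the following Python snippet runs a pretrained latent diffusion pipeline with the open-source Hugging Face diffusers library; the checkpoint name and sampling parameters are example choices rather than requirements:

```python
# Sketch: generate an image from a text prompt with a pretrained latent
# diffusion pipeline (Hugging Face diffusers). The checkpoint and settings
# below are example choices, not the only way to run such a model.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # example public checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")                  # assumes a CUDA-capable GPU

# The text encoder maps the prompt to a latent representation; the diffusion
# model then generates an image conditioned on that representation.
image = pipe(
    "an astronaut riding a horse, in the style of Hiroshige",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("astronaut_horse.png")
```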

History


Before the rise of deep learning, attempts to build text-to-image models were limited to collages by arranging existing component images, such as from a database of clip art.[2][3]

The inverse task, image captioning, was more tractable, and a number of image captioning deep learning models came prior to the first text-to-image models.[4]

"A stop sign is flying in blue skies" by alignDRAW (2015)[5]
"A stop sign is flying in blue skies" by OpenAI's DALL-E 2 (2022), DALL-E 3 (2023), and GPT Image 1 (2025)

The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. alignDRAW extended the previously-introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences.[4] Images generated by alignDRAW were in small resolution (32×32 pixels, attained from resizing) and were considered to be 'low in diversity'. The model was able to generalize to objects not represented in the training data (such as a red school bus) and appropriately handled novel prompts such as "a stop sign is flying in blue skies", indicating that it was not merely "memorizing" data from the training set.[4][6]

In 2016, Reed, Akata, Yan et al. became the first to use generative adversarial networks for the text-to-image task.[6][7] With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like "an all black bird with a distinct thick, rounded bill". A model trained on the more diverse COCO (Common Objects in Context) dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details.[6] Later systems include VQGAN-CLIP,[8] XMC-GAN, and GauGAN2.[9]

One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a transformer system announced in January 2021.[10] A successor capable of generating more complex and realistic images, DALL-E 2, was unveiled in April 2022,[11] followed by Stable Diffusion, which was publicly released in August 2022.[12] Also in August 2022, text-to-image personalization was introduced, which allows the model to be taught a new concept using a small set of images of an object that was not included in the training set of the text-to-image foundation model. This is achieved by textual inversion, namely finding a new text term that corresponds to these images.
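
As a rough illustration of the idea behind textual inversion (a toy sketch, not the published training code), only a single new token embedding is optimized while the pretrained text encoder and image generator remain frozen; the two nn.Linear modules below are hypothetical stand-ins for a real pretrained encoder and diffusion model:

```python
# Toy sketch of textual inversion: learn one new token embedding against a
# frozen model. The linear layers are stand-ins for a real pretrained text
# encoder and image generator; the loss is a stand-in for the diffusion loss.
import torch
import torch.nn as nn

emb_dim = 64
text_encoder = nn.Linear(emb_dim, emb_dim)        # pretend pretrained, frozen
image_decoder = nn.Linear(emb_dim, 3 * 32 * 32)   # pretend pretrained, frozen
for p in list(text_encoder.parameters()) + list(image_decoder.parameters()):
    p.requires_grad_(False)

# The only trainable parameter: the embedding of a new pseudo-token such as "<my-object>".
new_token_embedding = nn.Parameter(torch.randn(emb_dim) * 0.02)
optimizer = torch.optim.Adam([new_token_embedding], lr=1e-3)

reference_images = torch.randn(4, 3 * 32 * 32)    # a few example images of the new concept

for step in range(200):
    # Condition the frozen generator on the learned embedding and push its
    # output toward the reference images.
    cond = text_encoder(new_token_embedding)
    generated = image_decoder(cond).expand_as(reference_images)
    loss = ((generated - reference_images) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```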

Following other text-to-image models, language-model-powered text-to-video platforms such as Runway, Make-A-Video,[13] Imagen Video,[14] Midjourney,[15] and Phenaki[16] can generate video from text prompts or from combined text and image prompts.[17]

Architecture and training

High-level architecture showing the state of AI art machine learning models, and notable models and applications

Text-to-image models have been built using a variety of architectures. The text encoding step may be performed with a recurrent neural network such as a long short-term memory (LSTM) network, though transformer models have since become a more popular option. For the image generation step, conditional generative adversarial networks (GANs) have been commonly used, with diffusion models also becoming a popular option in recent years. Rather than directly training a model to output a high-resolution image conditioned on a text embedding, a popular technique is to train a model to generate low-resolution images and use one or more auxiliary deep learning models to upscale them, filling in finer details.
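
As a hedged sketch of this low-resolution-then-upscale approach, the following Python snippet chains a base text-to-image pipeline with a separate diffusion super-resolution model from the open-source diffusers library; the checkpoint names and image sizes are example choices rather than part of any particular published system:

```python
# Illustrative two-stage cascade: a base model generates a small image and a
# separate super-resolution diffusion model upscales it, filling in details.
# The checkpoints below are examples of publicly available models.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionUpscalePipeline

base = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

prompt = "a watercolor painting of a lighthouse at dusk"
low_res = base(prompt, height=128, width=128).images[0]      # coarse image
high_res = upscaler(prompt=prompt, image=low_res).images[0]  # 4x upscaled
high_res.save("lighthouse.png")
```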

Text-to-image models are trained on large datasets of (text, image) pairs, often scraped from the web. With their 2022 Imagen model, Google Brain reported positive results from using a large language model trained separately on a text-only corpus (with its weights subsequently frozen), a departure from the theretofore standard approach.[18]
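
A hedged sketch of that idea, assuming the Hugging Face transformers library and the small public t5-small checkpoint as a stand-in for Imagen's much larger encoder: the text-only pretrained encoder is frozen and used purely to produce the conditioning embeddings for the image model.

```python
# Sketch: use a frozen, separately pretrained text encoder to produce the
# embeddings that condition an image generator. "t5-small" is a small public
# stand-in; Imagen itself used a far larger T5 variant.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
text_encoder = T5EncoderModel.from_pretrained("t5-small")
text_encoder.requires_grad_(False)   # weights stay frozen during image-model training
text_encoder.eval()

with torch.no_grad():
    tokens = tokenizer(["an astronaut riding a horse"], return_tensors="pt")
    text_embeddings = text_encoder(**tokens).last_hidden_state  # (batch, seq_len, dim)

# `text_embeddings` would then be fed to the trainable diffusion model as its
# conditioning signal; only the image model's parameters are updated.
print(text_embeddings.shape)
```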

Datasets

Examples of images and captions from three public datasets which are commonly used to train text-to-image models

Training a text-to-image model requires a dataset of images paired with text captions. One dataset commonly used for this purpose is the COCO dataset. Released by Microsoft in 2014, COCO consists of around 123,000 images depicting a diversity of objects, with five captions per image generated by human annotators. Originally, the main focus of COCO was on the recognition of objects and scenes in images. Oxford 102 Flowers and CUB-200 Birds are smaller datasets of around 10,000 images each, restricted to flowers and birds, respectively. It is considered less difficult to train a high-quality text-to-image model with these datasets because of their narrow range of subject matter.[7]
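
As a brief sketch of what such (image, captions) pairs look like in practice, the snippet below reads COCO captions with torchvision; the directory and annotation-file paths are placeholders that assume the dataset has already been downloaded, and the pycocotools package is required:

```python
# Sketch: reading (image, captions) pairs from COCO with torchvision.
# Paths are placeholders for a locally downloaded copy of the dataset.
import torchvision.datasets as dset
import torchvision.transforms as transforms

coco = dset.CocoCaptions(
    root="coco/train2014",                               # path to the images
    annFile="coco/annotations/captions_train2014.json",  # caption annotations
    transform=transforms.ToTensor(),
)

image, captions = coco[0]   # one image tensor and its human-written captions
print(image.shape)          # e.g. torch.Size([3, H, W])
print(captions[:5])         # typically five captions per image
```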

One of the largest open datasets for training text-to-image models is LAION-5B, containing more than 5 billion image-text pairs. This dataset was created using web scraping and automatic filtering based on similarity to high-quality artwork and professional photographs. Because of this, however, it also contains controversial content, which has led to discussions about the ethics of its use.

Some modern AI platforms not only generate images from text but also create synthetic datasets to improve model training and fine-tuning. These datasets help avoid copyright issues and expand the diversity of training data.[19]

Quality evaluation


Evaluating and comparing the quality of text-to-image models is a problem that involves assessing multiple desirable properties. A desideratum specific to text-to-image models is that generated images semantically align with the text captions used to generate them. A number of schemes have been devised for assessing these qualities, some automated and others based on human judgement.[7]

A common algorithmic metric for assessing image quality and diversity is the Inception Score (IS), which is based on the distribution of labels predicted by a pretrained Inception v3 image classification model when applied to a sample of images generated by the text-to-image model. The score is increased when the image classification model predicts a single label with high probability, a scheme intended to favour "distinct" generated images. Another popular metric is the related Fréchet inception distance, which compares the distribution of generated images and real training images according to features extracted by one of the final layers of a pretrained image classification model.[7]
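
As a small numerical sketch (not tied to any particular evaluation library), the Inception Score can be computed from the matrix of class probabilities that a pretrained classifier assigns to each generated image; the random values below merely stand in for real classifier outputs:

```python
# Sketch of the Inception Score computed from classifier outputs.
# probs[i, j] is the probability a pretrained classifier (e.g. Inception v3)
# assigns class j to generated image i; random values are stand-ins here.
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 1000))                  # 1000 images, 1000 classes
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

p_y = probs.mean(axis=0)                                # marginal label distribution p(y)
# Average KL(p(y|x) || p(y)) over images, then exponentiate.
kl_per_image = (probs * (np.log(probs) - np.log(p_y))).sum(axis=1)
inception_score = float(np.exp(kl_per_image.mean()))
print(inception_score)   # higher = confident (sharp) and diverse predictions
```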

Impact and applications

This section is an excerpt from Artificial intelligence visual art § Impact and applications.
AI has the potential for a societal transformation, which may include enabling the expansion of noncommercial niche genres (such as cyberpunk derivatives like solarpunk) by amateurs, novel entertainment, fast prototyping,[20] increasing art-making accessibility,[20] and greater artistic output per unit of effort, expense, or time,[20] for example via generating drafts, draft-definitions, and image components (inpainting). Generated images are sometimes used as sketches,[21] low-cost experiments,[22] inspiration, or illustrations of proof-of-concept-stage ideas. Additional functionalities or improvements may also relate to post-generation manual editing (i.e., polishing), such as subsequent tweaking with an image editor.[22]

List of notable text-to-image models

Name                Release date          Developer            License
DALL-E              January 2021          OpenAI               Proprietary
DALL-E 2            April 2022            OpenAI               Proprietary
DALL-E 3            September 2023        OpenAI               Proprietary
GPT Image 1         March 2025[note 1]    OpenAI               Proprietary
Ideogram 0.1        August 2023           Ideogram             Proprietary
Ideogram 2.0        August 2024           Ideogram             Proprietary
Ideogram 3.0        March 2025            Ideogram             Proprietary
Imagen              April 2023            Google               Proprietary
Imagen 2            December 2023[24]     Google               Proprietary
Imagen 3            May 2024              Google               Proprietary
Imagen 4            May 2025              Google               Proprietary
Firefly             March 2023            Adobe Inc.           Proprietary
Midjourney          July 2022             Midjourney, Inc.     Proprietary
Halfmoon            March 2025            Reve AI, Inc.        Proprietary
Stable Diffusion    August 2022           Stability AI         Stability AI Community License[note 2]
Flux                August 2024           Black Forest Labs    Apache License[note 3]
Aurora              December 2024         xAI                  Proprietary
RunwayML            2018                  Runway AI, Inc.      Proprietary
Recraft             May 2023              Recraft, Inc.        Proprietary
AuraFlow            July 2024             FAL                  Apache License
HiDream             April 2025            HiDream-AI           MIT license

Explanatory notes

  1. Initially referred to as GPT-4o image generation.[23]
  2. This license can be used by individuals and organizations with up to $1 million in annual revenue; organizations with annual revenue above $1 million require the Stability AI Enterprise License. All outputs are retained by users regardless of revenue.
  3. The Apache License applies to the Schnell model; the Dev model uses a non-commercial license, while the Pro model is proprietary (only available as an API).


References

  1. Vincent, James (May 24, 2022). "All these images were generated by Google's latest text-to-image AI". The Verge. Vox Media. Archived from the original on February 15, 2023. Retrieved May 28, 2022.
  2. Agnese, Jorge; Herrera, Jonathan; Tao, Haicheng; Zhu, Xingquan (October 2019). A Survey and Taxonomy of Adversarial Neural Networks for Text-to-Image Synthesis. arXiv:1910.09399.
  3. Zhu, Xiaojin; Goldberg, Andrew B.; Eldawy, Mohamed; Dyer, Charles R.; Strock, Bradley (2007). "A text-to-picture synthesis system for augmenting communication" (PDF). AAAI. 7: 1590–1595. Archived (PDF) from the original on September 7, 2022. Retrieved September 7, 2022.
  4. Mansimov, Elman; Parisotto, Emilio; Lei Ba, Jimmy; Salakhutdinov, Ruslan (November 2015). "Generating Images from Captions with Attention". ICLR. arXiv:1511.02793.
  5. Mansimov, Elman; Parisotto, Emilio; Ba, Jimmy Lei; Salakhutdinov, Ruslan (February 29, 2016). "Generating Images from Captions with Attention". International Conference on Learning Representations. arXiv:1511.02793.
  6. Reed, Scott; Akata, Zeynep; Logeswaran, Lajanugen; Schiele, Bernt; Lee, Honglak (June 2016). "Generative Adversarial Text to Image Synthesis" (PDF). International Conference on Machine Learning. arXiv:1605.05396. Archived (PDF) from the original on March 16, 2023. Retrieved September 7, 2022.
  7. Frolov, Stanislav; Hinz, Tobias; Raue, Federico; Hees, Jörn; Dengel, Andreas (December 2021). "Adversarial text-to-image synthesis: A review". Neural Networks. 144: 187–209. arXiv:2101.09983. doi:10.1016/j.neunet.2021.07.019. PMID 34500257. S2CID 231698782.
  8. Rodriguez, Jesus (September 27, 2022). "🌅 Edge#229: VQGAN + CLIP". thesequence.substack.com. Archived from the original on December 4, 2022. Retrieved October 10, 2022.
  9. Rodriguez, Jesus (October 4, 2022). "🎆🌆 Edge#231: Text-to-Image Synthesis with GANs". thesequence.substack.com. Archived from the original on December 4, 2022. Retrieved October 10, 2022.
  10. Coldewey, Devin (January 5, 2021). "OpenAI's DALL-E creates plausible images of literally anything you ask it to". TechCrunch. Archived from the original on January 6, 2021. Retrieved September 7, 2022.
  11. Coldewey, Devin (April 6, 2022). "OpenAI's new DALL-E model draws anything — but bigger, better and faster than before". TechCrunch. Archived from the original on May 6, 2023. Retrieved September 7, 2022.
  12. "Stable Diffusion Public Release". Stability.Ai. Archived from the original on August 30, 2022. Retrieved October 27, 2022.
  13. Kumar, Ashish (October 3, 2022). "Meta AI Introduces 'Make-A-Video': An Artificial Intelligence System That Generates Videos From Text". MarkTechPost. Archived from the original on December 1, 2022. Retrieved October 3, 2022.
  14. Edwards, Benj (October 5, 2022). "Google's newest AI generator creates HD video from text prompts". Ars Technica. Archived from the original on February 7, 2023. Retrieved October 25, 2022.
  15. Rodriguez, Jesus (October 25, 2022). "🎨 Edge#237: What is Midjourney?". thesequence.substack.com. Archived from the original on December 4, 2022. Retrieved October 26, 2022.
  16. "Phenaki". phenaki.video. Archived from the original on October 7, 2022. Retrieved October 3, 2022.
  17. Edwards, Benj (September 9, 2022). "Runway teases AI-powered text-to-video editing using written prompts". Ars Technica. Archived from the original on January 27, 2023. Retrieved September 12, 2022.
  18. Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Kamyar Seyed Ghasemipour, Seyed; Karagol Ayan, Burcu; Sara Mahdavi, S.; Gontijo Lopes, Rapha; Salimans, Tim; Ho, Jonathan; J Fleet, David; Norouzi, Mohammad (May 23, 2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv:2205.11487 [cs.CV].
  19. Martin (January 29, 2025). "AI-Powered Text and Image Generation". Debatly.
  20. Elgan, Mike (November 1, 2022). "How 'synthetic media' will transform business forever". Computerworld. Archived from the original on February 10, 2023. Retrieved November 9, 2022.
  21. Roose, Kevin (October 21, 2022). "A.I.-Generated Art Is Already Transforming Creative Work". The New York Times. Archived from the original on February 15, 2023. Retrieved November 16, 2022.
  22. Leswing, Kif. "Why Silicon Valley is so excited about awkward drawings done by artificial intelligence". CNBC. Archived from the original on February 8, 2023. Retrieved November 16, 2022.
  23. "Introducing 4o Image Generation". OpenAI. March 25, 2025. Retrieved March 27, 2025.
  24. "Imagen 2 on Vertex AI is now generally available". Google Cloud Blog. Archived from the original on February 21, 2024. Retrieved January 2, 2024.