Prompt engineering is the process of structuring or crafting an instruction in order to produce better outputs from a generative artificial intelligence (AI) model.[1]
A prompt is natural language text describing the task that an AI should perform.[2] A prompt for a text-to-text language model can be a query, a command, or a longer statement including context, instructions, and conversation history. Prompt engineering may involve phrasing a query, specifying a style, choice of words and grammar,[3] providing relevant context, or describing a character for the AI to mimic.[1]
When communicating with a text-to-image or a text-to-audio model, a typical prompt is a description of a desired output such as "a high-quality photo of an astronaut riding a horse"[4] or "Lo-fi slow BPM electro chill with organic samples".[5] Prompting a text-to-image model may involve adding, removing, or emphasizing words to achieve a desired subject, style, layout, lighting, and aesthetic.[6]
In 2018, researchers first proposed that all previously separate tasks in natural language processing (NLP) could be cast as a question-answering problem over a context. In addition, they trained the first single, joint, multi-task model that would answer any task-related question, such as "What is the sentiment?", "Translate this sentence to German", or "Who is the president?"[7]
In 2025, researchers proposed a reflexive prompt engineering framework that incorporates ethical and governance considerations into prompt design and management.[8]
The AI boom saw an increase in the number of prompting techniques aimed at getting a model to produce the desired outcome and avoid nonsensical output, a process characterized by trial and error.[9] After the release of ChatGPT in 2022, prompt engineering was soon seen as an important business skill, albeit one with an uncertain economic future.[1]
A repository for prompts reported that over 2,000 public prompts for around 170 datasets were available in February 2022.[10] In 2022, the chain-of-thought prompting technique was proposed by Google researchers.[11][12] In 2023, several text-to-text and text-to-image prompt databases were made publicly available.[13][14] The Personalized Image-Prompt (PIP) dataset, a generated image-text dataset categorized by 3,115 users, was also made publicly available in 2024.[15]
Multiple distinct prompt engineering techniques have been published.
According to Google Research, chain-of-thought (CoT) prompting is a technique that allows large language models (LLMs) to solve a problem as a series of intermediate steps before giving a final answer. In 2022, Google Brain reported that chain-of-thought prompting improves reasoning ability by inducing the model to answer a multi-step problem with steps of reasoning that mimic a train of thought.[11][16] Chain-of-thought techniques were developed to help LLMs handle multi-step reasoning tasks, such as arithmetic or commonsense reasoning questions.[17][18]
For example, given the question, "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?", Google claims that a CoT prompt might induce the LLM to answer "A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9."[11] When applied to PaLM, a 540 billion parameter language model, according to Google, CoT prompting significantly aided the model, allowing it to perform comparably with task-specific fine-tuned models on several tasks, achieving state-of-the-art results at the time on the GSM8K mathematical reasoning benchmark.[11] It is possible to fine-tune models on CoT reasoning datasets to enhance this capability further and stimulate better interpretability.[19][20]
As originally proposed by Google,[11] each CoT prompt is accompanied by a set of input/output examples, called exemplars, to demonstrate the desired model output, making it a few-shot prompting technique. However, according to a later paper from researchers at Google and the University of Tokyo, simply appending the words "Let's think step-by-step"[21] was also effective, which allowed for CoT to be employed as a zero-shot technique.
An example format of few-shot CoT prompting with in-context exemplars:[22]
Q: {example question 1}
A: {example answer 1}
...
Q: {example question n}
A: {example answer n}
Q: {question}
A: {LLM output}

An example format of zero-shot CoT prompting:[21]

Q: {question}. Let's think step by step.
A: {LLM output}

In-context learning refers to a model's ability to temporarily learn from prompts. For example, a prompt may include a few examples for a model to learn from, such as asking the model to complete "maison → house, chat → cat, chien →" (the expected response being dog),[23] an approach called few-shot learning.[24]
In-context learning is an emergent ability[25] of large language models. It is an emergent property of model scale, meaning that breaks[26] in downstream scaling laws occur, leading to its efficacy increasing at a different rate in larger models than in smaller models.[25][11] Unlike training and fine-tuning, which produce lasting changes, in-context learning is temporary.[27] Training models to perform in-context learning can be viewed as a form of meta-learning, or "learning to learn".[28]
Self-Consistency performs several chain-of-thought rollouts, then selects the most commonly reached conclusion out of all the rollouts.[29][30]
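For illustration, a minimal sketch of self-consistency in Python, assuming a hypothetical query_llm(prompt, temperature) helper that returns one sampled completion and a simple extract_answer() convention for pulling the final answer out of a chain-of-thought response:

from collections import Counter

def query_llm(prompt: str, temperature: float = 0.7) -> str:
    # Hypothetical helper: send the prompt to an LLM and return one sampled completion.
    raise NotImplementedError("replace with a call to your model of choice")

def extract_answer(completion: str) -> str:
    # Hypothetical convention: take the text after the final "The answer is".
    return completion.rsplit("The answer is", 1)[-1].strip(" .")

def self_consistency(question: str, n_rollouts: int = 10) -> str:
    # Sample several chain-of-thought rollouts at non-zero temperature...
    prompt = f"Q: {question}\nA: Let's think step by step."
    answers = [extract_answer(query_llm(prompt)) for _ in range(n_rollouts)]
    # ...then return the most commonly reached final answer (majority vote).
    return Counter(answers).most_common(1)[0][0]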
Tree-of-thought prompting generalizes chain-of-thought by generating multiple lines of reasoning in parallel, with the ability to backtrack or explore other paths. It can use tree search algorithms like breadth-first search, depth-first search, or beam search.[30][31]
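A condensed sketch of tree-of-thought search with breadth-first expansion and a beam, assuming hypothetical propose_thoughts() and score_state() helpers that would each be backed by LLM calls:

def propose_thoughts(state: str, k: int = 3) -> list[str]:
    # Hypothetical helper: ask an LLM for k candidate next reasoning steps given the partial solution.
    raise NotImplementedError

def score_state(state: str) -> float:
    # Hypothetical helper: ask an LLM to rate how promising a partial solution is (higher is better).
    raise NotImplementedError

def tree_of_thought(question: str, depth: int = 3, beam_width: int = 5) -> str:
    frontier = [question]
    for _ in range(depth):
        # Expand every state in the frontier with several candidate reasoning steps.
        candidates = [s + "\n" + t for s in frontier for t in propose_thoughts(s)]
        # Keep only the most promising states (beam search); less promising paths are dropped.
        frontier = sorted(candidates, key=score_state, reverse=True)[:beam_width]
    return frontier[0]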
Research consistently demonstrates that LLMs are highly sensitive to subtle variations in prompt formatting, structure, and linguistic properties. Some studies have found performance differences of up to 76 accuracy points across formatting changes in few-shot settings.[32] Linguistic features such as morphology, syntax, and lexico-semantic choices significantly influence prompt effectiveness and can meaningfully affect task performance across a variety of tasks.[3][33] Clausal syntax, for example, improves consistency and reduces uncertainty in knowledge retrieval.[34] This sensitivity persists even with larger model sizes, additional few-shot examples, or instruction tuning.
To address this sensitivity and make models more robust, several methods have been proposed. FormatSpread facilitates systematic analysis by evaluating a range of plausible prompt formats, offering a more comprehensive performance interval.[32] Similarly, PromptEval estimates performance distributions across diverse prompts, enabling robust metrics such as performance quantiles and accurate evaluations under constrained budgets.[35]
Retrieval-augmented generation (RAG) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with an LLM so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information.[36]
RAG improves large language models by incorporating information retrieval before generating responses. Unlike traditional LLMs that rely on static training data, RAG pulls relevant text from databases, uploaded documents, or web sources. According to Ars Technica, "RAG is a way of improving LLM performance, in essence by blending the LLM process with a web search or other document look-up process to help LLMs stick to the facts." This method helps reduce AI hallucinations, which have led to real-world issues like chatbots inventing policies or lawyers citing nonexistent legal cases. By dynamically retrieving information, RAG enables AI to provide more accurate responses without frequent retraining.[37]
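An illustrative sketch of the retrieve-then-generate flow, assuming a hypothetical embed() function, an in-memory list of documents, and a query_llm() helper; production systems typically replace the linear scan with a vector database:

import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical helper: map text to an embedding vector with any embedding model.
    raise NotImplementedError

def query_llm(prompt: str) -> str:
    # Hypothetical helper: return an LLM completion for the prompt.
    raise NotImplementedError

def rag_answer(question: str, documents: list[str], top_k: int = 3) -> str:
    # Retrieve: rank documents by cosine similarity to the question embedding.
    q = embed(question)
    def similarity(doc: str) -> float:
        v = embed(doc)
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    context = sorted(documents, key=similarity, reverse=True)[:top_k]
    # Augment: place the retrieved passages in the prompt ahead of the question.
    prompt = ("Answer using only the context below.\n\n"
              + "\n\n".join(context)
              + f"\n\nQuestion: {question}\nAnswer:")
    # Generate: the model answers with reference to the supplied documents.
    return query_llm(prompt)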

GraphRAG (coined by Microsoft Research) is a technique that extends RAG with the use of a knowledge graph (usually, LLM-generated) to allow the model to connect disparate pieces of information, synthesize insights, and holistically understand summarized semantic concepts over large data collections. It was shown to be effective on datasets like the Violent Incident Information from News Articles (VIINA).[38][39]
Earlier work showed the effectiveness of using a knowledge graph for question answering via text-to-query generation.[40] These techniques can be combined to search across both unstructured and structured data, providing expanded context and improved ranking.
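As a sketch of text-to-query generation over a knowledge graph, the example below assumes hypothetical query_llm() and run_graph_query() helpers and a Cypher-style query language; the schema description is a placeholder:

def query_llm(prompt: str) -> str:
    # Hypothetical helper: return an LLM completion for the prompt.
    raise NotImplementedError

def run_graph_query(cypher: str) -> list[dict]:
    # Hypothetical helper: execute the query against the knowledge graph and return rows.
    raise NotImplementedError

def answer_from_graph(question: str, schema_description: str) -> str:
    # Ask the LLM to translate the natural-language question into a structured graph query.
    cypher = query_llm(
        "Given this graph schema:\n" + schema_description
        + "\nWrite a Cypher query that answers: " + question
        + "\nReturn only the query."
    )
    rows = run_graph_query(cypher)
    # Let the LLM phrase the structured results as a natural-language answer.
    return query_llm(f"Question: {question}\nQuery results: {rows}\nAnswer the question using these results.")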
LLMs themselves can be used to compose prompts for LLMs.[41] The automatic prompt engineer algorithm uses one LLM to beam search over prompts for another LLM.[42][43]
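A schematic sketch of such a search, assuming hypothetical propose_prompts(), mutate_prompt(), and score_prompt() helpers in which one LLM drafts and rewrites candidate instructions while the target LLM is scored on a small labelled set:

def propose_prompts(task_description: str, examples: list, k: int) -> list[str]:
    # Hypothetical helper: ask the "engineer" LLM to draft k candidate instructions for the task.
    raise NotImplementedError

def mutate_prompt(prompt: str) -> str:
    # Hypothetical helper: ask the "engineer" LLM to rephrase a promising instruction.
    raise NotImplementedError

def score_prompt(prompt: str, examples: list) -> float:
    # Hypothetical helper: accuracy of the target LLM on held-out examples when given this instruction.
    raise NotImplementedError

def automatic_prompt_search(task_description: str, examples: list,
                            beam_width: int = 4, rounds: int = 3) -> str:
    # Start from a pool of LLM-proposed instructions.
    candidates = propose_prompts(task_description, examples, k=beam_width * 4)
    for _ in range(rounds):
        # Keep the highest-scoring candidates (the beam)...
        beam = sorted(candidates, key=lambda p: score_prompt(p, examples), reverse=True)[:beam_width]
        # ...and expand each of them with LLM-generated variations.
        candidates = beam + [mutate_prompt(p) for p in beam]
    return max(candidates, key=lambda p: score_prompt(p, examples))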
CoT examples can be generated by LLMs themselves. In "auto-CoT", a library of questions is converted into vectors by a model such as BERT. The question vectors are clustered. Questions close to the centroid of each cluster are selected, giving a diverse subset of questions. An LLM performs zero-shot CoT on each selected question, and the question together with its CoT answer is added to a dataset of demonstrations. These diverse demonstrations can then be added to prompts for few-shot learning.[44]
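A rough sketch of this pipeline, assuming a hypothetical embed() sentence encoder (e.g. BERT-based) and a query_llm() helper, and using k-means from scikit-learn for the clustering step:

import numpy as np
from sklearn.cluster import KMeans

def embed(question: str) -> np.ndarray:
    # Hypothetical helper: encode a question into a vector (e.g. with a BERT-style model).
    raise NotImplementedError

def query_llm(prompt: str) -> str:
    # Hypothetical helper: return an LLM completion for the prompt.
    raise NotImplementedError

def build_auto_cot_demos(questions: list[str], n_clusters: int = 8) -> str:
    vectors = np.stack([embed(q) for q in questions])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(vectors)
    demos = []
    for c in range(n_clusters):
        # Pick the question closest to each cluster centroid to obtain a diverse subset.
        dist = np.linalg.norm(vectors - km.cluster_centers_[c], axis=1)
        dist[km.labels_ != c] = np.inf
        q = questions[int(np.argmin(dist))]
        # Answer it zero-shot with "Let's think step by step."
        answer = query_llm(f"Q: {q}\nA: Let's think step by step.")
        demos.append(f"Q: {q}\nA: Let's think step by step. {answer}")
    # The collected demonstrations can be prepended to new questions for few-shot CoT prompting.
    return "\n\n".join(demos)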
Automatic prompt optimization techniques refine prompts for LLMs using test datasets and comparison metrics to determine whether changes improve performance. Methods such as MIPRO jointly optimize instructions and few-shot examples,[45] while GEPA evolves prompts using natural-language reflection on execution feedback.[46] There are also open-source implementations of such algorithms in frameworks like DSPy[47] and Opik.[48]
In 2022, text-to-image models like DALL-E 2, Stable Diffusion, and Midjourney were released to the public. These models take text prompts as input and use them to generate images.[49][6]
Early text-to-image models typically do not understand negation, grammar and sentence structure in the same way as large language models, and may thus require a different set of prompting techniques. The prompt "a party with no cake" may produce an image including a cake.[50] As an alternative, negative prompts allow a user to indicate, in a separate prompt, which terms should not appear in the resulting image.[51] Techniques such as framing the normal prompt into a sequence-to-sequence language modeling problem can be used to automatically generate an output for the negative prompt.[52]
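For illustration, the sketch below passes a negative prompt to a Stable Diffusion pipeline via the Hugging Face diffusers library (an assumed dependency; parameter names should be checked against the installed version):

from diffusers import StableDiffusionPipeline

# Assumed checkpoint name; any Stable Diffusion checkpoint works the same way.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
result = pipe(
    prompt="a birthday party in a sunlit garden",
    negative_prompt="cake, candles, blurry, low quality",  # terms the image should avoid
)
result.images[0].save("party_without_cake.png")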
A text-to-image prompt commonly includes a description of the subject of the art, the desired medium (such as digital painting or photography), style (such as hyperrealistic or pop-art), lighting (such as rim lighting or crepuscular rays), color, and texture.[53] Word order also affects the output of a text-to-image prompt. Words closer to the start of a prompt may be emphasized more heavily.[54]
The Midjourney documentation encourages short, descriptive prompts: instead of "Show me a picture of lots of blooming California poppies, make them bright, vibrant orange, and draw them in an illustrated style with colored pencils", an effective prompt might be "Bright orange California poppies drawn with colored pencils".[50]
Some text-to-image models are capable of imitating the style of particular artists by name. For example, the phrase in the style of Greg Rutkowski has been used in Stable Diffusion and Midjourney prompts to generate images in the distinctive style of Polish digital artist Greg Rutkowski.[55] Famous artists such as Vincent van Gogh and Salvador Dalí have also been used for styling and testing.[56]
Some approaches augment or replace natural language text prompts with non-text input.
For text-to-image models, textual inversion performs an optimization process to create a new word embedding based on a set of example images. This embedding vector acts as a "pseudo-word" which can be included in a prompt to express the content or style of the examples.[57]
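As an illustration, once such an embedding has been learned it can be loaded and referenced as a pseudo-word in later prompts; the sketch below assumes the Hugging Face diffusers library and a hypothetical embedding file, with method names depending on the installed version:

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Load a learned embedding and bind it to the placeholder token "<my-style>" (hypothetical file name).
pipe.load_textual_inversion("./my_style_embedding.bin", token="<my-style>")
# The pseudo-word can now be used like any other word in a prompt.
image = pipe("a watercolor landscape in the style of <my-style>").images[0]
image.save("landscape.png")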
In 2023, Meta's AI research division released Segment Anything, a computer vision model that can perform image segmentation by prompting. As an alternative to text prompts, Segment Anything can accept bounding boxes, segmentation masks, and foreground/background points.[58]
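A brief sketch of point- and box-based prompting with the segment_anything package (an assumed dependency; the checkpoint file name and exact call signatures should be verified against the released code):

import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load a pre-trained SAM checkpoint (file name is illustrative).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("photo.jpg").convert("RGB"))
predictor.set_image(image)

# Prompt with one foreground point (label 1) and a bounding box instead of text.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    box=np.array([425, 300, 700, 875]),
    multimask_output=True,
)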
In "prefix-tuning",[59] "prompt tuning", or "soft prompting",[60] floating-point-valued vectors are searched directly bygradient descent to maximize the log-likelihood on outputs.
Formally, let $E = \{e_1, \dots, e_k\}$ be a set of soft prompt tokens (tunable embeddings), while $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_n\}$ be the token embeddings of the input and output respectively. During training, the tunable embeddings, input, and output tokens are concatenated into a single sequence $\text{concat}(E; X; Y)$ and fed to the LLM. The losses are computed over the $Y$ tokens; the gradients are backpropagated to prompt-specific parameters: in prefix-tuning, they are parameters associated with the prompt tokens at each layer; in prompt tuning, they are merely the soft tokens added to the vocabulary.[61]
More formally, prompt tuning can be described as follows. Let an LLM be written as $LLM(X) = F(E(X))$, where $X$ is a sequence of linguistic tokens, $E$ is the token-to-vector function, and $F$ is the rest of the model. In prompt tuning, one provides a set of input-output pairs $\{(X^i, Y^i)\}_i$ and then uses gradient descent to search for $\arg\max_{\tilde{Z}} \sum_i \log Pr[Y^i \mid \tilde{Z} \ast E(X^i)]$. In words, $\log Pr[Y^i \mid \tilde{Z} \ast E(X^i)]$ is the log-likelihood of outputting $Y^i$ if the model first encodes the input $X^i$ into the vector $E(X^i)$, then prepends the soft prompt vector $\tilde{Z}$ to that vector, and then applies $F$. For prefix tuning, it is similar, but the "prefix vector" $\tilde{Z}$ is prepended to the hidden states in every layer of the model.[citation needed]
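A schematic PyTorch sketch of prompt tuning, assuming the Hugging Face transformers library and a small causal LM that accepts pre-computed input embeddings via inputs_embeds; only the soft prompt vectors are updated by gradient descent, and to keep the sketch short the loss is taken over all real tokens rather than only the outputs:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative choice of a small causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
for p in model.parameters():
    p.requires_grad = False  # the language model itself stays frozen

n_soft = 20
dim = model.get_input_embeddings().embedding_dim
soft_prompt = torch.nn.Parameter(torch.randn(n_soft, dim) * 0.02)  # the only trainable parameters
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

def training_step(input_text: str, target_text: str) -> float:
    ids = tok(input_text + target_text, return_tensors="pt").input_ids
    tok_embeds = model.get_input_embeddings()(ids)                     # (1, T, dim)
    embeds = torch.cat([soft_prompt.unsqueeze(0), tok_embeds], dim=1)  # prepend the soft tokens
    # Mask the soft-prompt positions out of the loss with the ignore index -100.
    labels = torch.cat([torch.full((1, n_soft), -100), ids], dim=1)
    loss = model(inputs_embeds=embeds, labels=labels).loss
    loss.backward()  # gradients flow only into soft_prompt
    opt.step()
    opt.zero_grad()
    return loss.item()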
An earlier result uses the same idea of gradient descent search, but is designed for masked language models like BERT, and searches only over token sequences, rather than numerical vectors. Formally, it searches for $\arg\max_{\tilde{X}} \sum_i \log Pr[Y^i \mid \tilde{X} \ast X^i]$, where $\tilde{X}$ ranges over token sequences of a specified length.[62]
While writing and refining a prompt for an LLM or other generative AI shares some parallels with an iterative engineering design process, such as discovering reusable "best principles" through reproducible experimentation, the principles and skills learned are heavily tied to the specific model rather than generalizing across the entire field of prompt-based generative models. Such patterns are also volatile: seemingly insignificant prompt changes can produce significantly different results.[63][64] According to The Wall Street Journal in 2025, the job of prompt engineer was one of the hottest in 2023, but has since become obsolete as models better intuit user intent and as companies train their own staff.[65]
Prompt injection is a cybersecurity exploit in which adversaries craft inputs that appear legitimate but are designed to cause unintended behavior in machine learning models, particularly large language models. This attack takes advantage of the model's inability to distinguish between developer-defined prompts and user inputs, allowing adversaries to bypass safeguards and influence model behavior. While LLMs are designed to follow trusted instructions, they can be manipulated into carrying out unintended responses through carefully crafted inputs.[66][67]