When ChatGPT launched in late 2022, "prompt engineering" was a collection of community-discovered tricks: say "think step by step," pretend you're a character, use magic phrases. By 2026, the discipline has been studied rigorously, with published research from academic labs and engineering teams at every major AI company. Some of the original tricks held up under scrutiny. Many didn't. Here's what the evidence actually says.
What Consistently Works
Chain-of-Thought (CoT) prompting: Instructing the model to "think through this step by step" or providing example reasoning traces before the target problem remains one of the most well-validated techniques. A landmark 2022 paper from Google Brain showed CoT prompting improved performance on math word problems by 40–100% depending on model size. The effect is largest for complex reasoning tasks and smaller models — frontier models like GPT-5 now do CoT internally, reducing the need for explicit prompting.
Few-shot examples: Providing 3–5 examples of the input-output format you want remains highly effective for format-sensitive tasks. The examples implicitly communicate constraints that are difficult to state explicitly. Research shows that the quality of examples matters more than the quantity — 3 great examples beat 10 mediocre ones.
Explicit output format specification: Telling the model exactly what format you want the output in — JSON schema, specific section headers, word count limits — dramatically reduces the post-processing you need to do. Include a concrete example of the expected output format when possible.
Role assignment (with caveats): "You are an expert financial analyst" style role prompts improve performance on domain-specific tasks, but the effect is smaller than early community enthusiasm suggested. Recent research shows the benefit comes from the role activating relevant knowledge, not from any psychological "persona" effect. You don't need elaborate role descriptions — simple domain specification ("Answer as an experienced software architect") is usually sufficient.
What Doesn't Work As Advertised
"Please" and emotional appeals: Adding "please," "I really need this," or threatening the model ("my career depends on this") may have minor effects on some models, but there's no consistent evidence these techniques reliably improve output quality. Skip them.
Extremely long system prompts: Research from Anthropic and others shows that models begin to "lose" information buried in very long context windows — a phenomenon called "lost in the middle." System prompts longer than 2,000 tokens often see diminishing returns, and key instructions should be placed at the beginning or end of the prompt, not buried in the middle.
Jailbreaking-style framing for safety-related tasks: Techniques that try to "trick" models into ignoring safety training (DAN prompts and similar) are increasingly ineffective on modern frontier models and create unreliable outputs even when they "work."
The Emerging Discipline: Prompt Testing
The most important shift in production prompt engineering over the past two years is the move toward systematic evaluation. Rather than relying on intuition, serious teams build eval suites — sets of test inputs with expected outputs — and run candidate prompts against these evals before deployment. Tools like PromptFoo, Braintrust, and LangSmith have made this workflow accessible without a dedicated ML research team. If you're iterating on prompts for a production AI system without a structured eval process, you're flying blind — and the data shows it costs 30–50% more in debugging time than building evals upfront.