Manual prompt engineering has no real feedback loop. You iterate by feel, test on a handful of examples, and hope it generalises. The result lives in someone's head or a shared doc — not in a versioned, auditable system. There is a better way.

The problem with manual prompt iteration

If you have spent time doing serious prompt engineering, you know the pattern. You write a prompt. You test it on five or ten examples you happen to remember. It works. You ship it. Three weeks later you get a bug report: the model is doing something strange on an input type you did not think to test. You fix the prompt. Something else breaks.

This is not incompetence — it is the nature of the process. Manual prompt iteration has three structural problems:

  • No systematic eval. You test against the examples you think of, not a representative sample of what the model will actually see in production.
  • No search. You are exploring the prompt space by intuition. The space is enormous. You will not find the optimal prompt this way — you will find a prompt that is good enough, given the time you have.
  • No versioning. The prompt lives in a config file, a doc, or someone's memory. There is no record of why it changed, what it was before, or whether the new version is actually better on a held-out test set.

Treating prompt optimisation as search

GEPA — reflective genetic-Pareto optimisation — reframes prompt engineering as a search problem. The core loop is straightforward:

  1. Generate candidates. Start from an initial prompt and produce a population of variants — different phrasings, different instruction structures, different examples included or excluded.
  2. Score against evals. Run the candidate prompts against an eval set drawn from your production traces, and keep a separate held-out slice that the search never sees for the final comparison. Score each output on one or more metrics: accuracy, format compliance, latency, whatever matters for your task.
  3. Select the Pareto front. Rather than optimising for a single metric, keep every candidate that is not dominated: no other candidate scores at least as well on every metric and strictly better on at least one. This gives you a frontier of trade-offs rather than a single brittle winner.
  4. Mutate and recombine. The surviving candidates are the parents of the next generation. Generate new variants by combining their best elements and introducing controlled mutations.
  5. Repeat. After several generations, the search usually settles on prompts that outscore the hand-written starting point on the eval set, and the recorded scores are the evidence.
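The selection in step 3 can be sketched in a few lines. This is a minimal illustration rather than GEPA's actual implementation: the candidate names and scores are made up, and it assumes every metric is "higher is better" (so a metric like latency would need to be negated first).

```python
from typing import Dict, List

Scores = Dict[str, float]  # metric name -> score, higher is better (assumption)


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(a[m] >= b[m] for m in b) and any(a[m] > b[m] for m in b)


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical candidates and scores, for illustration only.
candidates = {
    "terse": {"accuracy": 0.82, "format": 0.99},
    "verbose": {"accuracy": 0.90, "format": 0.91},
    "baseline": {"accuracy": 0.80, "format": 0.90},
}
front = pareto_front(candidates)
```

Here "baseline" is dropped because "terse" beats it on both metrics, while "terse" and "verbose" both survive: each wins on one metric, so neither dominates the other.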

The "reflective" part of GEPA adds a self-critique step: a second model pass reviews why a candidate failed and suggests targeted mutations based on that analysis, rather than relying on random perturbation alone. Because each mutation addresses an observed failure, the search tends to need fewer evaluations to reach a given score than blind mutation would.
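To make the reflective step concrete, here is a deliberately simplified sketch. It tracks a single candidate greedily instead of a population, and toy_run and toy_reflect are hypothetical stand-ins for the model call and the critique pass so that the example runs offline. It shows the shape of the idea, not GEPA's real algorithm.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input, expected output)
RunFn = Callable[[str, str], str]  # (prompt, input) -> model output
ReflectFn = Callable[[str, List[Example]], str]  # (prompt, failures) -> revised prompt


def failures(prompt: str, run: RunFn, evalset: List[Example]) -> List[Example]:
    """Return the examples whose output under this prompt misses the expectation."""
    return [(x, y) for x, y in evalset if run(prompt, x) != y]


def reflective_step(prompt: str, run: RunFn, reflect: ReflectFn,
                    evalset: List[Example]) -> str:
    """Propose one failure-driven rewrite; keep it only if it fails on fewer examples."""
    failed = failures(prompt, run, evalset)
    if not failed:
        return prompt
    candidate = reflect(prompt, failed)
    if len(failures(candidate, run, evalset)) < len(failed):
        return candidate
    return prompt


# Hypothetical stand-ins: a real run() and reflect() would each call a model.
def toy_run(prompt: str, text: str) -> str:
    if "trim" in prompt:
        text = text.strip()
    if "uppercase" in prompt:
        text = text.upper()
    return text


def toy_reflect(prompt: str, failed: List[Example]) -> str:
    text, expected = failed[0]
    if "uppercase" not in prompt and expected == expected.upper():
        return prompt + " Convert to uppercase."
    if "trim" not in prompt and text != text.strip():
        return prompt + " Also trim whitespace."
    return prompt


evalset = [("hello", "HELLO"), (" hi ", "HI")]
final_prompt = "Echo the input."
for _ in range(3):
    final_prompt = reflective_step(final_prompt, toy_run, toy_reflect, evalset)
```

Each accepted rewrite responds to a specific failing example, which is the difference from random perturbation: the first pass adds the uppercase instruction, the second adds trimming, and the third finds nothing left to fix.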

How DSPy fits in

DSPy provides the underlying framework that makes this practical. It treats prompts not as raw strings but as typed, composable programs — which means the optimisation loop can operate on structured components rather than opaque blobs of text. The evaluation harness, the metric definitions, and the candidate generation all plug into DSPy's abstractions.

The practical effect: you define what "good" means (the eval), and the system searches for prompts that achieve it. You are no longer in the loop for each iteration — you review the results at the end and decide whether to ship.
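Defining what "good" means usually comes down to a scoring function. The sketch below is plain Python rather than DSPy's own API, and the task (a classifier expected to reply with JSON containing a "label" key) is an assumption made purely for illustration.

```python
import json
from typing import Dict


def score_output(expected_label: str, raw_output: str) -> Dict[str, float]:
    """Score one output on two axes: is it valid JSON with a 'label' key, and is the label right."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # Unparseable output fails both format and accuracy.
        return {"format": 0.0, "accuracy": 0.0}
    if not isinstance(parsed, dict) or "label" not in parsed:
        return {"format": 0.0, "accuracy": 0.0}
    return {"format": 1.0, "accuracy": float(parsed["label"] == expected_label)}


good = score_output("spam", '{"label": "spam"}')
wrong_label = score_output("spam", '{"label": "ham"}')
bad_format = score_output("spam", "spam")
```

Returning one score per metric, instead of a single blended number, is what lets a Pareto-style selection keep candidates that trade format compliance against accuracy.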

What this costs and what you need

An optimisation run costs roughly $2–10 in API calls. No GPU is required: the search runs entirely through API calls to the model you are already using. The main inputs are production traces (real inputs and the outputs your system produced) and a scoring function that can evaluate whether each output was good.

You do not need thousands of examples to start. A few hundred representative inputs, especially ones that cover your known failure modes, are enough to run a meaningful optimisation. The eval set is the most important thing you will build. More on that in a moment.
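One way to turn raw traces into an eval set is sketched below. The trace fields ("input", "output", "tag") and the idea of tagging each trace with its failure mode are assumptions for illustration; the point is to deduplicate inputs and to hold out a slice of every tag so the final comparison covers each known failure mode.

```python
import random
from typing import Dict, List, Tuple

Trace = Dict[str, str]  # assumed fields: "input", "output", "tag"


def build_eval_split(traces: List[Trace], holdout_fraction: float = 0.3,
                     seed: int = 0) -> Tuple[List[Trace], List[Trace]]:
    """Deduplicate by input, then split per tag so the held-out set covers every tag."""
    unique: Dict[str, Trace] = {}
    for t in traces:
        unique.setdefault(t["input"], t)  # keep the first trace seen for each input

    by_tag: Dict[str, List[Trace]] = {}
    for t in unique.values():
        by_tag.setdefault(t["tag"], []).append(t)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    search: List[Trace] = []
    holdout: List[Trace] = []
    for tag in sorted(by_tag):
        group = by_tag[tag]
        rng.shuffle(group)
        k = max(1, round(len(group) * holdout_fraction))  # at least one per tag
        holdout.extend(group[:k])
        search.extend(group[k:])
    return search, holdout


# Hypothetical traces, including one duplicated input.
traces = (
    [{"input": f"normal-{i}", "output": "ok", "tag": "normal"} for i in range(10)]
    + [{"input": f"edge-{i}", "output": "ok", "tag": "edge"} for i in range(4)]
    + [{"input": "normal-0", "output": "ok", "tag": "normal"}]
)
search_set, holdout_set = build_eval_split(traces)
```

The search set is what candidates are scored against during optimisation; the held-out set is touched only once, to check that the winning prompt's gains are real.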

What you get at the end

The output is not just a better prompt. It is a versioned artifact: the optimised prompt, the eval set it was scored against, the metrics it achieved, and enough metadata to reproduce the run. This goes into your artifact registry alongside your model checkpoints and RL policies.
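As a rough sketch of what such an artifact record could look like, the following uses only the standard library. The field names, the model name, and the content-hash ID scheme are illustrative assumptions, not a prescribed registry format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


def evalset_digest(examples: List[Dict[str, str]]) -> str:
    """Fingerprint the eval set so the artifact pins the exact data it was scored on."""
    payload = json.dumps(examples, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class PromptArtifact:
    prompt: str
    model: str
    evalset_sha256: str
    metrics: Dict[str, float]
    seed: int

    def artifact_id(self) -> str:
        """Short content hash: identical contents always give the same ID."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical values, for illustration only.
examples = [{"input": "hello", "expected": "HELLO"}]
artifact = PromptArtifact(
    prompt="Convert the input to uppercase.",
    model="example-model-v1",
    evalset_sha256=evalset_digest(examples),
    metrics={"accuracy": 0.9},
    seed=0,
)
artifact_id = artifact.artifact_id()
```

Deriving the ID from the contents means that any change to the prompt, the eval set, or the recorded metrics produces a new ID, which makes silent edits visible.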

When a new model version drops — from your vendor, or because you have fine-tuned a new checkpoint — you can re-run the optimisation loop against the new model in hours. The eval set carries over. The baseline comparison is automatic. You are not starting from scratch; you are re-searching a space you already understand.
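The baseline comparison can be as simple as diffing two metric dictionaries. In this sketch the tolerance, the metric names, and the scores are all made up for illustration; a real check would read both sets of numbers from stored artifacts.

```python
from typing import Dict, Tuple


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.01) -> Dict[str, Tuple[float, float]]:
    """Metrics where the candidate scores below the baseline by more than the tolerance."""
    flagged: Dict[str, Tuple[float, float]] = {}
    for metric, old in baseline.items():
        new = candidate.get(metric, 0.0)  # a missing metric counts as a score of zero
        if new < old - tolerance:
            flagged[metric] = (old, new)
    return flagged


# Hypothetical scores for the same eval set on two model versions.
old_model = {"accuracy": 0.90, "format": 0.95}
new_model = {"accuracy": 0.93, "format": 0.88}
flagged = regressions(old_model, new_model)
```

In this example the new model improves accuracy but regresses on format compliance, which is the kind of trade-off worth reviewing before shipping a re-optimised prompt.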

The real asset is the eval set

Here is the insight that changes how you think about this work: the prompt is not the asset. The prompt will be superseded. The model it was optimised for will be deprecated. What you are really building, over time, is the eval set — the curated collection of inputs, expected outputs, and scoring criteria that define what "good" means for your task.

The eval set is the asset that outlives every model version. It encodes your team's judgement about quality in a form that machines can run against any future model.

An eval set that accurately reflects production distribution — including the edge cases, the tricky inputs, the outputs that look right but are subtly wrong — is genuinely hard to build. Once you have it, re-optimising prompts becomes a commodity operation. The knowledge is in the eval, not in the prompt.

This is why manual prompt engineering, however skilled, cannot compound the way systematic optimisation can. The prompt changes; the eval accumulates. After a year of optimisation cycles, the eval set is a precise, machine-executable specification of your quality bar. The prompt is whatever configuration happened to score best against it last week.

If your team is currently iterating prompts by hand and feeling the friction, get in touch — the shift to systematic optimisation is usually faster to set up than people expect, and the improvement in both prompt quality and iteration speed is significant.