Prompt Optimizers, Once Declared Dead, Are The Best Tool for Improving Your GenAI System

Prompt optimizers were first developed in late 2023, and since then they have emerged as the most efficient and effective way to improve GenAI systems. They excel with small datasets and weak supervision signals, they require no updates to the underlying LLM, and they almost invariably outperform fine-tuning methods.

Unsupervised prompt optimizers simply look for patterns in examples, which is an extremely lightweight way to surface latent requirements for a system. Supervised prompt optimizers require somewhat larger and more systematic datasets, but they are still extremely data-efficient compared to fine-tuning.
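To make the data-efficiency point concrete, here is a minimal sketch of what the inputs to a supervised prompt optimizer can look like, written with the open-source DSPy library discussed below. The task, examples, and metric are purely illustrative:

```python
import dspy

# A small set of labeled examples is often enough to drive a supervised
# prompt optimizer (the task and examples here are purely illustrative).
trainset = [
    dspy.Example(question="Who wrote 'Middlemarch'?", answer="George Eliot"),
    dspy.Example(question="What is the capital of Australia?", answer="Canberra"),
    dspy.Example(question="In what year did the Berlin Wall fall?", answer="1989"),
]
trainset = [ex.with_inputs("question") for ex in trainset]

# A weak supervision signal: did the system's answer match the label?
def exact_match(example, prediction, trace=None):
    return example.answer.strip().lower() == prediction.answer.strip().lower()
```

That is the entire supervision story: a list of examples and a scoring function. Compare that to the infrastructure needed to fine-tune model weights.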

Despite this incredible track record, prompt optimizers have not been prominent in the AI discourse, and discussions of them have often even been dismissive. This is regrettable, because it has pushed people away from these tools and towards much more expensive and challenging techniques (e.g., online RL fine-tuning methods).

Luckily, attitudes may be shifting. In particular, OpenAI’s DevDay announcements last Monday (October 6) included new tools for prompt optimization, as part of their new AgentKit. During the Latent Space podcast on October 7, Sherwin Wu and Christina Huang acknowledged that this marks a change in perspective when it comes to prompt optimization:

swyx (Latent Space editor): The other thing I think online, I see the developer community are very excited about is automated prompts optimization, which is kind of evals in the loop with prompts. What’s the thinking there? Where’s things going?

Christina Huang (Platform Experience, OpenAI): Yeah. So we have automated prompt optimization. But again, I think this is an area that we definitely want to invest more in.

[…]

Sherwin Wu (Head of Engineering, OpenAI Platform): I actually think it’s a really cool time right now in prompt optimization. […] And it’s interesting because it’s coming at a time when people are realizing that prompt– I feel like two years ago, people were like, oh, at some point, prompting is going to be dead.

swyx: No. It’s going up!

Sherwin Wu: Yeah, yeah, yeah. And if anything, it is like becoming more and more entrenched. And I think that there’s this interesting trend where it’s becoming more and more important, and then there’s also interesting cool work being done to further entrench prompt optimization. And so that’s why I just think it’s a very fascinating area to follow right now. And also is an area where I think a lot of us were wrong two years ago, because if anything, it’s only gotten more important.

For the entire two-year period referred to above, I have been participating in research showing that prompt optimizers are effective. For example, the DSPy paper (October 2023) focused on optimizing few-shot examples, the MIPROv2 paper (June 2024) showed how to jointly optimize instructions and few-shots, and our new GEPA optimizer (July 2025) uses LLM feedback to iteratively improve prompts. All these optimizers require no access to model weights and so can be used pretty much anywhere, and they combine nicely with fine-tuning methods (including online RL methods). The Latent Space discussion suggests that GEPA is the one that finally broke through, but prompt optimizers have been posting large gains in academic and real-world settings for this entire period.

So where did the idea come from that prompt optimizers were dead? My leading theory is that the early stories about their successes tended to focus on quirky things they discovered to improve performance. For example, in early 2024, researchers at VMware reported that the best optimized prompts for their math tasks included information about Star Trek. This led to headlines like “AIs are more accurate at math if you ask them to respond as if they are a Star Trek character – and we’re not sure why”. This does sound like a brittle finding. If you equate this kind of oddity with running a prompt optimizer, you might conclude that such optimizers won’t last. There is good evidence that related tricks (“You’re an expert in”, “You’ll go to jail unless…”) have lost much of their value, so that conclusion is reasonable. The mistake comes in assuming that this is what prompt optimizers actually do.

What prompt optimizers actually do is find latent requirements and strategies, leading to much fuller and therefore much more successful system specifications. In case you haven’t seen this in action, check out Figure 2 of the GEPA paper. The starting prompt is “Given the fields question, summary_1, produce the fields query”. The optimized prompt is a detailed specification for how to use search results to break a complex question down into simpler pieces. It fully defines the task, identifies five latent requirements, and specifies an entire strategy for using the available evidence to solve the task. All of this comes from data-driven optimization (no part was written by hand), and it leads to a 22-point performance gain.
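For readers who think in code: in DSPy terms, that starting prompt is essentially a bare signature. The sketch below is my own illustration of what such a starting point looks like; the class name, field descriptions, and module choice are mine, not the paper’s:

```python
import dspy

# The unoptimized starting point: a one-line instruction (the docstring)
# plus field names. This is all the optimizer is given.
class GenerateQuery(dspy.Signature):
    """Given the fields question, summary_1, produce the fields query."""

    question: str = dspy.InputField()
    summary_1: str = dspy.InputField()
    query: str = dspy.OutputField()

second_hop = dspy.ChainOfThought(GenerateQuery)

# An optimizer such as GEPA or MIPROv2 replaces the one-line docstring with a
# detailed specification of the task, its latent requirements, and a strategy
# for using the available evidence; the surrounding code never changes.
```

The optimized version of that docstring is the kind of specification you would hope a very diligent engineer would eventually write by hand, except that here it is discovered from data.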

I think a second factor has also been working against prompt optimizers: the trend in research towards RL-based post-training methods. These methods use weak behavioral signals to improve AI systems through wide exploration, and they have proven critical in obtaining models that can follow complex and nuanced instructions and use tools effectively. It’s natural to infer that these methods have the potential to do all the work of prompt optimizers and more. What gets elided in these discussions is that RL methods only work if you have outstanding prompts. RL methods explore the space of possible behaviors, but prompts define that space. Bad prompts mean you’re exploring the wrong space entirely, and no amount of compute will fix that. In other words, doing RL without great prompts is like bird watching on the moon – very expensive and time-consuming, and, even with the best maps and binoculars, you’re not going to have much luck.

I want to venture a third (and somewhat self-congratulatory) reason for why prompt optimizers have been sidelined until now: it’s hard to invent new prompt optimizers! They are all seeking to do some form of discrete optimization – they are moving through a space of possible texts, trying to iteratively find better ones. Whereas we have centuries of calculus to help us with weight optimization, we have comparatively few techniques for optimizing texts directly. In the hierarchy of learning problems, supervised learning is easiest, then RL, and then this kind of discrete optimization. As a result, even though present-day optimizers like MIPROv2 and GEPA are amazing, they are likely very primitive compared to what we will find in the future. We don’t yet know the path to those improvements, though, so the research can feel uncertain: it requires a bold, speculative outlook, which can make it seem like a risky research direction. RL methods were in the same place until 2022.

Bottom line: this is the perfect moment to try prompt optimizers. Models have gotten dramatically better at following complex instructions, which means there’s more room for optimization to matter. Relatedly, as models have improved, our expectations for GenAI have soared, and manual prompt engineering has no hope of meeting these expectations. In this context, optimization stops being a nice-to-have and starts being essential. Fortunately, the technical overhead is now very low. The open-source DSPy library has made it easy to apply the most powerful optimizers to even very complex systems – and perhaps OpenAI will catch up soon 🙃.
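If you want to see how low the overhead is, a minimal end-to-end run looks roughly like the sketch below. The model name is only an example, the toy trainset stands in for real data (in practice you want a few dozen to a few hundred examples), and optimizer options such as the `auto` budget setting vary across DSPy versions, so check the current docs:

```python
import dspy

# Point DSPy at whatever model you use; this model name is only an example.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# The program to optimize: a single chain-of-thought QA step.
program = dspy.ChainOfThought("question -> answer")

# A toy trainset and metric, just to show the shape of the call.
trainset = [
    dspy.Example(question="Who wrote 'Middlemarch'?", answer="George Eliot").with_inputs("question"),
    dspy.Example(question="What is the capital of Australia?", answer="Canberra").with_inputs("question"),
]

def exact_match(example, prediction, trace=None):
    return example.answer.strip().lower() == prediction.answer.strip().lower()

# MIPROv2 jointly optimizes the instructions and few-shot demonstrations
# against the metric; with GEPA the call is similar.
optimizer = dspy.MIPROv2(metric=exact_match, auto="light")
optimized_program = optimizer.compile(program, trainset=trainset)

print(optimized_program(question="Who wrote 'War and Peace'?").answer)
```

The compiled program is just another DSPy module: you call it like the original, and you can inspect the optimized prompt it now carries.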

The real challenge is obtaining the right data to optimize against. We all know the sayings: “Garbage in, garbage out”, “There’s no free lunch”, etc. These lessons remain in effect. The critical thing is getting feedback from the right people on the right examples, so that your optimizer has a chance to surface the right latent requirements and strategies. At Bigspin, this collaborative process is our focus. We make the most powerful prompt optimizers available with the click of a button, but the transformative value comes from the expert feedback that flows into that optimization step.

Chris Potts
Co-Founder & Chief Scientist