The PB&J Problem

There's a YouTube series where a father asks his kids to write instructions for making a peanut butter and jelly sandwich. The kids — maybe 8 and 10 years old — write down what seems to them like clear, complete instructions. The father follows them exactly. He never produces an acceptable PB&J.

Not because any single instruction was wrong, but because information was missing.

When the instructions say "spread the peanut butter over the bread" — over which side? The cut side, the crust, or all around it? When they say "put the two pieces together," do you put the peanut butter and jelly sides facing each other or facing out?

The kids are baffled, of course. To them, these things are obvious. Of course you spread it on the inside. Of course the fillings go inside when you close the sandwich. Everyone knows that.

But imagine the father isn't just being difficult. Imagine he's a superintelligent being from another planet. He's incredibly capable, far more so than the children, but he has zero prior context for what a PB&J is. He genuinely wants to help. Whatever instructions the children give him, he'll try his absolute best to follow faithfully.

We're the children. The LLM is the superintelligent alien. We're all writing instructions we think are complete for a task that feels obvious. The hard part, we've told ourselves, is just making sure the peanut butter, jelly, and a knife are available.

But still we can't get the superintelligent alien to create PB&Js the way we want.

The obvious response: make the alien smarter. Make sure it consumes everything ever written about sandwiches and every YouTube video, too. Make it even better at reasoning about tools and inferring intent. It'll be able to generalize and won't even need instructions.

Sure, that helps — the alien will know peanut butter goes on the inside, not the crust. But it doesn't solve the core problem. Even with perfect general knowledge about sandwiches, there are countless decisions: crunchy or smooth peanut butter? Thin or thick? Cut in squares or diagonals? Crust on or off?

The alien can't magically guess what these kids want. The kids themselves might not even know what they prefer until they see the wrong version. They've never had to articulate that they only like it cut diagonally, or that the peanut butter needs to go all the way to the edges. These preferences only become visible when violated.

Now scale this up. The father isn't just making sandwiches for his kids at home — he's so good he's running the PB&J program for a school cafeteria.

Success doesn't simplify the problem — it multiplies it. More students means more preferences and edge cases to accommodate. Which brands meet the district's nutrition standards? Should sandwiches for elementary vs. middle school students be sized differently? What about breakfast vs. snack time? What are the peanut safety guidelines you need to adopt to comply with local and school policy?

A superintelligent LLM can know everything about sandwiches, have read every textbook on cafeteria management, can access and reason about the precise legal language for any applicable law — but how does it decide to use this power? Absent a specification, it will guess.

And it won't get it right the first time. Because the answer is specific to this school, these students, and the specific goals the school has for the PB&J program.

That can't be written down on a stone tablet ahead of time. It has to be learned — through operation and through feedback. Through seeing which sandwiches get eaten versus thrown away. Through discovering which accommodations actually matter. Through the continuous feedback loop of running your specific cafeteria with your specific students and watching what actually works.

The question isn't whether you can write perfect specifications upfront. You can't — no one can. The question is whether you're building infrastructure that systematically captures what you learn through operation and turns it into continuous improvement.

Requirements Emerge Through Usage, Not Planning

Right now, teams are choosing between two fundamentally different approaches to AI development. Most don't realize the choice they're making.

One path: treat AI like traditional software. Gather requirements, craft the prompt, deploy, collect feedback in Slack threads and spreadsheets, update the prompt when someone finally has time. Six months in, you have a 3,000-word system prompt that nobody dares touch. Your domain experts are exhausted. The AI works "fine," but it stopped improving months ago.

The other path: build infrastructure that treats every interaction as signal. Quality improvements compound automatically. The system gets smarter every day.

The gap between these two doesn't stay constant. It multiplies, because requirements don't emerge just once; they accumulate over time.

When it's just you at home making sandwiches for yourself, "peanut butter sandwich" might be enough specification as a starting point. When you're running a cafeteria, you need policies for peanut allergies, preferences for different customer segments, compliance with local food regulations, vendor relationships that determine which ingredients you use.

The same thing happens with AI. When you're building a proof of concept, you might have 10 requirements. Six months later, you have 50. A year later, 200. Most discovered only after seeing the AI in action.

A customer success team deploys an AI to help draft responses. It works fine initially. Then someone notices it's too verbose for mobile users. Then they realize it doesn't adapt tone appropriately for upset vs. curious customers. Then they find it handles the same question differently depending on whether it comes through chat vs. email.

None of these were in the original requirements doc. Not because anyone failed at their job, but because you only discover these things matter when you see the AI running in production.

Teams that treat requirements as fixed are optimizing the wrong thing. They're polishing a prompt based on what they thought mattered in week one, while teams that treat requirements as emergent are learning what actually matters in week twenty. The gap compounds fast.

Six months in, one team is still debating whether their prompt should be more or less verbose. The other team has learned that verbosity doesn't matter—what matters is whether the first sentence directly addresses the user's emotional state. That's not something you could have specified upfront. That's something discovered by watching usage.

But here's what makes this tricky: the person who can see the problem is rarely the person who built the system.

The AI developer sees that the AI followed the prompt. It was concise like you asked. It pulled from the knowledge base like you specified. It formatted the output correctly. Quality looks fine.

The customer support lead looks at the exact same output and sees that it was so concise it came across as dismissive. That it pulled a technically accurate article that doesn't actually help this specific customer. That the formatting is correct but the flow is wrong — it buried the most important information three paragraphs down where frustrated users will never see it.

The developer built exactly what was specified. The support lead is seeing their customer satisfaction scores drop.

You asked the support lead for requirements at the beginning. They gave them to you. You dutifully incorporated them into the prompt. But requirements documents describe what people think matters before they see the system in action. Real usage reveals what actually matters.

The legal expert can tell you they need the AI to flag risky clauses. They can't tell you ahead of time that the AI's threshold for "risky" is too conservative and creating alert fatigue, or that it's missing a specific pattern that only shows up in 2% of contracts but represents 40% of actual legal risk. They discover that when they see it running.

This matters more than most teams realize, because the biggest AI opportunities aren't incremental improvements to existing software—they're entirely new categories that don't exist yet.

AI coaching, for example, isn't just cheaper human coaching — it's a fundamentally different experience. Users engage differently, expect different things, need support in different ways. No amount of human coaching expertise could tell you upfront what makes AI coaching good. You have to discover it.

AI-native customer support isn't just automated ticketing. AI-native legal review isn't just faster contract analysis. AI-native education, sales, operations—these aren't slightly better versions of existing tools. They're new categories with new rules.

For the most transformative applications, nobody knows what good looks like yet. Not you, not your domain experts, not your users. It has to be discovered.

Building Infrastructure for Continuous Learning

The traditional software development model: gather requirements → build the system → deploy.

The AI development model has to be: build the system → deploy → discover requirements through usage → improve the system → discover new requirements → improve again… continuously.

This isn't a failure of requirements gathering. It's a recognition that for AI systems, true specifications can only be discovered through real-world usage, not predetermined through planning.

Every interaction is training data. The moment you put your AI in front of actual users — internal teammates or customers — you unlock something valuable: a direct line between your AI system and what "good" actually means.

You get it explicitly: thumbs up and down, comment feedback, an ops person changing the AI's answer before sending it.

You get it latently too: a user abandons the flow halfway through, or asks the same question three different ways trying to get a useful answer. Did they get what they needed? Did they seem confused or frustrated? What triggered that?

The signal is there. It's being generated with every single interaction. The question is whether you're set up to capture it and learn from it.
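To make "capture it" concrete, here's a minimal sketch in Python of what a single feedback record might hold. The names (FeedbackEvent, rephrase_count, and so on) are hypothetical illustrations, not any particular product's schema; the point is that explicit and latent signals travel together with the output they describe.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical schema: one record per AI interaction, capturing both
# explicit signals (ratings, comments, edits) and latent ones
# (abandonment, rephrasing) alongside the output they describe.

@dataclass
class FeedbackEvent:
    interaction_id: str
    ai_output: str                      # what the AI actually produced
    final_output: Optional[str] = None  # what a human sent after editing, if anything
    rating: Optional[int] = None        # explicit: +1 / -1 thumbs
    comment: Optional[str] = None       # explicit: free-text feedback
    abandoned: bool = False             # latent: user left the flow partway through
    rephrase_count: int = 0             # latent: same question asked again, reworded
    channel: str = "chat"               # chat, email, etc.
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def was_edited(self) -> bool:
        """An expert rewriting the answer is one of the strongest quality signals."""
        return self.final_output is not None and self.final_output != self.ai_output
```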

The current approach puts impossible burdens on everyone. It asks developers to be domain experts in areas they can't be. It asks domain experts to become prompt engineers when that's not their job. It asks everyone to communicate requirements they don't yet know they have.

The feedback loop exists—users react, experts notice, developers want to fix it—but it runs through Slack messages and spreadsheets and meetings, and by the time it's synthesized into a prompt change, you're three weeks late and the requirements have already evolved again.

The wrong approach: Write a comprehensive prompt. Deploy it. Collect feedback manually in spreadsheets and Slack threads. Update the prompt quarterly when someone finally has time to wade through all the examples.

The right approach: Build infrastructure where every interaction feeds your understanding of quality. Where thumbs down from your power user and confused follow-up questions from casual users both flow into the same system. Where feedback automatically becomes training data. Where the AI continuously learns what "good" means for your specific use case.
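Continuing the hypothetical sketch above, "flow into the same system" might look like a single conversion step: every event, whatever its source, becomes a labeled example. Expert edits turn into preference pairs; ratings and latent signals turn into quality labels. The field names here are illustrative, not a prescribed format.

```python
from typing import Iterable

def to_training_examples(events: Iterable[FeedbackEvent]) -> list[dict]:
    """Turn raw feedback events into examples for evaluation or fine-tuning."""
    examples = []
    for e in events:
        if e.was_edited:
            # An expert's rewrite is a preference pair: their version over the AI's.
            examples.append({
                "interaction_id": e.interaction_id,
                "rejected": e.ai_output,
                "chosen": e.final_output,
                "source": "expert_edit",
            })
        elif e.rating is not None or e.abandoned or e.rephrase_count > 0:
            # Ratings and latent signals become graded quality labels.
            label = "good" if (e.rating or 0) > 0 and not e.abandoned else "needs_review"
            examples.append({
                "interaction_id": e.interaction_id,
                "output": e.ai_output,
                "label": label,
                "signals": {
                    "rating": e.rating,
                    "abandoned": e.abandoned,
                    "rephrase_count": e.rephrase_count,
                },
                "source": "usage_signal",
            })
    return examples
```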

Your legal team's specific contract review process isn't in the training data. Your customer support quality standards — the exact balance of warmth and professionalism that works for your brand — aren't in the training data. The particular way your ops team wants the AI to escalate edge cases, the specific domain knowledge that separates good from great in your industry, the subtle judgment calls that your experts make instinctively — none of that is in any training corpus.

The alien can read every sandwich book ever written. But that still doesn't mean it knows how to run your cafeteria.

The Compounding Advantage

At the beginning, the two teams look similar. Team A has a carefully crafted prompt that works pretty well. Team B has a feedback system that's just getting started.

A few months in, Team B is already noticeably better. They've learned things about their domain that Team A is still discovering through complaints. The AI feels more natural, more helpful, more right to users.

A year in, it's not even close. Team B has incorporated thousands of quality signals. Their AI understands the subtle stuff — when to be formal versus casual, which edge cases actually matter, how to adapt to different user contexts. Team A is still debating whether to add another section to their prompt about tone.

Team A is also moving slower over time. More complexity means more fragility means more risk in every change. Team B is moving faster — more data means better understanding means tighter feedback loops.

This advantage is nearly impossible to copy. Your prompt isn't your moat — it's just today's output of your learning system. It was optimized for your specific use case, within your specific app architecture, using quality signals from thousands of interactions. Tomorrow you'll have a better one incorporating the last 24 hours of feedback.

A competitor who copies your prompt gets a point-in-time snapshot optimized for a context they don't have. The edge cases and subtle quality judgments you've encoded came from seeing 10,000 real interactions. A competitor would need to generate and learn from 10,000 similar interactions to catch up. That's not a technical barrier — it's a time barrier.

The advantage isn't the list of requirements; it's the infrastructure that continuously discovers new ones as the product evolves, your users change, and edge cases emerge. It's not that you've discovered what good means today — it's that you have a system that keeps learning what good means tomorrow.

For new categories, the advantage isn't just compounding — it's definitional. The first team to figure out what makes AI coaching genuinely helpful doesn't just build a better product. They discover and encode the quality standard that defines the category. Every competitor afterward is playing catch-up to expectations that team established.

The race isn't to start with the best requirements. It's to learn fastest what "good" means for something that's never existed before. The team that figures it out first doesn't just win that implementation. They define the category. They establish what users come to expect from that type of AI. Everyone else gets measured against the quality bar they discovered.

The learning infrastructure isn't just making you better. It's making you the one who discovers what "better" means.

Moritz Sudhof
CEO + Co-Founder