One of the pleasant surprises of the last few years is that the cost of using the frontier LLMs has come down dramatically. For example, GPT-3 cost $60 per million tokens in November 2021, whereas Guido Appenzeller estimates that a comparable model cost $0.06 per million tokens in 2024 – a 1000x decrease. As of this writing, the best models cost about $10 per million tokens, and Appenzeller’s $0.06 estimate for GPT-3-level performance still holds. The latest Stanford AI Index reports similar numbers (p. 64).
Will it last? This seems like a very difficult call. It depends on how the competitive landscape evolves, how the frontier model providers decide to make money, how the tariff situation unfolds, whether we continue to see outstanding open-weights model releases, and whether deeper structural factors cause the entire edifice to crumble.
In light of this extreme uncertainty, I have come to realize that I am over-using the frontier LLMs simply because it is currently so cheap to use them. For instance, I wrote a bunch of little DSPy programs to help with mundane admin tasks. I think they are all using GPT-4.1 as their LLM (it’s been a while since I checked!), but I am sure this is overkill. Similarly, the Bigspin app contains a small fleet of micro-services for doing things like generating project titles, naming requirements, parsing user messages, and analyzing spreadsheet headers. All my tests show that Llama-3.3-70B-Instruct-Turbo gets an A+ on these tasks, but we still have Claude or GPT-4.1 doing them most of the time. This is like firing up a supercomputer to answer your email.
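To make that concrete, here is a minimal sketch of what right-sizing one of these little DSPy programs might look like. The admin task, the signature, and the provider-specific model strings are illustrative assumptions, not the actual Bigspin or DSPy code; the point is just that moving from GPT-4.1 to something like Llama-3.3-70B is roughly a one-line configuration change.

```python
import dspy

# What I have been doing: pointing even trivial admin helpers at a frontier model.
# frontier_lm = dspy.LM("openai/gpt-4.1")

# What these tasks actually need (the model string assumes a Together-style provider):
cheap_lm = dspy.LM("together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo")
dspy.configure(lm=cheap_lm)

# A hypothetical admin helper: turn a reminder email into a single action item.
summarize = dspy.Predict("email_thread -> action_item")

result = summarize(email_thread="Reminder: the Q3 budget review moved to Friday at 10am.")
print(result.action_item)
```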
This observation extends well beyond simple tasks. I now use Bigspin to build all my complex LLM programs. I love that, in the app, I can partner with the best frontier models to assemble the needed information, generate synthetic test cases, explore edge cases, and get feedback on how my prompts look. That exploratory work really does call for the strongest models available. However, once I have a great prompt, there is often no detectable performance difference between running it with a frontier model and running it with a smaller open-weights model.
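When I want evidence for that claim rather than a vibe, the check is simple: run the finished prompt against both kinds of model and compare. Here is a hedged sketch, assuming OpenAI-compatible endpoints; the base URL, environment variables, prompt, and model names are placeholders for whatever providers and tasks you actually use.

```python
import os
from openai import OpenAI

# Two OpenAI-compatible clients: one frontier model, one open-weights model
# served by a third-party provider (the base URL is an assumption; adjust as needed).
frontier = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
open_weights = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

# The "great prompt" developed with a frontier model's help (placeholder here).
PROMPT = "Extract a short project title from this user message: {message}"
MESSAGE = "hey can you set up the tracker for our Q3 site redesign?"

def run(client: OpenAI, model: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(message=MESSAGE)}],
        temperature=0,
    )
    return response.choices[0].message.content

# Compare the two outputs side by side; in my experience they usually match.
print("frontier:    ", run(frontier, "gpt-4.1"))
print("open-weights:", run(open_weights, "meta-llama/Llama-3.3-70B-Instruct-Turbo"))
```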
I am sure I am not alone. My hunch is that almost everyone is over-using the frontier LLMs in this way. We can do this because the costs are so remarkably low. I am reassured that, if prices do go up, there are alternatives – though, I will say, this highlights just how much we are implicitly counting on the world to continue to produce amazing open-weights models.
Are we all over-using the frontier LLMs?