Prior to the release of o3 and o4-mini, all “Deep Research” requests in ChatGPT would hit the o3 reasoning model. Indeed, for a while, this was the only way to use o3 at all.
Today, I fired off a “Deep Research” request while I had o4-mini-high selected and noticed that, instead of thinking/researching for minutes, it “only” thought for 37 seconds. This was despite the usual Deep Research follow-up question clarifying the original request.
Naturally, the number of sources considered was lower and the generated report was much less detailed. (FWIW, I’m on the Plus plan. I could imagine that one way OpenAI might make Deep Research more widely available is to use o4-mini instead of o3, which consumes more GPU. Similarly, I suspect the public message that o4-mini is “better than” o3 for coding is really a psyop caused by GPU constraints.)
In this case, given the much shorter thinking duration, it was quite clear that this was a “different” Deep Research request from the usual one. However, thinking duration is obviously an unreliable indicator and, as with most things in software, the provider can easily swap out what runs underneath while keeping the user interface the same.
More generally, I feel the underlying tension between giving power users transparency and control over exactly which model and process runs, and offering consumers a seamless product whose internals the provider is free to change.
The history of computing has shown that, in the long run, at least for consumer use cases, the latter will dominate. But in the early stages, the builders largely share cultural affinities with the early adopters and will inevitably cater to them (in the same way architects design for other architects, or directors make films for critics). For API users, this creates a lot of complexity that has to be managed, as the sketch below illustrates.
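As a minimal sketch of managing that complexity, assuming the official `openai` Python SDK and an `OPENAI_API_KEY` in the environment (the model name and prompt are illustrative): an API user can at least log the `model` field the API reports back, so a silent server-side swap leaves a trace in their own records. The ChatGPT UI offers no equivalent signal.

```python
import datetime

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

requested_model = "o4-mini"  # what we asked for (illustrative)
response = client.chat.completions.create(
    model=requested_model,
    messages=[{"role": "user", "content": "Summarize recent work on X."}],
)

# The response echoes the model that actually served the request,
# often a dated snapshot such as "o4-mini-2025-04-16".
served_model = response.model
timestamp = datetime.datetime.now().isoformat()
print(f"{timestamp} requested={requested_model} served={served_model}")

# Flag a mismatch so a quiet substitution at least shows up in logs.
if not served_model.startswith(requested_model):
    print("warning: served model does not match the requested one")
```

Of course, this only catches what the provider chooses to report; it is a paper trail, not a guarantee.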
Zooming out, I wonder whether these are simply emergent properties of the fact that, right now, LLMs are just not smart and adaptable enough. A real AGI (among other things, and setting aside the P(doom) stuff for now) should surely be malleable enough to do general research in whatever style we choose.