Is that it? Is it sexist because the physicist and mathematician are attracted to the naked woman?
In my experience, people's ideas of "offensive" are all over the map. However, accusations of being offensive all get treated the same: punishment for offending is a binary function of the accusation, not of the actual offense.
It's this mismatch which has contributed heavily towards society's whiplash over the last decade.
Disagree - I actually think all the problems the author lays out about Deep Research apply just as well to GPT-4o / o3-mini-whatever. These things are just absolutely terrible at precision and recall of information.
I think Deep Research shows that these things can be very good at precision and recall of information if you give them access to the right tools... but that's not enough, because of source quality. A model that has great precision and recall but uses flawed reports from Statista and Statcounter is still going to give you bad information.
Deep Research doesn’t give the numbers that are in statcounter and statista. It’s choosing the wrong sources, but it’s also failing to represent them accurately.
Wow, that's really surprising. My experience with much simpler RAG workflows is that once you stick a number in the context the LLMs can reliably parrot that number back out again later on.
Presumably Deep Research has a bunch of weird multi-LLM-agent things going on, maybe there's something about their architecture that makes it more likely for mistakes like that to creep in?
Have a look at the previous essay. I couldn't get ChatGPT 4o to give me a number in a PDF correctly even when I gave it the PDF, the page number, and the row and column.
ChatGPT treats a PDF upload as a data extraction problem, where it first pulls out all of the embedded textual content on the PDF and feeds that into the model.
This fails for PDFs that contain images of scanned documents, since ChatGPT isn't tapping its vision abilities to extract that information.
Claude (and Gemini) both apply their vision capabilities to PDF content, so they can "see" the data.
So my hunch is that ChatGPT couldn't extract useful information from the PDF you provided and instead fell back on whatever was in its training data, effectively hallucinating a response and pretending it came from the document.
That's a huge failure on OpenAI's behalf, but it's not illustrative of models being unable to interpret documents: it's illustrative of OpenAI's ChatGPT PDF feature being unable to extract non-textual image content (and then hallucinating on top of that inability).
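To make that failure mode concrete, here is a minimal sketch (assuming the pypdf library; the file name is hypothetical) of the check a text-only pipeline implicitly relies on: a scanned PDF has no embedded text layer, so plain extraction returns nothing and the model has no real content to ground on.

    from pypdf import PdfReader

    def extractable_text(path: str) -> str:
        # Concatenate whatever text layer the PDF actually contains.
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    text = extractable_text("report.pdf")  # hypothetical file
    if text.strip():
        print("Embedded text found; a text-only pipeline can use it as context.")
    else:
        # Pages are just images: you need OCR or a vision-capable model,
        # which is what Claude and Gemini do for PDF input.
        print("Scanned document; text extraction alone comes back empty.")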
Interesting, thanks.
I think the higher level problem is that 1: I have no way to know this failure mode when using the product and 2: I don't really know if I can rely on Claude to get this right every single time either, or what else it would fail at instead.
Yeah, completely understand that. I talked about this problem on stage as an illustration of how infuriatingly difficult these tools are to use because of the vast number of weird undocumented edge cases like this.
This is an unfortunate example though because it undermines one of the few ways in which I've grown to genuinely trust these models: I'm confident that if the model is top tier it will reliably answer questions about information I've directly fed into the context.
[... unless it's GPT-4o and the content was scanned images bundled in a PDF!]
It's also why I really care that I can control the context and see what's in it - systems that hide the context from me (most RAG systems, search assistants etc) leave me unable to confidently tell what's been fed in, which makes them even harder for me to trust.
Unfortunately that's not how trust works. If someone comes into your life and steals $1,000, and then the next time they steal $500, you don't trust them more, do you?
Code is one thing, but if I have to spend hours checking the output, then I'd be better off doing it myself in the first place, perhaps with the help of some tooling created by AI, and then feeding that into ChatGPT to assemble into a report. By showing off a report about smartphones that is total crap, I can't remotely trust the output of deep research.
> Now after two years look at Cursor, aider, and all the llms powering them, what you can do with a few prompts and iterations.
I don't share this enthusiasm, things are better now because of better integrations and better UX, but the LLM improvements themselves have been incremental lately, with most of the gains from layers around them (e.g. you can easily improve code generation if you add an LSP in the loop / ensure the code actually compiles instead of trusting the output of the LLM blindly).
I agree, they are only starting the data flywheel there. And at the same time making users pay $200/month for it, while the competition is only charging $20/month.
And note, the system is now directly competing with "interns". Once the accuracy is competitive (is it already?) with an average "intern", there'd be fewer reasons to hire paid "interns" (more expensive than $200/month). Which is maybe a good thing? Fewer kids wasting their time/eyes looking at the computer screens?
Everyone who has been working on RAG is aware of how important source control is. Simply directing your agent to fetch keyword matching documents will lead to inaccurate claims.
The reality is that for now it is not possible to leave the human out of research, so I think the best an LLM can do is help curate sources and synthesize them; it cannot reliably write sound conclusions.
Edit: this is something elicit.com recognized quite early. But even when I was using it, I was wishing I had more control over the space over which the tool was conducting search.
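As a concrete illustration of that "control over the search space" point, here is a minimal sketch (the data structures and the allowlist are hypothetical) of filtering retrieved documents against a curated set of sources before anything reaches the model's context:

    from urllib.parse import urlparse

    # Hypothetical allowlist of sources the researcher actually trusts.
    TRUSTED_HOSTS = {"idc.com", "counterpointresearch.com", "census.gov"}

    def filter_by_source(retrieved_docs: list[dict]) -> list[dict]:
        """Keep only documents whose host is on the curated allowlist."""
        kept = []
        for doc in retrieved_docs:
            host = urlparse(doc["url"]).netloc.removeprefix("www.")
            if host in TRUSTED_HOSTS:
                kept.append(doc)
        return kept

    docs = [
        {"url": "https://www.idc.com/some-report", "text": "..."},
        {"url": "https://random-seo-blog.example/top-10", "text": "..."},
    ]
    print([d["url"] for d in filter_by_source(docs)])  # only the idc.com link survives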
I always wondered: if Deep Research has an X% chance of producing errors in its report, and you have to double-check everything, visit every source, and potentially correct things yourself, does it really save time in getting research done (outside of coding and marketing)?
It might depend on how much you struggle with writers block. An LLM essay with sources is probably a better starting point than a blank page. But it will vary between people.
This article covers something early on that makes the question of “will models get to zero mistakes” pretty easy to answer: No.
Even if they do the math right and find the data you ask for and never make any “facts” up, the sources of the data themselves carry a lot of context and connotation about how the data is gathered and what weight you can put on it.
If anything, as LLMs become a more common way of ingesting the Internet, the sources of data themselves will start being SEOed to get chosen more often by the LLM purveyors. Add in paid sponsorship, and trust in the data from these sorts of Deep Research models will only get worse over time.
"Deep research" is super impressive, but so far is more "search the web and surf pages autonomously to aggregate relevant data"
It is in many ways a workaround to Google's SEO poisoning.
Doing very deep research requires a lot of context, cross-checking data, resourcefulness in sourcing and taste. Much of that context is industry specific and intuition plays a role in selecting different avenues to pursue and prioritisation. The error rates will go down but for the most difficult research it will be one tool among many rather than a replacement for the stack.
Watched the recent Viva La Dirt League videos on how trailers lie and make false promises. Now I see LLMs as that marketing guy. Even if he knows everything, he can't help lying. You can't trust anything he says no matter how authoritative he sounds; even if he is telling the truth, you have no way of knowing.
These deep research things are a waste of time if you can't trust the output. Code you can run and verify. How do you verify this?
These days I'm feeling like GenAI has an accuracy rate of maybe 95-96%. Great at boilerplate, great at stuff you want an intern to do or maybe to outsource... but it really struggles with the valuable stuff. The errors are almost always in the most inconvenient places and they are hard to see... So I agree with Ben Evans on this one: what is one to do? The further you lean on it, the worse your skills and specializations get. It is invaluable for some kinds of work, greatly speeding you up, but then some of the things you would have caught take you down a rabbit hole that wastes so much time. The tradeoffs here aren't great.
I think it's not the valuable stuff though. The valuable stuff is all the boilerplate, because I don't want to do it. The rest I actually have a stake in - not only that it's done, but how it's done. And I'd rather be hands-on doing that and thinking about it as I do it. Having an AI do that isn't that valuable, and in fact robs me of the learning I acquire by doing it myself.
I'll share my recipe for using these products on the off chance it helps someone.
1. Only do searches that result in easily verifiable results from non-AI sources.
2. Always perform the search in multiple products (Gemini 1.5 Deep Research, Gemini 2.0 Pro, ChatGPT o3-mini-high, Claude 3.7 w/ extended thinking, Perplexity)
With these two rules I have found the current round of LLMs useful for "researchy" queries. Collecting the results across tools and then throwing out the 65-75% slop results in genuinely useful information that would have taken me much longer to find.
Now the above could be seen as a harsh critique of these tools, as in the kiddie pool is great as long as you're wearing full hazmat gear, but I still derive regular and increasing value from them.
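Here is a rough sketch of rule 2 above (fan the same query out to more than one provider and compare by hand), assuming the OpenAI and Anthropic Python SDKs with API keys in the environment; the model names and the query are placeholders:

    from openai import OpenAI
    import anthropic

    QUERY = "List primary sources for US smartphone install-base estimates."

    answers = {}

    # OpenAI
    openai_client = OpenAI()  # reads OPENAI_API_KEY
    resp = openai_client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": QUERY}],
    )
    answers["openai"] = resp.choices[0].message.content

    # Anthropic
    anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
    msg = anthropic_client.messages.create(
        model="claude-3-7-sonnet-latest",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": QUERY}],
    )
    answers["anthropic"] = msg.content[0].text

    # Read side by side; keep only claims (with sources) that both agree on.
    for provider, answer in answers.items():
        print(f"--- {provider} ---\n{answer}\n")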
* Get Perplexity and/or ChatGPT to give feedback on the report outline, amend as required.
* Get NotebookLM and Perplexity to each write their own versions of the report one section at a time.
* Get Perplexity to critique each version and merge the best bits from each.
* Get ChatGPT to periodically provide feedback on the growing document.
* All the while acting myself as the chief critic and editor.
This is not a very efficient workflow but I'm getting good results. The trick to use different LLMs together works well. I find Perplexity to be the best at writing engaging text with nice formatting, although I haven't tried Claude yet.
By choosing the NotebookLM sources carefully you start off with a good focus, it kind of anchors the project.
I do that a lot, too, not only for research but for other tasks as well: brainstorming, translation, editing, coding, writing, summarizing, discussion, voice chat, etc.
I pay for the basic monthly tiers from OpenAI, Anthropic, Google, Perplexity, Mistral, and Hugging Face, and I occasionally pay per-token for API calls as well.
It seems excessive, I know, but that's the only way I can keep up with what the latest AI is and is not capable of and how I can or cannot use the tools for various purposes.
I'm not OP but I do similar stuff. I pay for Claude's basic tier, OpenAI's $200 tier, and Gemini ultra-super-advanced I get for free because I work there.
I combine all the 'slop' from the three of them in to Gemini (1 or 2 M context window) and have it distill the valuable stuff in there to a good final-enough product.
Doing so has got me a lot of kudos and applause from those I work with.
Wow, that's eye-opening. So, just to be clear, you're paying for Claude and OpenAI out of your own pocket, and using the results at your Google job? We live in interesting times, for sure. :)
> comparing notes on financial schemes or career growth ideas
I'll admit I'm surprised you need to combine all these LLMs to get a decent result on these kinds of queries, but I guess you go deeper than what I can imagine on these topics.
So you are basically doing a first pass with diverse models, and a second pass catches contradictions and other issues? It could help with hallucinations.
"I can say that these systems are amazing, but get things wrong all the time in ways that matter, and so the best uses cases so far are things where the error rate doesn’t matter or where it’s easy to see."
Indeed, the main drawback of the various Deep Research implementations is that the quality of the sources is determined by SEO, which is often sketchy. Often the user has insight into what the right sources are, and they may even be offline on their own computer.
We built an alternative to do Deep Research (https://radpod.ai) on data you provide, instead of relying on Web results. We found this works a lot better in terms of quality of answers as the user can control the source quality.
Deep Research, as it currently stands, is a jack of all trades but a master of none. Could this problem be mitigated by building domain-specific profiles curated by experts? For example, specifying which sources to prioritize, what to avoid, and so on. You would select a profile, and Deep Research would operate within its specific constraints, supplemented by expert knowledge.
The problem with tools like Deep Research is that they imply good reasoning skills in the underlying technology. Artificial reasoning clearly exists, but it is not refined enough to build this kind of technology on top of it. Reasoning is the foundation of these systems, and everything built on top of it becomes very unstable.
This is such embarrassing marketing from an organization (OpenAI) that presents itself to the world as a "research" entity. They could have at least photoshopped the screenshot to make it look like it emitted correct information.
Yes, the confidence is honestly getting a bit out of hand. I see the same thing with coding on our SaaS: once the problems get bigger, I find myself more often than not starting to code the old way rather than "fixing the AI's code", because the issues are often too much.
I think clearer communication of certainty could help, especially when they talk about docs or third-party packages. Regularly, even Sonnet 3.7 just invents stuff...
I, for one, have it in my prompt that GPT should end every message with a note about how sure it is of the answer, and a rating of "Extremely sure", "Moderately sure", etc.
It works surprisingly well. It also provides its reasoning about the quality of the sources, etc. (This is using GPT-4o of course, as it's the only mature GPT with web access)
I highly recommend adding this to your default prompt.
That assessment gives me a good picture of which way to push with my next question, and which things to ask it to use its web search tool to find more information on.
It's a conversation with an AI; it's good to know its thought process and how certain it is of its conclusions, as it isn't infallible, nor is any human.
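For anyone who wants to try it, a minimal sketch of that default-prompt idea via the OpenAI Python SDK (the model name and the exact wording are placeholders; the instruction text is the part that matters):

    from openai import OpenAI

    client = OpenAI()

    SYSTEM_PROMPT = (
        "End every answer with a line of the form "
        "'Confidence: Extremely sure / Moderately sure / Unsure', "
        "plus one sentence on the quality of the sources you relied on."
    )

    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "What share of US smartphones run iOS?"},
        ],
    )
    print(resp.choices[0].message.content)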
One other existential question is Simpson's paradox, which I believe is exploited by politicians to support different policies from the same underlying data. I see this as a problem for government, especially if we have liberal- or conservative-trained LLMs. We expect the computer to give us the correct answer, but when the underlying model is trained one way by RLHF, or by systemic/weighted bias in its source documents -- imagine training a libertarian AI on Cato papers -- you could have highly confident pseudo-intellectual junk. Economists already deal with this problem daily, since their field was heavily politicized. Law is another one.
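For readers unfamiliar with Simpson's paradox, a tiny illustration with made-up numbers: option A wins inside every subgroup, yet loses in the aggregate, so the same data honestly supports two opposite headlines.

    # Hypothetical counts: (successes, trials) for options A and B in two subgroups.
    groups = {
        "easy cases": {"A": (81, 87), "B": (234, 270)},
        "hard cases": {"A": (192, 263), "B": (55, 80)},
    }

    for name, g in groups.items():
        for option, (wins, n) in g.items():
            print(f"{name:>10} {option}: {wins}/{n} = {wins / n:.0%}")
    # A beats B in both subgroups (93% vs 87%, 73% vs 69%)...

    for option in ("A", "B"):
        wins = sum(g[option][0] for g in groups.values())
        n = sum(g[option][1] for g in groups.values())
        print(f"   overall {option}: {wins}/{n} = {wins / n:.0%}")
    # ...yet B wins overall (83% vs 78%), because B got far more of the easy cases.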
I used Deep Research with o1-pro to try to fact/sanity check a current events thing a friend was talking about, read the results and followed the links it provided to get further info, and ended up on the verge of going down a rabbit hole that now looks more like a leftist(!) conspiracy theory.
I didn't want to bring in specifics because I didn't feel like debating the specific thing, so I guess that made this post pretty hard to parse and I should have added more info.
I was trying to convey that it had found some sources that, if I came across them naturally, I probably would have immediately recognized as fringe. The sources were threading together a number of true facts into a fringe narrative. The AI was able to get other sources on the true facts, but has no common sense, and I think ended up producing a MORE convincing presentation of the fringe theory than the source of the narrative. It sounded confident and used a number of extra sources to check facts even though the fringe narrative that threaded them all together was only from one site that you'd be somewhat apt to dismiss just by domain name if it was the only source you found.
I did a trial run with Deep Research this weekend to do a comparative analysis of the comp packages for Village Managers in suburbs around Chicagoland (it's election season, our VM's comp had become an issue).
I have a decent idea of where to look to find comp information for a given municipality. But there are a lot of Chicagoland suburbs and tracking documents down for all of them would have been a chore.
Deep Research was valuable. But it only did about 60% of the work (which, of course, it presented as if it was 100%). It found interesting sources I was unaware of, and assembled lots of easy-to-get public data that would have been annoying for me to collect that made spot-checking easier (for instance, basic stuff like the name of every suburban Village Manager). But I still had to spot check everything myself.
The premise of this post seems to be that material errors in Deep Research results negate the value of the product. I can't speak to how OpenAI is selling this; if the claim is "subscribe to Deep Research and it will generate reliable research reports for you", well, obviously, no. But as with most AI things, if you get past the hype, it's plain to see the value it's actually generating.
>>The premise of this post seems to be that material errors in Deep Research results negate the value of the product
No it’s not. It’s that it’s oversold from a marketing perspective and comes with some big caveats.
But it does talk about big time savings for the right contexts.
Emphasis from the article:
“these things are useful”
I'm just realizing this might finally be something that helps me get past the analysis paralysis I have before committing to so many decisions online. I always feel like without doing my research, I'll get scammed. Maybe this will help give me a bit more confidence.
On the flipside, you might end up getting scammed even worse because of incorrect analysis. For example if ChatGPT hallucinates some data/features through faulty research then you might be surprised when you actually make the decision.
While this will undoubtedly happen, I don't understand why this is a new phenomenon; the internet is filled with data of questionable accuracy. One should always be validating/verifying information even if Deep Research put it together.
I think the difference with Deep Research – and other hallucination- and extrapolation-prone research agents – is that, without assistance, verifying synthesized information is much more of a slog than, say, doing your own research and judging the quality of sources as you go, which "deduplicates" querying and verifying.
Of course there are straightforward ways, in terms of UX, to make verification orders of magnitude easier – e.g. inline citations – but TFA argues that OpenAI isn't quite there yet.
Ultimately, if a research agent requires us to verify significant AI-synthesized conclusions, as TFA argues, then research agents haven't actually automated the tricky and routine work that keeps us thinking about our research at a lower level than we would like.
From my experience (having hit the Deep Research quota), I wouldn’t use it to build data tables like the article did, but for qualitative or text-based research, it’s incredibly useful. Useful enough that I’d justify multiple accounts just to increase quota. People keep bringing up hallucinations, but in the reports I have built I have not noticed them to be a problem; again, I am not doing quantitative analysis with it.
Yeah, probably true. But if it includes links and sources, at the very least it'll save me some time. I can cross-check faster than I can start the research
I have found it to be exactly this in a lot of cases. It helps answer or synthesize the data that answers questions I had that are good to know but not critical for me to understand.
It's what one imagines the first cars were like - if you were mechanically inclined, awesome. If not, screwed. If you know LLMs and how a basic RAG pipeline works, deep research is wonderful. If not, screwed.
I can't help but feel there's a difference between a car that runs 90% of the time but breaks down 10% of the time, and one that turns the direction you tell it 90% of the time but the opposite direction the other 10%.
Also that you won't necessarily know when it makes that wrong turn until it's too late (you're in the river now).
If you can't critically read the output of an LLM, you shouldn't be using it for the given task. Many people have made the (good) analogy to an intern.
Similar to what Derek Lowe found with a pharma example: https://www.science.org/content/blog-post/evaluation-deep-re...
> As with all LLM output, all of these things are presented in the same fluid, confident-sounding style: you have to know the material already to realize when your foot has gone through what was earlier solid flooring.
Of which, people will surely die (when it is used to publish medical research by those trying not to lose the publish-or-perish game).
I think the "delve" curve shows were already well into "AI papers" stage of civilization. It has probably been tamped down, now; the last thing j heard using it like a tell was notebookllm.
Deep dive.
"Yes, that's exactly right"
I think of it as the return to 10 blue links. It searches the web, finds stuff and summarizes it so I can decide which links to click. I ignore the narrative it constructs because it’s probably wrong. Which I forgive because it’s the hazmat suit for the internet I’ve always dreamed of.
It gets in the trenches, braves the cookie popups, email signups, inline ads and overly clever web design so I don’t have to. That’s enough to forgive its attempts to create research narratives. But, I hope we can figure out a way to train a heaping spoonful of humility into future models.
> Which I forgive because it’s the hazmat suit for the internet I’ve always dreamed of.
You dreamed of this? Why not dream of a web where you don’t have to brave a veritable ocean of crap to get what you want? It may surprise you to learn such a web existed in the not too distant past.
Your dream wins in a competition of dream quality. But it’s not realistic.
Of course it is - like I said, this existed not really that long ago. Why do you think it's unrealistic? What's different now? Hint: this erosion started long before generative AI.
I don’t think I could say anything that hasn’t already been written about the causes of this problem. But I’m glad we have some tools that make things more pleasant while we work it out. Also, the existence of said tools could provide an incentive to go back to that world. If a website is awful, people will browse it with bots and reader modes, a good incentive for people to make their websites not suck.
I don't know whether to laugh or cry at your apparently sincere belief that spending a trillion dollars to build a fleet of data centers that collectively use more power than Switzerland is a better solution to the enshittification of the web than... not doing that.
Agreed. Maybe we're moving toward a world where LLMs do all the searching, and "websites" just turn into data-only endpoints made for AI to consume. That'll have other big implications... Interesting times ahead.
Interesting times, for sure.
> and "websites" just turn into data-only endpoints made for AI to consume.
As is already the case with humans, that only serves users to the extent that the websites' veracity is within the intelligence's ability to verify — all the problems we've had with blogspam etc. have been due to the subset of SEO where people abuse* the mechanisms to promote what they happen to be selling at the expense of genuine content.
AI generated content is very good at seeming human, at seeming helpful. A "review website" which is all that, but with fake reviews that promote one brand over the others… a chain of such websites that link to each other to boost PageRank scores… which are then cross-linked with a huge number of social media bots…
Will lead to a lot of people who think they're making an informed choice, but who were lied to about everything from their cornflakes to their president.
* Tautologically, when it's not "abuse", SEO is only helping the search engine find the real content. I've seen places fail to perform any SEO including the legitimate kind.
a sincere question; how is
>A "review website" which is all that, but with fake reviews that promote one brand over the others… a chain of such websites that link to each other to boost PageRank scores… which are then cross-linked with a huge number of social media bots…"
functionally different than a majority of news outlets in the US parroting the exact same story, verbatim?
This is extremely dangerous to our democracy. https://www.youtube.com/watch?v=_fHfgU8oMSo
* Please note, I am glibly linking that YouTube video as a completely transparent example of what I am talking about. There are many (many) more examples. Don't read into the content of the video so much; just think about the implications.
I think the only difference is one of scale and cost.
AI is automation, and the existence of fully automated propaganda doesn't deny the existence of manual propaganda before it.
> This is extremely dangerous to our democracy
Indeed, the existing manual kind of propaganda is already dangerous. Always was.
Even so, manual propaganda can at least be fought by grass-roots movements, by humans being human. This is why freedom of speech is valuable.
Hard for real humans to counter an AI that can be even moderately conversational at a cost of just USD 0.99/day to fully saturate someone's experience of the world.
Interesting idea. AI can't look at ads, so in the long run ads on informational material might die and you're back to paying outright. I like it.
>AI can't look at ads
But ads can be put in AI.
Ads? Too obvious. Just always suggest sponsors' products when it's in context, and censor your competitors so they're never mentioned in responses.
And since AI is so great and you don't want to bother going to the website and clicking through a cookie banner, as another commenter mentioned, you just won't ever know the competitor exists
I guess that's technically an ad but it's so much more subversive that "ad" doesn't really do it justice. It'll be like product placement but worse somehow
>Ads? Too obvious.
Have you dealt with people paying for ads? They want your rapt attention, not some possibly subliminal suggestion. The "good" thing about chat AIs is that they can, with a little work, make it nearly impossible to block. You want to use it for free, you get ads.
Or the content goes away entirely, and the AI has nothing to search.
The ads will just become either bulk misinformation or carefully worded data points that nudge the AI towards a more favourable view of the product being sold.
What do the websites get out of that exchange?
therein lies the problem, and is why Google search didn't disrupt itself until ChatGPT came around.
If you ignore the narrative and only look at the links then you're just describing a search engine with an AI summarization feature. You could just use Kagi and click "summarize" on the interesting results and then you don't have to worry that the sources themselves are hallucinations.
The summaries are probably still wrong but you do you, at least this would save you the step of reading bullshit and boiling a pond to generate a couple links
When ChatGPT came out, one of the things we learned is that human society generally assumes a strong correlation between intelligence and the ability to string together grammatically correct sentences. As a result, many people assumed that even GPT-3.5 was wildly more "intelligent" than it actually was.
I think Deep Research (and tools like it) offer an even stronger illustration of that same effect. Anything that can produce a well-formatted multiple page report with headings and citations surely must be of PhD-level intelligence, right?
(Clearly not.)
To be fair, OpenAI's the one marketing it as such.
In some ways, it's a good tool to teach yourself to sus out the real clues to reliability, not format and authoritative tone.
But that's the thing. The only way to truly find out if it's reliable (>90%) is to check the data yourself.
This is why metrics and leaderboards like these are so important (but under reported on): https://github.com/vectara/hallucination-leaderboard https://www.kaggle.com/facts-leaderboard
Google Gemini models seem to lead... hopefully the metrics aren't being gamed.
That's been every LLM since GPT-2.
computer use big big words ergo computer rull rull smrt
Research skills involve not just combining multiple pieces of data, but also being able to apply very subtle skills to determine whether a source is trustworthy, to cross-check numbers where their accuracy is important (and to determine when it's "important"), and to engage in some back and forth to determine which data actually applies to the research question being asked. In this sense, "deep research" is a misleading term, since the output is really more akin to a probabilistic "search" over the training data where the result may or may not be accurate and requires you to spot-check every fact. It is probably useful for surfacing new sources or making syntactic conjectures about how two pieces of data may fit together, but checking all of those sources for existence, let alone validity, still needs to be done by a person, and the output, as it stands in its polished form today, doesn't compel users to take sufficient responsibility for its factuality.
> Are you telling me that today’s model gets this table 85% right and the next version will get it 85.5 or 91% correct? That doesn’t help me. If there are mistakes in the table, it doesn’t matter how many there are - I can’t trust it. If, on the other hand, you think that these models will go to being 100% right, that would change everything, but that would also be a binary change in the nature of these systems, not a percentage change, and we don’t know if that’s even possible.
Of course, humans also make mistakes. There is a percentage, usually depending on the task but always below 100%, where the work is good enough to use, because that's how human labor works.
If I'm paying a human, even a student working part-time or something, I expect "concrete facts extracted from the literature" to be correct at least 99.99% of the time.
There is a huge gap from 85% to 99.99%.
You can expect that from a human, but if you don't know their reputation, you'd be lucky with the 85 percent. How do you even know if they understood the question correctly, used trusted sources, correct statistical models etc?
This does not at all resonate with my experience with human researchers, even highly paid ones. You still have to do a lot of work to verify their claims.
Sure but there's no step change.
Humans often don't trust (certain) other humans either.
But if replacing that with a random number(/token) generator is more reliable to someone, then more power to them.
There is value to be had in the output of this tool, but personally I would not trust it without going through the sources and verifying the result.
A human WILL NOT make up non-existent facts, URLs, libraries and all other things. Unless they deliberately want to deceive.
They can make mistakes in understanding something and will be able to explain those mistakes in most cases.
LLM and human mistakes ARE NOT the same.
> A human WILL NOT make up non-existent facts
Categorically not true and there’s so many examples of this in every day practice that I can’t help but feel you’re saying this to disprove your own statement.
It's absolutely true. Humans misremember details but I'll ask an LLM what function I use to do something and it'll literally tell me a function exists named DoExactlyWhatIWant and humans never do that unless they're liars. And I don't go to liars for information -- same reason I don't trust LLMs.
Tell me, does an LLM know when it lies?
I don't see a functional difference between misremembering details and lying. Sure one is innocent and one malicious, but functionally a compiler doesn't care if the source of your error is evil or not.
An LLM doesn't know when it lies, but a human also doesn't know when they are (innocently) wrong.
The crucial difference is between misremembering occasionally VS consistently.
You won't give any job to that kind of person.
Non-liar humans do in fact make mistakes.
True, but if they start producing non-existent facts consistently, LIKE LLMS, you will be concerned about their mental health.
I insist that human hallucinations are NOT THE SAME as LLMs'.
Not to take away from the main point of the article, which is true but:
It seems to be at intern level according to the author - not bad if you ask me, no?
Did he try to proceed as he would with an intern? I.e., was it a dialogue? Did he try to drop this article into the prompt and see what comes out?
For skeptics my best advice is: do your usual work and at the end drop your whole work in with a prompt to find issues etc. – it will be a net positive result, I promise.
And yes they do get better and it shouldn't get dismissed – the most fascinating part is precisely just that – not even their current state, but how fast they keep improving.
One part which always bothers me a bit with this type of argument – why on earth are we assuming that a human does it 100% correctly? Aren't humans also making similar mistakes?
IMHO there is some similarity with young geniuses – they get tons of stuff right and it's impressive; however, total, unexpected failures occur which feel weird – in my opinion it's a matter of focused training, similar to how you'd teach a genius.
It's worth taking a step back and recognizing in how many diverse contexts we're using (like now, today, not in 5 years) models like Grok 3 or Claude 3.7 – the goalposts seem to have moved to "beyond any human expert on all subjects".
I urge anyone to do the following: take a subject you know really really well and then feed it into one of the deep research tools and check the results.
You might be amazed but most probably very shocked.
In my experience, Perplexity and OpenAI's deep research tools are so misleading that they are almost worthless in any area worth researching. This becomes evident if one searches for something they know or tries to verify the facts the models produce. In my area of expertise, video game software engineering, about 80% of the insights are factually wrong cocktail-party-level thoughts.
The "deep research" features were much more effective at getting me to pay for both subscriptions than in any valuable data collection. The former, I suspect, was the goal anyway.
It is very concerning that people will use these tools. They will be harmed as a result.
> “They will be harmed as a result.”
Compared to what exactly? The ad-fueled, SEO-optimized nightmare that is modern web search? Or perhaps the rampant propaganda and blatant falsehoods on social media?
Whoever is blindly trusting what ChatGPT is spitting out is also falling for whatever garbage they’re finding online. ChatGPT is not very smart, but at least it isn’t intentionally deceptive.
I think it’s an incredible improvement for the low information user over any current alternatives.
It’s deceptive by design because there is no reasoning, and humans created it and know this.
Clearly it is able to solve various logical problems, and can therefore at least imitate logical thought. Is that not reasoning?
And there are plenty of logical problems that many humans can’t solve. Does that mean they’re not capable of reasoning?
At what point would you say something has reasoning? I’d argue that it’s more about how good something is at reasoning, rather than saying it is or isn’t capable of reasoning in absolute terms.
OpenAI knows the tool it markets as “research” does not pass muster. It hallucinates, misquotes sources, and does not follow the formal inference logic used in research.
AI slop already produces many plausible-sounding articles used as infotainment and in academia. We already know this slop adds much noise to the signal and that poor signal slows actual research in both cases. But until now, the slop wasn't masquerading specifically as research! It was presented as an assistant, which provides no accuracy guarantees. “Research” by the word’s common meanings does.
This is why it will do harm. There is no doubt in my mind. And I believe OpenAI knows it. They have quite smart engineers, certainly clever enough to figure it out.
If your concern is primarily about researchers in academia using this and believing what it says without skepticism, then higher education has failed them.
And if you think that all published “research” was guaranteed to be accurate before AI tools became available, then I think you should start looking more critically at sources yourself.
Regardless of the “no true scientist would use it” argument and the argument about what I believe, the fact is that LLM slop is flooding academia.
AI companies promising their LLMs will now do “research” won’t help.
And research that’s done outside of academia (like business or independent-thinker research) will be muddied further, with more people misled.
Yup, none of these tools are anywhere close to AGI or "research". They are still a much better search engine and, of course, spam generator.
I tried to get it to research the design of models for account potential in B2B sales. It went to the shittiest blogspam sites and returned something utterly unimpressive. Instacancelled the $200 sub. Will try it a few more times this month but my expectations are very low.
In my case, very "not useful". Background: I write a Substack of "deep research" articles on autonomous agent tech, and I explored several of these tools to understand the risks to my workflow, but none of them can replace me as of now.
Murray Gell-Mann amnesia
Deep Research is in its "ChatGPT 2.0" phase. It will improve, dramatically. And to the naysayers: when OpenAI released its first models, many doubted they would be good at coding. Now, two years later, look at Cursor, aider, and all the LLMs powering them, and at what you can do with a few prompts and iterations.
Deep research will dramatically improve as it’s a process that can be replicated and automated.
This is like saying: y = e^(-x) + 1 will soon be 0, because look at how fast it went through y = 2! (The curve flattens out at y = 1, no matter how steep the early drop looks.)
Many past technologies have defied “it’s flattening out” predictions. Look at Personal computing, the internet, and smartphone technology.
By conflating a technology’s evolving development path with a basic exponential decay function, the analogy overlooks the crucial differences in how innovation actually happens.
> Many past technologies have defied “it’s flattening out” predictions.
And many haven't
Everything you listed was subject to the effects of Moore's Law, which explains their trajectories, but Moore's Law doesn't apply to AI in any way. And it's dead.
I appreciate your style of humor.
Thanks for making my day :)
Tony Tromba (my math advisor at UCSC) used to tell a low-key infuriating, sexist and inappropriate story about a physicist, a mathematician, and a naked woman. It ended with the mathematician giving up in despair and a happy physicist yelling "close enough."
(from a sibling's link)
> A mathematician and a physicist agree to a psychological experiment. The mathematician is put in a chair in a large empty room and a beautiful naked woman is placed on a bed at the other end of the room. The psychologist explains, "You are to remain in your chair. Every five minutes, I will move your chair to a position halfway between its current location and the woman on the bed." The mathematician looks at the psychologist in disgust. "What? I'm not going to go through this. You know I'll never reach the bed!" And he gets up and storms out. The psychologist makes a note on his clipboard and ushers the physicist in. He explains the situation, and the physicist's eyes light up and he starts drooling. The psychologist is a bit confused. "Don't you realize that you'll never reach her?" The physicist smiles and replied, "Of course! But I'll get close enough for all practical purposes!"
Is that it? Is it sexist because the physicist and mathematician are attracted to the naked woman?
In my experience, people's ideas of "offensive" are all over the map. However, accusations of being offensive are all treated equally, i.e. punishment for offending is a binary function of the accusation, not a function of the actual offense.
It's this mismatch that has contributed heavily to society's whiplash over the last decade.
This sounds like a joke with a lot of truth, even if it is offensive.
Can I have a joke?
disagree - i actually think all the problems the author lays out about Deep Research apply just as well to GPT-4o / o3-mini-whatever. These things are just absolutely terrible at precision & recall of information
I think Deep Research shows that these things can be very good at precision and recall of information if you give them access to the right tools... but that's not enough, because of source quality. A model that has great precision and recall but uses flawed reports from Statista and Statcounter is still going to give you bad information.
Deep Research doesn’t give the numbers that are in Statcounter and Statista. It’s choosing the wrong sources, but it’s also failing to represent them accurately.
Wow, that's really surprising. My experience with much simpler RAG workflows is that once you stick a number in the context the LLMs can reliably parrot that number back out again later on.
Presumably Deep Research has a bunch of weird multi-LLM-agent things going on, maybe there's something about their architecture that makes it more likely for mistakes like that to creep in?
Have a look at the previous essay. I couldn't get ChatGPT 4o to give me a number in a PDF correctly even when I gave it the PDF, the page number, and the row and column.
https://www.ben-evans.com/benedictevans/2025/1/the-problem-w...
I have a hunch that's a problem unique to the way ChatGPT web edition handles PDFs.
Claude gets that question right: https://claude.ai/share/7bafaeab-5c40-434f-b849-bc51ed03e85c
ChatGPT treats a PDF upload as a data extraction problem, where it first pulls out all of the embedded textual content on the PDF and feeds that into the model.
This fails for PDFs that contain images of scanned documents, since ChatGPT isn't tapping its vision abilities to extract that information.
Claude (and Gemini) both apply their vision capabilities to PDF content, so they can "see" the data.
I talked about this problem here: https://simonwillison.net/2024/Jun/27/ai-worlds-fair/#slide....
So my hunch is that ChatGPT couldn't extract useful information from the PDF you provided and instead fell back on whatever was in its training data, effectively hallucinating a response and pretending it came from the document.
That's a huge failure on OpenAI's behalf, but it's not illustrative of models being unable to interpret documents: it's illustrative of OpenAI's ChatGPT PDF feature being unable to extract non-textual image content (and then hallucinating on top of that inability).
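A rough sketch of that distinction, assuming PyMuPDF for the PDF handling (the vision-model call itself is left out, and this is only an illustration, not how any of these products are actually built):

    # Sketch: embedded-text extraction vs. falling back to vision for scanned PDFs.
    # Assumes PyMuPDF (pip install pymupdf); how the payload reaches a model is out of scope.
    import fitz  # PyMuPDF

    def extract_or_render(pdf_path):
        doc = fitz.open(pdf_path)
        text = "".join(page.get_text() for page in doc)
        if text.strip():
            # Born-digital PDF: embedded text exists, so a text-only pipeline can work.
            return {"mode": "text", "payload": text}
        # Scanned PDF: no embedded text, so render each page to an image
        # and hand the images to a vision-capable model instead.
        images = [page.get_pixmap(dpi=150).tobytes("png") for page in doc]
        return {"mode": "vision", "payload": images}

If get_text() comes back empty, the document is almost certainly scanned images, and a text-only extraction pipeline has nothing real to feed the model – which is where the hallucinated fallback kicks in.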
Interesting, thanks. I think the higher level problem is that 1: I have no way to know this failure mode when using the product and 2: I don't really know if I can rely on Claude to get this right every single time either, or what else it would fail at instead.
Yeah, completely understand that. I talked about this problem on stage as an illustration of how infuriatingly difficult these tools are to use because of the vast number of weird undocumented edge cases like this.
This is an unfortunate example though because it undermines one of the few ways in which I've grown to genuinely trust these models: I'm confident that if the model is top tier it will reliably answer questions about information I've directly fed into the context.
[... unless it's GPT-4o and the content was scanned images bundled in a PDF!]
It's also why I really care that I can control the context and see what's in it - systems that hide the context from me (most RAG systems, search assistants etc) leave me unable to confidently tell what's been fed in, which makes them even harder for me to trust.
Unfortunately that's not how trust works. If someone comes into your life and steals $1,000, and then the next time they steal $500, you don't trust them more, do you?
Code is one thing, but if I have to spend hours checking the output, then I'd be better off doing it myself in the first place, perhaps with the help of some tooling created by AI, and then feeding that into ChatGPT to assemble into a report. After they showed off a report about smartphones that is total crap, I can't remotely trust the output of deep research.
> Now after two years look at Cursor, aider, and all the llms powering them, what you can do with a few prompts and iterations.
I don't share this enthusiasm, things are better now because of better integrations and better UX, but the LLM improvements themselves have been incremental lately, with most of the gains from layers around them (e.g. you can easily improve code generation if you add an LSP in the loop / ensure the code actually compiles instead of trusting the output of the LLM blindly).
I agree, they are only starting the data flywheel there. And at the same time making users pay $200/month for it, while the competition is only charging $20/month.
And note, the system is now directly competing with "interns". Once the accuracy is competitive (is it already?) with an average "intern", there'd be fewer reasons to hire paid "interns" (more expensive than $200/month). Which is maybe a good thing? Fewer kids wasting their time/eyes looking at the computer screens?
The interns of today are tomorrow's skilled scientists.
Just FYI: They did roll out Deep Research to those of us on the $20/mo tier at (I think) about the same time you made this comment.
Everyone who has been working on RAG is aware of how important source control is. Simply directing your agent to fetch keyword-matching documents will lead to inaccurate claims.
The reality is that, for now, it is not possible to leave the human out of research, so I think the best an LLM can do is help curate sources and synthesize them; it cannot reliably write sound conclusions.
Edit: this is something elicit.com recognized quite early. But even when I was using it, I wished I had more control over the space over which the tool was conducting its search.
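To illustrate, the source control being talked about here can be as crude as an allowlist applied before anything reaches the model – a minimal sketch, with an invented Document shape and example domains, not any particular product's API:

    # Sketch: gate retrieved documents through a curated allowlist before the LLM sees them.
    from dataclasses import dataclass
    from urllib.parse import urlparse

    @dataclass
    class Document:
        url: str
        text: str

    TRUSTED_DOMAINS = {"sec.gov", "census.gov", "arxiv.org"}  # curated by a human

    def filter_sources(docs: list[Document]) -> list[Document]:
        kept = []
        for doc in docs:
            domain = urlparse(doc.url).netloc.removeprefix("www.")
            if domain in TRUSTED_DOMAINS:
                kept.append(doc)
        return kept  # only vetted sources make it into the model's context

The point is only that the curation step is a human judgment call; the rest is plumbing.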
Human-in-the-loop (HITL) is a buzzword that has become common these days.
I always wondered: if deep research has an X% chance of producing errors in its report, and you have to double-check everything, visit every source, or potentially correct it yourself, then does it really save time in getting research done (outside of coding and marketing)?
It might depend on how much you struggle with writers block. An LLM essay with sources is probably a better starting point than a blank page. But it will vary between people.
This article covers something early on that makes the question of “will models get to zero mistakes” pretty easy to answer: No.
Even if they do the math right and find the data you ask for and never make any “facts” up, the sources of the data themselves carry a lot of context and connotation about how the data is gathered and what weight you can put on it.
If anything, as LLMs become a more common way of ingesting the Internet, the sources of data themselves will start being SEOed to get chosen more often by the LLM purveyors. Add in paid sponsorship, and if anything, trust in the data from these sorts of Deep Research models will only get worse over time.
"Deep research" is super impressive, but so far is more "search the web and surf pages autonomously to aggregate relevant data"
It is in many ways a workaround to Google's SEO poisoning.
Doing very deep research requires a lot of context, cross-checking data, resourcefulness in sourcing and taste. Much of that context is industry specific and intuition plays a role in selecting different avenues to pursue and prioritisation. The error rates will go down but for the most difficult research it will be one tool among many rather than a replacement for the stack.
> It is in many ways a workaround to Google's SEO poisoning.
But the article goes into exactly how Deep Research fell exactly for the same SEO traps.
Right, hence the term "workaround" not "solution"
I watched the recent Viva La Dirt League videos on how trailers lie and make false promises. Now I see the LLM as that marketing guy: even if he knows everything, he can't help lying. You can't trust anything he says no matter how authoritative he sounds; even if he is telling the truth, you have no way of knowing.
These deep research things are a waste of time if you can't trust the output. Code you can run and verify. How do you verify this?
These days I'm feeling like GenAI runs at an accuracy rate of maybe 95-96%. Great at boilerplate, great at stuff you'd want an intern to do or maybe outsource... but it really struggles with the valuable stuff. The errors are almost always in the most inconvenient places, and they are hard to see. So I agree with Ben Evans on this one: what is one to do? The further you lean on it, the worse your skills and specializations get. It is invaluable for some kinds of work, greatly speeding you up, but then some of the things you would have caught take you down rabbit holes that waste so much time. The tradeoffs here aren't great.
I think it's not the valuable stuff though. The valuable stuff is all the boilerplate, because, I don't want to do it. The rest, I actually have a stake in not only that it's done, but how it's done. And I'd rather be hands-on doing that and thinking about it as I do it. Having an AI do that isn't that valuable, and in fact robs me of the learning I acquire by doing it myself.
Yeah, but you have a 4 to 6% error rate, and that's not good even for a dumb computer.
I'll share my recipe for using these products on the off chance it helps someone.
1. Only do searches that result in easily verifiable results from non-AI sources.
2. Always perform the search in multiple products (Gemini 1.5 Deep Research, Gemini 2.0 Pro, ChatGPT o3-mini-high, Claude 3.7 w/ extended thinking, Perplexity)
With these two rules I have found the current round of LLMs useful for "researchy" queries. Collecting the results across tools and then throwing out the 65-75% that is slop leaves genuinely useful information that would have taken me much longer to find.
Now the above could be seen as a harsh critique of these tools, as in the kiddie pool is great as long as you're wearing full hazmat gear, but I still derive regular and increasing value from them.
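For what it's worth, rule 2 can be partly scripted. A minimal sketch using the official OpenAI and Anthropic Python clients – the model names and the question are placeholders, and the point is only to collect answers side by side for manual cross-checking:

    # Sketch: ask the same question to several models and compare the answers by hand.
    from openai import OpenAI
    from anthropic import Anthropic

    QUESTION = "What were global smartphone shipments in 2023? Cite sources."

    def ask_all(question: str) -> dict[str, str]:
        answers = {}

        oai = OpenAI()  # reads OPENAI_API_KEY from the environment
        resp = oai.chat.completions.create(
            model="o3-mini",  # placeholder model name
            messages=[{"role": "user", "content": question}],
        )
        answers["openai"] = resp.choices[0].message.content

        ant = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        msg = ant.messages.create(
            model="claude-3-7-sonnet-latest",  # placeholder model name
            max_tokens=1024,
            messages=[{"role": "user", "content": question}],
        )
        answers["anthropic"] = msg.content[0].text

        return answers

The disagreements between the answers are exactly the spots worth verifying by hand.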
Good advice.
My current research workflow is:
* Add sources to NotebookLM
* Create a report outline with NotebookLM
* Get Perplexity and/or Chatgpt to give feedback on report outline, amend as required.
* Get NotebookLM and Perplexity to each write their own versions of the report one section at a time.
* Get Perplexity to critique each version and merge the best bits from each.
* Get Chatgpt to periodically provide feedback on the growing document.
* All the while acting myself as the chief critic and editor.
This is not a very efficient workflow but I'm getting good results. The trick to use different LLMs together works well. I find Perplexity to be the best at writing engaging text with nice formatting, although I haven't tried Claude yet.
By choosing the NotebookLM sources carefully you start off with a good focus, it kind of anchors the project.
I should also mention that this more 'hands on' technique is good for learning a subject because you have to make editorial assessments as you go.
Maybe good for wider subject areas, longer reports, or where some editorial nuance helps.
> ... perform the search in multiple products
I do that a lot, too, not only for research but for other tasks as well: brainstorming, translation, editing, coding, writing, summarizing, discussion, voice chat, etc.
I pay for the basic monthly tiers from OpenAI, Anthropic, Google, Perplexity, Mistral, and Hugging Face, and I occasionally pay per-token for API calls as well.
It seems excessive, I know, but that's the only way I can keep up with what the latest AI is and is not capable of and how I can or cannot use the tools for various purposes.
This makes sense. How many of those products do you have to pay for?
I'm not OP but I do similar stuff. I pay for Claude's basic tier, OpenAI's $200 tier, and Gemini ultra-super-advanced I get for free because I work there.
I combine all the 'slop' from the three of them in to Gemini (1 or 2 M context window) and have it distill the valuable stuff in there to a good final-enough product.
Doing so has got me a lot of kudos and applause from those I work with.
Wow, that's eye-opening. So, just to be clear, you're paying for Claude and OpenAI out of your own pocket, and using the results at your Google job? We live in interesting times, for sure. :)
No no, that would get me fired.
For which work tasks do you find this workflow useful, given that you can't feed confidential information into the non-Gemini models?
I do that for non-work tasks, like comparing notes on financial schemes or career growth ideas.
For work tasks, there are several different variants of Gemini that are tuned for different things, just as OpenAI has.
> comparing notes on financial schemes or career growth ideas
I'll admit I'm surprised you need to combine all these LLMs to get a decent result on this kind of queries but I guess you go deeper than what I can imagine on these topics.
So you are basically doing a first pass with diverse models, and a second pass catches contradictions and other issues? That could help with hallucinations.
"I can say that these systems are amazing, but get things wrong all the time in ways that matter, and so the best uses cases so far are things where the error rate doesn’t matter or where it’s easy to see."
That's probably how we should all be using LLMs.
Indeed, the main drawback of the various Deep Research implementations is that the quality of sources is determined by SEO, which is often sketchy. Often the user has insight into what the right sources are, and they may even be offline on your computer.
We built an alternative that does Deep Research (https://radpod.ai) on data you provide, instead of relying on web results. We found this works a lot better in terms of answer quality, since the user can control the source quality.
Deep Research, as it currently stands, is a jack of all trades but a master of none. Could this problem be mitigated by building domain-specific profiles curated by experts? For example, specifying which sources to prioritize, what to avoid, and so on. You would select a profile, and Deep Research would operate within its specific constraints, supplemented by expert knowledge.
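As a rough sketch of what such a profile might encode (purely hypothetical – field names and domains are invented, and no current product accepts exactly this):

    # Hypothetical expert-curated profile for a Deep Research-style agent.
    VIDEO_GAME_ENGINEERING_PROFILE = {
        "prioritize": ["gdcvault.com", "docs.unrealengine.com", "registry.khronos.org"],
        "avoid_patterns": ["*.medium.com", "*top-10*"],  # typical SEO-bait shapes
        "require_inline_citations": True,
        "allow_uncited_claims": False,
        "preferred_recency_years": 5,
    }

Enforcement would look much like the allowlist filter sketched further up the thread; the value is that the constraints come from someone who actually knows the field.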
The problem with tools like deep research is that they imply good reasoning skills in the underlying technology. Artificial reasoning clearly exists, but it is not refined enough to build this kind of technology on top of it. Reasoning is the foundation of such a system, and everything built on top of it becomes very unstable.
This is such embarrassing marketing from an organization (OpenAI) which presents itself to the world as a "research" entity.. They could have at least photoshopped the screenshot to make it look like it emitted correct information.
> they don’t really have products either, just text boxes - and APIs for other people to build products.
Isn't this a very valuable product in itself? Whatever happened to the phrase "When there is a gold rush, sell shovels"?
Two factors to consider: human performance and cost.
Plenty of humans regularly make similar mistakes to the one in the Deep Research marketing, with more overhead than an LLM.
The thing is, if you look at all the "Deep Research" benchmark scores, they never claim to be perfect. The problem was plain to see.
Yes, the confidence, tbh, is getting a bit out of hand. I see the same thing with coding on our SaaS: once the problems get bigger, I find myself more often than not starting to code the old way rather than "fixing the AI's code", because the issues are often too much.
I think better communication of certainty could help, especially when they talk about docs or third-party packages. Regularly, even Sonnet 3.7 just invents stuff...
What a beautiful website
I, for one, have it in my prompt that GPT should end every message with a note about how sure it is of the answer, and a rating of "Extremely sure", "Moderately sure", etc.
It works surprisingly well. It also provides its reasoning about the quality of the sources, etc. (This is using GPT-4o, of course, as it's the only mature GPT with web access.)
I highly recommend adding this to your default prompt.
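A rough sketch of what that kind of instruction can look like with the standard OpenAI Python client – the wording and model name are placeholders, not the exact prompt, and whether the rating tracks real accuracy is the question raised below:

    # Sketch: a system prompt asking the model to append a confidence rating to each answer.
    from openai import OpenAI

    SYSTEM = (
        "End every answer with a line of the form "
        "'Confidence: Extremely sure | Moderately sure | Unsure', "
        "followed by one sentence on why, including the quality of your sources."
    )

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": "Summarise the EU's USB-C charging mandate."},
        ],
    )
    print(resp.choices[0].message.content)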
> It works surprisingly well
What do you mean by this exactly? That it makes you feel better about what it's said, or that its assessment of its answer is actually accurate?
It's that its assessment gives me a good picture of which ways to push with my next question, and which things to ask it to look up with its web search tool.
It's a conversation with an AI; it's good to know its thought process and how certain it is of its conclusions, as it isn't infallible, nor is any human.
One other existential question is Simpson's paradox, which I believe is exploited by politicians to support different policies from the same underlying data. I see this as a problem for government, especially if we have liberal- or conservative-trained LLMs. We expect the computer to give us the correct answer, but when the underlying model is trained one way by RLHF, or by systemic/weighted bias in its source documents – imagine training a libertarian AI on Cato papers – you could have highly confident pseudo-intellectual junk. Economists already deal with this problem daily, since their field has been heavily politicized. Law is another one.
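For anyone who hasn't seen the reversal, a toy example with invented numbers (the classic two-group setup):

    # Sketch: Simpson's paradox with made-up numbers. Within each group, option A has
    # the higher success rate; pooled across groups, option B comes out ahead.
    data = {
        # group: {"A": (successes, trials), "B": (successes, trials)}
        "small cases": {"A": (81, 87),   "B": (234, 270)},
        "large cases": {"A": (192, 263), "B": (55, 80)},
    }

    def rate(successes, trials):
        return successes / trials

    for group, opts in data.items():
        a, b = rate(*opts["A"]), rate(*opts["B"])
        print(f"{group}: A={a:.1%}  B={b:.1%}")   # A wins in both groups

    tot_a = [sum(x) for x in zip(*(opts["A"] for opts in data.values()))]
    tot_b = [sum(x) for x in zip(*(opts["B"] for opts in data.values()))]
    print(f"pooled: A={rate(*tot_a):.1%}  B={rate(*tot_b):.1%}")  # B wins overall

Each group favours A, yet the pooled totals favour B, because the group sizes differ. Which cut gets reported is a choice, and that choice is where the politics sneaks in.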
I've never thought of Simpson's Paradox as a political problem before, thanks for sharing this!
Arguably this applies just as well to Bayesian vs Frequentist statisticians or Molecular vs Biochemical Biologists.
What examples do you have for Bayes vs freq; and molecular vs biochem?
I am currently in India in a big city doing yoga for the first time as a westerner.
I dont Google anything. Google maps, yeah. Google, no.
Everything I want to know is much better answered by ChatGPT Deep Research.
Ask a question, drink a chai, get a great, prioritised, structured answer without spam or sifting through ad-ridden SEO pages.
It is a game changer, and at some point they will get rid of the "drink a chai" wait, and it will kill the Google we know now.
I used deep research with o1-pro to try to fact/sanity check a current events thing a friend was talking about, read the results and followed the links it provided to get further info, and ended up on the verge of going down a rabbit hole that now looks more like a leftist(!) conspiracy theory.
I didn't want to bring in specifics because I didn't feel like debating the specific thing, so I guess that made this post pretty hard to parse and I should have added more info.
I was trying to convey that it had found some sources that, if I had come across them naturally, I probably would have immediately recognized as fringe. The sources were threading together a number of true facts into a fringe narrative. The AI was able to find other sources for the true facts, but it has no common sense, and I think it ended up producing a MORE convincing presentation of the fringe theory than the original source of the narrative. It sounded confident and used a number of extra sources to check facts, even though the fringe narrative threading them all together came from only one site – one you'd be somewhat apt to dismiss just by its domain name if it were the only source you found.