AI is everywhere today, with several models all touted as being extremely capable and helpful. For me, that means having AI do research on complex topics, saving me hours every day. But I have access to several models, and I can’t help but wonder: which one is worth your trust?
Using an AI for Research
One of the most useful things about AI for the average person is the ability to have it search the internet for a ton of information in no time at all. Stuff that might take you hours to investigate and discover can be located and compiled into a quick summary by an AI model in under a minute. On the surface, it might seem like all those different AI models are effectively the same, and just have different names and different companies funding them.
But I’ve spent a lot of time working with different AI models, training them, testing them, and improving them; they each have their strengths and weaknesses, and if you’re looking to actually pay for a subscription to a particular AI model, my experience might prove useful to you. I’ve tested a variety of advanced models, and I’m going to share my findings with all of you. To conduct this test, each model received the exact same prompt:

“Please provide me with a research report detailing the potential benefits of the United States converting fully to renewable energy sources, including feasibility, economic and ecosystem benefits, cost of implementation, and potential obstacles to a full conversion. Please include tables when appropriate to support your report, and provide sources for all factual statements.”
This prompt was submitted to five different cutting-edge models: Claude Opus 4, Gemini 2.5 Pro, Grok 3, Meta Llama 4 Maverick, and GPT-4.1. As for how they’re being graded, I looked at a few things: whether the model asked any clarifying questions, whether it actually finished the report, how long and detailed the final write-up was, how useful its tables were, and the number, quality, and recency of the sources it cited.

Now it’s worth noting that there are specialized AI models for different types of tasks, and none of the ones I’m testing here today are the “deep research kind.” However, I think that’s appropriate, because most average users are going to hop on the most common AI model they can find and ask away without hunting down the most specialized option. These AI models are some of the most commonly used, which is why these results are interesting.
Claude Opus 4: Great Potential Limited by Unreliability
Unfortunately, Claude Opus 4 got off to a rough start right away. This is a model that boasts a higher level of “thinking” that you can turn on and off. It has the ability to reason, which can allow it to answer more complex questions in greater depth. Naturally, I turned this reasoning mode on for my research prompt. The issue? The model kept thinking itself into dead ends. It would get partially through the report, then pop out an error instead of the final product I wanted. This happened several times in a row.
It seemed like my request was just too complex for it. But after the third attempt, Claude Opus 4 finally managed to output the research report I’d asked for. Or at least, part of it. It covered a lot of what I asked for in great detail: the current energy landscape of the U.S., a feasibility assessment, implementation costs, and economic and ecosystem benefits. But it came to a full stop during the cost-benefit analysis, roughly two-thirds of the way through the report.

Needless to say, this is really bad. The model didn’t finish providing what I asked for, which is the bare minimum you’d expect from it. The worst part is that the portions of the report I did get were very good. It didn’t ask me any clarifying questions, but it did provide an executive summary of the whole report at the beginning. It included a table in almost every section, and it got incredibly detailed with its sources, often citing a source for each number in a table, all from reputable places like government agencies and professional academic studies.
Still, none of that really matters if the model can’t actually finish giving me the report, so Claude Opus 4 gets a failing grade here. It’s a real shame, because overall, Claude has been one of my favorite models ever since I switched from GPT, but it seems to be better suited to more creative tasks.

Gemini 2.5 Pro: Lacking Depth for Its Length
Overall, Gemini 2.5 Pro did alright. It didn’t ask any clarifying questions, but it included an executive summary and a conclusion in the report. It used 12 high-quality sources, including reports from the National Renewable Energy Laboratory, U.S. Department of Energy, and International Renewable Energy Agency, though it’s worth noting that none of these sources were more recent than 2022. It had five tables, though some of them were a bit sparse on data and didn’t provide much value.
The report landed at a middling length of about 1,300 words, which isn’t quite as long as I’d like for a detailed research report, but better than some of its competitors. Unfortunately, the model broke the report down into way too many bite-sized pieces, with some sections containing just a sentence or two. Sometimes a section would offer a vague statement or estimate without including any actual numbers or actionable information.

It technically talked about everything I asked for, but it felt more like one big summary of a report than an actual report itself. With some refinement of the prompt and some added constraints, I could see Gemini 2.5 Pro doing better in this test, but as it stands right now, it just felt average overall. Thankfully, it’s more capable in some other areas that Google has pushed it into.
Grok 3: Abundant Sources and Excellent Citations
At this point in the test, I noticed that none of these AI models were very keen on asking clarifying questions about my request, including Grok 3. While that’s a bit of a letdown, Grok did impress me in other ways, namely the number of vetted and reliable sources it used for its research, as well as how cleanly it cited them while providing facts and estimates throughout the report. Where Gemini 2.5 Pro used only 12 sources for its report, Grok 3 used 21, and managed to pull some from 2023 as well.
It leveraged these sources extremely well throughout the report. Each of the surprisingly robust and detailed tables cited sources for its data, and almost every factual statement and data estimate had a cited source as well, even if it was for a single sentence. This made it extremely easy to check the accuracy of every statement and know where to look if I wanted more information about any detail the model presented in the report.
The report was pretty extensive at around 2,000 words as well. While there were a few small sections where Grok 3 could have gone into more detail, overall, it provided plenty of exact figures, detailed explanations, and, above all, numerous academic and government sources that were integrated into the report more thoroughly than any of its competitors managed. It seems like Grok is actually an aptly named AI model.
Meta Llama 4 Maverick: Disappointing Across the Board
Unfortunately for Meta, its Llama 4 Maverick model had a lot of problems with my request for a detailed report on renewable energy. For starters, the report itself was absurdly short at a measly 800 words, and that’s with some redundancy it really didn’t need. Not only did the summary and conclusion both cover the same details, but the model even tacked on a paragraph after the report explaining what it was about and what it had achieved.
The tables provided were often sparse on data, and some sections of the report offered up fairly useless statements that lacked any concrete data, such as, “Achieving a 100% renewable grid requires significant advancements in energy storage (e.g., batteries, pumped hydro) and grid flexibility.” This was the only sentence in the “Grid Integration and Energy Storage” section of the report, and it didn’t offer any concrete numbers. I had to go into the source myself to find the numbers, which defeated the point of asking the AI to do this for me in the first place.
On top of all of this, the report had more bullet points and lists than anything else, and though it did use reputable sources, the model only included 8 of them, significantly fewer than any of its competitors. Overall, Meta Llama 4 Maverick performed the worst in this test across several metrics, and that surprised me, since it took just as long to compile its response as all the other models. Meta AI can be useful, but clearly, this type of task is not its strong suit.
GPT-4.1: Barebones and Unsatisfying
I was honestly surprised by just how lackluster GPT-4.1 was in this test. This is the flagship GPT model, and yet the final report was around 800 words, just like Meta’s Maverick. Somehow, though, GPT-4.1 did even worse, providing me with a truly barebones experience. Two of its four tables had two data rows or fewer, providing so little information that they may as well not have been there.
Most of the report was just bullet-point lists with generic statements and little data backing them up. The most “detailed” section in the whole report was one with three bullet points and a whopping 70 words of information. While the model did use reputable sources like the Political Economy Research Institute, Princeton University, and the U.S. Environmental Protection Agency, it provided only surface-level information from those cited articles, requiring me to go and do the research myself anyway to learn anything truly useful.
At the very least, what information the model did provide was accurate, but at the end of the day, it just lacked any meaningful depth. It was by far the least satisfying of the tested models. Maybe ChatGPT is better suited to other tasks.
As far as AI has come in recent years, it’s obviously still a far cry from perfect. I was surprised to learn that Grok 3 did the best out of all of the models I tested, though admittedly, Claude Opus 4 may have done even better if it had actually managed to finish the assigned task. You may not use these AI models for deep research projects, but their performance here is an indicator of their general output quality and the way they’ve been trained, which affects all assigned tasks and requests.
That said, this test has made one thing obvious: if you’re looking for an AI that can help you with incredibly complex tasks that require it to compile accurate information from across the web, you may want to look into AI models with dedicated deep research modes or more advanced reasoning abilities.