Just like every podcast about LLMs this year, it would feel almost rude not to talk about AGI first. Still, the topic is, at least for me, not that interesting; most of it is human slop. People complain a lot about how LLMs hallucinate about everything, but the people hyping AGI are the ones generating slop and hallucinating all the time. Instead of betting on whether AGI will do the greatest things in the world, I'd rather focus on the specific technical approaches that help us scale productivity, so this piece is my best shot at laying out the approaches I believe in most.
Foundational Models
This year we witnessed a huge leap from DeepSeek R1 to Claude Opus 4.5, far beyond what most people imagined. I clearly remember people saying last December that we had hit a wall in pre-training, or that "the scaling law is dead." But look at Sonnet 3.5 or GPT-4o from last December: they are genuinely hard to use in today's scenarios, whether coding, creative writing, or anything else. So my first takeaway from '25 is that scaling still has much further to go. We also know that Gemini 3 Pro is a very successful model largely because of progress made in pre-training.
It's also interesting to see frontier models gradually shifting from pure reasoning toward knowing when thinking is actually appropriate. The famous example is that Claude Opus 4.5 does not perform as well when you turn on thinking mode in Claude Code. We have more or less reached a consensus that reasoning with tools (aka interleaved thinking) works far better on agentic tasks than pure thinking.
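To make the distinction concrete, here is a minimal sketch of what interleaved thinking looks like in code, with `call_llm` and `run_tool` as hypothetical stand-ins for a real model API and tool executor: short bursts of reasoning alternate with tool calls, and observations flow back into the context instead of one long chain of thought being produced up front.

```python
# Hypothetical sketch: interleaved thinking vs. one long chain of thought.
# call_llm() and run_tool() are placeholders for a real model API and tool executor.

def call_llm(messages: list[dict]) -> dict:
    """Placeholder model call: returns either a tool request or a final answer."""
    raise NotImplementedError

def run_tool(name: str, args: dict) -> str:
    """Placeholder tool executor (shell, file read, search, ...)."""
    raise NotImplementedError

def interleaved_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_llm(messages)            # short burst of reasoning
        if step.get("tool"):                 # the model decided to act
            observation = run_tool(step["tool"], step["args"])
            messages.append({"role": "assistant", "content": step.get("thought", "")})
            messages.append({"role": "tool", "content": observation})   # environment feedback
        else:
            return step["answer"]            # answer grounded in real observations
    return "max steps reached"

# "Pure thinking" would instead be a single call_llm() producing one very long
# chain of thought with no observations from the environment in between.
```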
So for '26 I'm still very confident that models' agentic abilities will take a much bigger step than they did this year. We need models with more proactive intent, or more precisely, curiosity: the ability to explore an environment with tools even without a carefully crafted system prompt, a naive mind that wants to poke at everything, a more active personality.
Recently I also re-read the test-time scaling paper. Speaking as an outsider to model training, my view is that long CoT may not help much in agentic tasks, but if we can train models so that the network internally samples a large number of candidate outputs or thoughts before committing to a visible CoT, thinking models might find a sweet spot between speed and accuracy.
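An external analogue of this idea (not the internal mechanism I'm describing) is plain best-of-N sampling with a self-consistency vote; the sketch below fakes the model call so it runs end to end.

```python
# Minimal sketch of external test-time scaling: sample N candidates, then pick
# the most consistent one. sample_answer() is a fake stand-in for a stochastic
# model call so the example actually runs.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Placeholder: a real implementation would call a model at temperature > 0.
    return random.choice(["42", "42", "41", "42", "43"])

def best_of_n(question: str, n: int = 16) -> str:
    candidates = [sample_answer(question) for _ in range(n)]   # spend more compute
    return Counter(candidates).most_common(1)[0][0]            # self-consistency vote

print(best_of_n("What is 6 * 7?"))
```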
Some frontier labs are now making progress on using SOTA models to generate synthetic data for training the next generation of SOTA models, which is good to hear. But we still lean heavily on data labeled by specialists in narrow vertical areas like math and biology. If you ask why I'm so confident that foundation models have much further to go, from data to post-training, or why, as an agent developer, this is the thing I most want to see happen, that is the bet I'm making.
LLM Agents
Prompt Engineering
Back in '22 we found out that different prompts can make agents perform incredibly well on specific tasks, which brought the first wave of ChatGPT wrappers. Three years later, people still call Cursor a company that only writes prompts. It's disheartening to see that attitude. Why? Because many researchers claim to fine-tune an open-source model to get strong results in a specific area, yet it turns out that prompting a frontier LLM through its API in the right way will, in most cases, give far better results than that so-called post-training. So prompting was still a hot topic in '25 and definitely will be in '26. Prompting is an art; sometimes you could call it empirical guessing, but from a theoretical angle, the LLM absorbs an enormous range of perspectives from different people during training, so you need the right prompt to activate the one you want and start the simulation from a good place.
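A tiny illustration of "activating a perspective", using the OpenAI Python SDK as an example client; the model name, the question, and the system prompt wording are all placeholders of mine, not anything prescribed.

```python
# Minimal sketch: the same question with and without a perspective-activating
# system prompt. Model name and prompt wording are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(question: str, system: str | None = None) -> str:
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": question})
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content

question = "Review this function for concurrency bugs: ..."
baseline = ask(question)
activated = ask(
    question,
    system="You are a senior systems engineer who reviews code for race "
           "conditions, lock ordering, and memory visibility issues.",
)
```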
Another thing that made me devote myself fully to agent design was IMO '25. A UCLA team reached gold-medal performance with Gemini 2.5 Pro purely through prompting, which means most of us underrate the momentum already in LLMs. So my best take is that prompting will stay at the center of agent development.
Context Engineering and Skills
These are two terms that became popular in mid-to-late '25. In my understanding, both are still heavily prompt-based, for sure. But the nuance is that we are turning agent design from something ad hoc into something more incremental and deliberate. Although the original ReAct and SWE-agent papers have been around for a while, the models of that time weren't strong enough for us to take the environment and its feedback seriously; we assumed the foundation model would eventually solve everything. Context gives an agent a more continuous way to stay aware of the environment it is in, and as the context grows, the model can keep learning from it (aka in-context learning). I won't go into the details, because the blog posts from Manus and Anthropic already cover them well, but you cannot just copy them, since they mostly skip the path of how they got there. So don't copy; try it yourself. Those posts only hand you ideas that are potentially useful.
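As one possible illustration (my own sketch, not Manus's or Anthropic's recipe), context engineering can be as simple as deciding what stays verbatim and what gets compacted when the history outgrows its budget:

```python
# My own sketch of context assembly: a stable preamble, an environment snapshot,
# and a rolling history that gets compacted rather than silently truncated.
def build_context(preamble: str, env_snapshot: str, history: list[str],
                  budget_chars: int = 20_000) -> str:
    kept: list[str] = []
    used = len(preamble) + len(env_snapshot)
    for entry in reversed(history):                  # most recent turns stay intact
        if used + len(entry) > budget_chars:
            kept.append("[earlier steps compacted: " + entry[:120] + " ...]")
            break
        kept.append(entry)
        used += len(entry)
    return "\n\n".join([preamble, env_snapshot, *reversed(kept)])
```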
Skills, I think, are a way to make tool and API calls perform more stably and to save the context window by executing scripts in the background. Are they perfect? Definitely not, but as with the point above, understanding why the Anthropic team developed SKILL.md is far more valuable than just copying it into your agent system.
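A rough sketch of that idea (not Anthropic's actual implementation): the skill is a script executed outside the conversation, and only a trimmed result ever enters the context window.

```python
# Illustrative sketch: a "skill" as a script the agent runs in the background,
# returning only a short result to the conversation context.
import subprocess

def run_skill(script_path: str, args: list[str], max_chars: int = 2_000) -> str:
    proc = subprocess.run(
        ["python", script_path, *args],
        capture_output=True, text=True, timeout=300,
    )
    output = proc.stdout if proc.returncode == 0 else proc.stderr
    # Only a trimmed tail enters the context window; the heavy work stays outside.
    return output[-max_chars:]
```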
Memory
If you are a ChatGPT enthusiast, the most annoying thing is that once you say something, even something you've long forgotten, it somehow plants your random words deep into its memory (I think ChatGPT can plant an idea just like Inception, lol), and now you are in big trouble. The moment you say anything with even a tiny connection to it, you're done; the chat starts with "Since you're focusing on… you should…" It completely destroys the UX. Gemini does a slightly better job, but it still sometimes pulls up memories I don't want in the current conversation. Long-term memory is still a very tough, unsolved problem for agents, and the benchmarks here don't seem as trustworthy as SWE-bench and its peers. In '26 this will be a steep mountain to climb.
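One obvious mitigation, sketched below purely as an illustration (I have no idea how ChatGPT or Gemini actually implement memory), is to gate recall on a similarity threshold so a "tiny connection" never qualifies:

```python
# Sketch of relevance-gated memory recall (an illustration, not how ChatGPT or
# Gemini actually work): only inject a memory when it clears a similarity bar.
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    embedding: list[float]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def recall(query_emb: list[float], store: list[Memory],
           threshold: float = 0.8, top_k: int = 2) -> list[str]:
    scored = sorted(store, key=lambda m: cosine(query_emb, m.embedding), reverse=True)
    # The threshold does the real work: a weak, incidental match should not qualify.
    return [m.text for m in scored[:top_k]
            if cosine(query_emb, m.embedding) >= threshold]
```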
Designing a TRUE Agent System
Multi-agent systems seem to have had a rough year. It's extremely hard to make them work effectively, largely because of the protocols for agent-to-agent communication. At first we believed a multi-agent system should look like a human organization, with a manager, a programmer, and so on. But none of those designs performed better than Claude Code's sub-agent design, which I think follows the same philosophy as my Verina Search project: agents as tools, sharing the context window with the main agent. So sometimes what instinctively feels right turns out not to be good, and this kind of thing happens a lot in agent development. What should you do? Analyze the traces. That's also why I'm firmly against newcomers entering the agent space through LangChain or other highly abstract frameworks.
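Here is my reading of the "agents as tools" pattern as a sketch, not Claude Code's actual implementation; `call_llm` is a hypothetical stand-in, and the sub-agent is just another entry in the tool table whose result re-enters the shared context.

```python
# My sketch of "agents as tools": a sub-agent is just another tool the main
# loop can call, and only its result flows back into the shared context.
def call_llm(messages: list[dict]) -> dict:
    """Placeholder model call; returns a tool request or a final answer."""
    raise NotImplementedError

def search_subagent(query: str) -> str:
    """A focused agent with its own narrow job; only its answer re-enters the caller."""
    return call_llm([{"role": "user", "content": f"Research and summarize: {query}"}])["answer"]

TOOLS = {"search": search_subagent}

def main_agent(task: str, shared_context: list[dict]) -> str:
    shared_context.append({"role": "user", "content": task})
    while True:
        step = call_llm(shared_context)
        if step.get("tool"):
            observation = TOOLS[step["tool"]](step["args"]["query"])
            shared_context.append({"role": "tool", "content": observation})
        else:
            return step["answer"]
```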
Designing an agent is far more than injecting a prompt and waiting for a miracle. Learning how to call the APIs effectively is the first step into the loop, and keeping the traces is the last step of it, the one that lets you keep pushing the system to the next level. Many coding agents suddenly develop regressions, and if you can test against traces at scale, you will sometimes find a big surprise.
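A minimal sketch of what "keeping the trace" can mean in practice; the JSONL layout and field names here are my own convention, not any product's format.

```python
# Sketch of trace keeping: log every step of every run as JSONL so regressions
# can be diffed at scale later. Field names are illustrative.
import json, time, uuid
from pathlib import Path

TRACE_DIR = Path("traces")
TRACE_DIR.mkdir(exist_ok=True)

def log_step(run_id: str, step: dict) -> None:
    record = {"run_id": run_id, "ts": time.time(), **step}
    with open(TRACE_DIR / f"{run_id}.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

# Usage inside an agent loop:
run_id = uuid.uuid4().hex
log_step(run_id, {"type": "tool_call", "tool": "search", "args": {"q": "..."}})
log_step(run_id, {"type": "observation", "chars": 1834})
log_step(run_id, {"type": "final", "answer_len": 412, "steps": 7})
```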
Another consensus we more or less reached this year is using your agent to develop the next version of itself, which is funny, since it mirrors the loop of using SOTA models to generate synthetic data for training the next generation of SOTA models.
We just talked about scaling in the training section. Another bet of mine: next year frontier models' cost and inference speed will both take a giant step, because so much compute investment lands next year. Faster means 10x productivity, and cheaper means you can freely use test-time scaling to take the best shot at each step and eventually assemble exactly the result you want. Anything that can be solved with FLOPs is not a problem in my view; all of it gets cheaper in the very near future.
Claude Code and the Future of Coding
If we ever write a history book on AI, the first "general agent" title should go to Claude Code. Give it a computer, decompose your long-horizon goals into subtasks, run several Claude Code instances in parallel, and you get the best results. This philosophy applies to almost any workflow, from coding and data analysis to biology and math. If you deeply integrate it into your workflow, Claude Code is the best agent you can get on the market. I always wondered why it is so good; the real lesson I took away is to focus on your own workflow, observe what you lack and what would streamline it, dogfood it with your own team first, and only then worry about distribution. Building the thing you yourself need is the real lesson here.
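For what that looks like mechanically, here is a hedged sketch of fanning subtasks out to several instances; it assumes the CLI's non-interactive print mode (`claude -p`), so check the flags of your installed version before leaning on it.

```python
# Sketch of fanning out subtasks to several Claude Code instances in parallel.
# Assumes the CLI accepts a non-interactive prompt via `claude -p "..."`;
# verify against your installed version.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SUBTASKS = [
    "Audit the data loader for off-by-one errors and propose fixes.",
    "Write unit tests for the ranking module.",
    "Profile the indexing pipeline and summarize the top 3 hotspots.",
]

def run_claude(prompt: str) -> str:
    proc = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, timeout=1800,
    )
    return proc.stdout

with ThreadPoolExecutor(max_workers=len(SUBTASKS)) as pool:
    results = list(pool.map(run_claude, SUBTASKS))
```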
As Andrej Karpathy said a few days ago, coding right now is a highly self-motivated skill. As a programmer, and especially as a junior like me, experimenting and catching up on the skills we need is what will keep us valuable in the future. The coding paradigm will change faster than ever; trying things and learning from them are the best skills to have. There is no absolute answer here; just try it.
Product
I think the chatbot paradigm will finally come to an end in '26. Can you imagine how much time programmers waste on prompting? Almost every product I can think of, from design to coding, is a chatbot. The common-sense point is that we programmers can afford to spend lots of time figuring out which prompts perform well, but our parents and everyone else who just wants to be more productive cannot spend that much time prompting. Something more proactive, the way TikTok requires no prompting at all, is a solidly better direction. This is just a wild guess and probably not entirely right, but the core idea is to design for the abilities of future models, the way SWE-agent was conceived back in the GPT-3.5 era (Cursor also started in that age). Take the larger perspective: instead of waiting for Anthropic to ship Claude Code and then claiming your agent beats it on SWE-bench, create something far beyond today's imagination, because you hardly know how good the model will be in the next few months.
Wrapping Up
'25 was a big year for me. I started coding from scratch, and with Cursor and Claude Code I built my very own AI search system, Verina, which I promise will get a major update very soon. I'm grateful for what LLMs and this era have given me. My aim is to give back by designing a more proactive product that benefits people. Everyone should get to enjoy the LLM agent revolution this decade.
Building something that creates value; building something for the future.