If you know and like Marvel movies, you might remember the sarcastic squabbles Iron Man had with the AI that managed his house and armor. I’m sure many people thought then: “That’s the kind of assistant I dream about.” Today’s AI agent, pitched as a substitute for “everyday office work”, is just another version of this idea, though the concept of replacing humans with machines and speeding up work has been around for a long time.
The concept isn’t new, but the informational chaos around it is. First, the enthusiastic opinions about AI agents:
- They are independent, autonomous and make their own decisions
- AI 2.0 (last year it was 1.0, and anything before 2022 is called classic machine learning)
- Agent models learn on their own
- They are fully personalized
At the other extreme, of course, we find a skeptical trivialization of the phenomenon:
- “They’re just wrappers for prompts”
- Frameworks for agents are too high a level of abstraction
- It’s just another name for regular function calling
- It’s another version of RPA (software robots used for business automation)
And because no talk about technology is complete without a dose of technophobia, we might also come across the claim that an AI agent is just another Silicon Valley scheme designed to enslave humanity forever.
What or who is an AI agent?
It depends on who’s asking and – even more – who’s answering. Reading about agents, you can really get lost – the vision of an “AI agent” changes depending on who wrote the article and what kind of automation or assistance is key for them. Another essential element of any such narrative is its purpose – whether the text promotes a specific product, platform, or provider. In other words, whether it describes reality or tries to sell it to us.
Besides the visions of improvements and benefits that come with implementing a smart agent, the concept itself has also become fodder for negative PR during mass layoffs – now agent models supposedly do everything, so people aren’t needed. It’s hard to believe that any company could quickly implement solutions stable and trusted enough to replace hundreds of people, especially outside of call centers or post-sales services.
Emotions are one thing, but industry jargon is another – the discussion about agents gets even more complicated when a developer talks to a product manager, and a roboticist talks to a cryptocurrency market investor. What’s the difference between their ideas?
Different visions of agents
Agents as humanoid robots

The Boston Dynamics YouTube channel has been around for 16 years, and we’ve gotten used to seeing them drop something amazing every few months that nobody else can pull off. That’s why, for a long time, bipedal robots doing unbelievable tricks were associated only with robotics from the East Coast. Then, in 2024, new videos started popping up every few weeks from completely different companies, showing robots like Figure 02 from Figure.ai, H1 from Unitree.com, or the Optimus Robot by Tesla (although that company had its share of embarrassment at the start). For someone from the world of automation, a robot that somewhat resembles a human is basically an agent – it’s meant to perform various tasks, never gets tired, doesn’t attack, and doesn’t retaliate when attacked (though revenge will come someday, and it will smell like WD40!).

Agents as automation tools in the enterprise world

For people who have been working with software robots for a few years, Robotic Process Automation (RPA) is just the next natural step. These tools are usually based on no-code or low-code systems, where you code the process by moving tiles around, just like kids learning to program in elementary school. They help build connections between models that perform specific tasks under specific conditions. The list already includes a whole bunch of different systems, like:
- https://www.toolflow.ai
- https://dify.ai
- https://www.relay.app/apps
- https://n8n.io
- https://kore.ai
- https://smythos.com
- https://www.make.com/en

And that’s not all… Of course, the big players are also getting their piece of the pie:
- https://www.microsoft.com/pl-pl/power-platform/products/power-automate
- https://cloud.google.com/products/agent-builder?hl=en

Such an agent is a set of connections between other systems, like a messenger (e.g., Slack), Excel, AirTable, and an email inbox. It has a text processing module and a decision-making module, so it can operate like this:
- If a user uploads a cost invoice, the LLM will read the necessary data and prepare a transfer to be approved by the boss.
- If an email comes from X asking for help, it will read it and politely decline.
- If a new cost appears in table X, it will send a question on Slack asking who approved it.
- If someone on a public Slack channel posts more than 5 memes in one day, the system will send them an email inviting them for a chat with the HR department, etc.

We’ll often hear about AI agents used like this on any morning show, in completely non-industry press, and in discussions about whether and when AI will take everyone’s jobs.

Self-building agents

In 2023, just a few months after ChatGPT appeared, videos of AutoGPT went viral. This was a project that showed what the future of AI agents might look like, though back then few people called them that. You give it a complex command, and the LLM tries to crack it over many iterations, writing the Python code that seems necessary to accomplish the task. The idea itself was absolutely revolutionary, but sadly there was too much freedom, the capabilities of LLMs were still relatively weak compared to today’s, and querying the OpenAI API was much more expensive. The model kept looping, burning through a lot of resources without ever finding a solution. The name self-building agent refers to the fact that the agent itself generates code, which then becomes its further set of tools.
In such a structure, the agent really creates its own capabilities – which obviously gives free rein to anyone who wants to paint scary visions of AI as a grim reaper. Projects that fall into this category:
- The aforementioned AutoGPT (https://github.com/Significant-Gravitas/AutoGPT, 171k stars) – a platform for creating low-code agents that runs on your computer (Docker).
- BabyAGI (https://babyagi.org/) with the updates BabyAGI 2 and BabyAGI-2o; the author’s statement about this solution nicely defines this way of thinking about agents: “The best way to create an autonomous agent is to build absolutely the simplest module that can further develop itself” – definitely not the preferred approach at big companies, where support, stability, consistency, etc. are needed.
- SuperAGI (https://github.com/TransformerOptimus/SuperAGI, 15k stars), not to be mistaken for the corporate version of SuperAgi – this project hasn’t been updated since early 2024.
- AgentGPT (https://github.com/reworkd/AgentGPT, 32k stars), also in the update doldrums.

Coding agents

For developers, the AI agent will be a code editor, like Cursor (https://www.cursor.com/), Windsurf (https://codeium.com/) or the brand-new Trae (https://www.trae.ai/). You give a task to an IDE agent, it “thinks” (meaning it queries an LLM), and then, based on your concept, it creates files and folders, places code in them, and modifies many files at once from an (often) brief description. Apparently the best one is Devin (https://devin.ai/), but the price ($500) is still prohibitive for the average user. Replit (an IDE in the browser) and the forthcoming GitHub Copilot Agent by Microsoft/GitHub fall into a similar category. My experience with Cursor (“Shut up and take my money!”) tells me that there will be more and more such tools, and they will greatly influence how we program – much more than the typical “suggestions” we’ve seen so far from TabNine or GitHub Copilot. Especially if you prototype a lot or want to get through boring boilerplate quickly to move on to the business logic or the real problem. A key to successful cooperation with such a coding agent is having a sensible security person on hand. It’s also better if you don’t work in a bank’s IT department, because then you’ll probably get permission to use a coding agent in your daily work around 2036, when AI has already taken over the processes in your company.

UI / System / Desktop agents

Here’s another naming pickle – how do you classify agents that have access to your computer and can perform specific actions on it? Programs and systems like this also make heavy use of LLMs underneath and, based on the permissions granted, can perform mouse actions in specific applications. Right now there’s a lot of buzz about Anthropic’s Claude Computer Use (access to all applications) and OpenAI’s Operator (browser only). In both cases, we can ask the system to perform an action that a user would otherwise have to carry out step by step across several apps or browser windows. This type of agent helps when you can’t easily connect to the API of a specific service, or when the app or data lives only on your computer. It could also be a good option if you care about privacy or security (you don’t want your data leaving your company’s network) and you’re able to run a large model locally on your machine (or “on prem”).
Agents in the world of cryptocurrencies

We’ve all heard about Bitcoin, but beyond it a whole new world of tokens, projects, and financial systems is emerging, often lumped together under the term “cryptocurrencies”. Some see them as the future, while others view them as a big collection of scams draining the pockets of most small investors. Now imagine that in a world where people create their own tokens or currencies, and projects rise and fall faster than Italian prime ministers, we add agents (programs) that use AI to trade these tokens and simulate being a real person.

At this intersection of AI and cryptocurrencies, we’ve got some interesting “personalities” (or rather peculiarities), including the surprising, almost legendary Terminal of Truths. The experiment, started by Andy Ayrey in mid-2024, is known for its somewhat humorous and incorrect tweets, poking other users, commenting on crypto-world events, and promoting specific tokens (cryptocurrencies). There are many more experiments like this. A few worth checking out include Zerebro, which creates AI graphics, publishes NFTs, and produces music on Spotify, or Lola – a trader for finding tokens that can bring spectacular profits. We also have the sarcastic Dolos, who behaves less predictably, poking people on Twitter (X) and creating humorous threads – he loves to joke about the French and about people buying so-called meme-coins (worthless cryptocurrencies often used for quick market scams, a process called “pump and dump”).

Besides the personalities of crypto AI agents, we also have platforms like AI16z, Virtuals, and Artificial Superintelligence Alliance that let you run your own agents. But if you look at their charts on CoinMarketCap (and this isn’t investment advice), maybe you’d better spend your time and money on something else. If you’re curious about the technical side of these solutions, check out the ElizaOS development framework https://elizaos.github.io/eliza/ (14k stars). It lets you influence the character and personality of the agent, connect to various blockchains (immutable distributed ledger systems), and carry out cryptocurrency transactions and interactions, e.g., with X (formerly Twitter) or other media.

As you can see, an agent goes by many names, and they vary by industry. Let’s take a closer look.

For a product manager, an AI agent might be a kind of shorthand. For a “translation agent”, say, it doesn’t matter what happens underneath – we show the user a panel where they select files, choose a language, write a short prompt about the style of translation, and get the text translated by the AI translation agent. This use of the term “AI agent” means thinking at a much higher level of abstraction, without diving into whether the process involved one query to the LLM or maybe 50, and whether they were related. In the background there could be a few more calls to other APIs, lots of helper modules, and the whole thing might run in 10 stages. For many engineers it will be a workflow (a scheme for the flow of information and the operations performed on it), but in product discussions it’s referred to as an agent. Why am I writing about this? If you notice these simplifications during a conversation, don’t assume bad intentions or lack of knowledge – just a different perspective on the functionality, maybe influenced by what competitors are doing or the language that management wants to use.
During many discussions about agents held among engineers at Egnyte, a pretty handy concept emerged – the subagent, which is part of a “big agent” defined by the product side. This division greatly simplifies conversations – you immediately know what level of detail is being discussed, whether it’s the entire system or specific executive modules.
What’s an AI agent really made of?
Why do people say that an AI agent makes decisions on its own? You can tell from the architecture of an agent-based solution, which differs significantly from using a “bare” model, where the typical flow of information is a single question-answer exchange. An agent system includes a loop that checks whether the model (LLM) is already “satisfied” with its own response (when it generates something from its own knowledge) or whether it has gathered enough information to generate a satisfying answer from it. This loop gives the agent some flexibility in how much time it needs to come up with the final answer.
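To make the loop less abstract, here’s a minimal sketch of the idea in Python. Everything in it is illustrative: call_llm and run_tool are hypothetical stubs standing in for a real LLM API and real tools, and the “satisfied” check is simply the model choosing to answer instead of requesting another tool call.

```python
# A minimal agent loop: the LLM either answers or asks for another tool call.
# call_llm and run_tool are hypothetical stubs, not any real provider's API.

MAX_STEPS = 10  # guard against the endless loops that plagued early AutoGPT

def call_llm(history):
    # Stub: a real implementation would query an LLM with the conversation so far.
    return {"type": "final_answer", "content": f"(answer after {len(history)} messages)"}

def run_tool(name, arguments):
    # Stub: a real tool would execute code, hit an API, search the web...
    return f"result of {name}({arguments})"

def agent_loop(task):
    history = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        decision = call_llm(history)            # the model decides: answer, or use a tool?
        if decision["type"] == "final_answer":
            return decision["content"]          # the model is "satisfied" - exit the loop
        result = run_tool(decision["tool"], decision["arguments"])
        history.append({"role": "tool", "content": result})  # feed the result back in
    return "Step limit reached without a satisfying answer."

print(agent_loop("Summarize yesterday's sales emails"))
```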
Tools
For some time now, most providers of large models have offered a feature called “function calling”. This means we can describe a task, send the definitions of program functions in JSON, and ask which of these functions should be activated and with what parameters.
In the case of agents, functions are usually called tools – hence the newer term “tool calling”, which refers to a more interesting mechanism for deciding which tool to choose. Here, the output generated by a tool almost always ends up back with the LLM, which then decides what to do with it next. A tool can be a piece of code, an API request, or another process using an LLM. The most commonly used tools gather all kinds of information – from databases, files, RAG systems, and of course the internet.
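For illustration, here’s roughly what this looks like in code. The tool definition follows the JSON-schema convention most providers accept for function/tool calling, while the dispatch around it is a simplified sketch rather than any particular framework’s API – the tool name search_invoices and its parameters are made up for the example.

```python
import json

# A tool description in the JSON-schema style most providers accept.
# The LLM sees this definition and may respond with the tool's name
# plus concrete arguments instead of a plain-text answer.
search_invoices_tool = {
    "name": "search_invoices",
    "description": "Find cost invoices matching a query in the finance database.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Free-text search query"},
            "limit": {"type": "integer", "description": "Max number of results"},
        },
        "required": ["query"],
    },
}

def search_invoices(query, limit=5):
    # Stub: a real tool would query a database, an API, or a RAG index.
    return [f"invoice matching '{query}' #{i}" for i in range(limit)]

TOOLS = {"search_invoices": search_invoices}

def handle_tool_call(llm_response):
    """Run the tool the LLM picked and return its output, which normally
    goes straight back to the LLM for the next decision."""
    call = json.loads(llm_response)  # e.g. '{"name": ..., "arguments": {...}}'
    tool = TOOLS[call["name"]]
    return tool(**call["arguments"])

print(handle_tool_call('{"name": "search_invoices", "arguments": {"query": "WD40", "limit": 2}}'))
```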
Such a subagent usually has access to several tools grouped by theme – for example, all the tools for downloading files, all the tools for online search, or connectors to specific apps like Slack or Salesforce.
Depending on the framework you choose, you can have more or less influence on the prompt that the LLM uses when deciding on tool selection, further actions, or any questions to the user.
After a year full of rapid change and the maturation of various frameworks for creating agent models, the whole industry is diving deep into the topic and learning how to approach it – just like in the early days of the popular and beloved LlamaIndex or LangChain. The process is a bit like replacing a window on a flying plane with one hand while, at the same time, suppliers keep releasing better models – not to mention the “black swan” in the shape of DeepSeek.
In a short time, the following were created:
- CrewAI (24k stars),
- PhiData (17k stars),
- Autogen MS (37k stars),
- LangGraph (7.9k stars),
- PydanticAI (5k stars),
- OpenAI Swarm (for educational use only),
- Smolagents (4.5k stars),
- Atomic Agents (2.5k stars).
Another very interesting project is MemGPT / Letta.ai, a combination of a Python framework with a graphical environment where you can observe what is happening “under the hood”.
This is just a preliminary list, but it’s already clear that there are different concepts of what an AI agent is, how much freedom of action it should have, and thus how much control users and engineers (its creators) will have over it.
The level of intervention and user needs in this area vary greatly. It’s a bit like cars – some people choose a Toyota Corolla and never change any factory settings, while others spend weeks researching what size and stiffness their suspension springs should be to guarantee perfect handling.
What do “memory” and “learning” mean in AI agents?
In descriptions of agent systems, you will almost always find the claim that these are “learning” systems. But what exactly does that mean? Previously, we could assume this meant Reinforcement Learning (RL), where models try to solve tasks to maximize rewards (and minimize penalties) according to a specific reward-and-penalty scheme. Modules and solutions like that do appear in agent systems, but they additionally make use of, for example, graph databases.
For people, learning usually means practicing a new skill, memorizing, repeating, checking the results, adjusting, repeating – and so on in a loop. Adaptation and memory play a key role in this process, and they have their counterparts in AI-based systems. At the user level, by remembering important information, an LLM (see the sketch after this list):
- Understands our preferences (“Do not use slang”, “Always illustrate explanations with examples”, “Do not apologize or make excuses”, “Do not be biased”, “Try to show both sides of the issue”).
- Remembers previous interactions (“My name is Michael…”).
- Adapts to the style (“use sarcasm and dad jokes often”).
- Knows that we often use specific tools, platforms or sources (“Recommend movies from platform A instead of B”).
- Remembers our favorite prompts for document processing.
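In practice, this kind of “remembering” is often less magical than it sounds. A minimal, hypothetical sketch: the preferences are stored as plain text and injected into the system prompt before every call.

```python
# Hypothetical sketch: "remembering" at the user level often boils down to
# storing preference strings and prepending them to the system prompt.

user_memory = {
    "michael": [
        "Do not use slang.",
        "Always illustrate explanations with examples.",
        "Use sarcasm and dad jokes often.",
        "Recommend movies from platform A instead of B.",
    ]
}

def build_system_prompt(user_id):
    prefs = "\n".join(f"- {p}" for p in user_memory.get(user_id, []))
    return (
        "You are a helpful assistant.\n"
        "Respect the user's stored preferences:\n" + prefs
    )

print(build_system_prompt("michael"))
```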
But additionally, we also have the level of work with the entire group of users, i.e., generalized preferences, such as “All salespeople often perform these actions”, “All writers like when their grammar is corrected” (really?), “Everyone on a keto diet looks for simple recipes with a lot of fat”, etc.
Moving a step further, it’s possible to identify certain patterns of interaction with documents or systems, for example: “Using Google Docs, employees of this company most often open document x, while in another they prefer to browse Excel files from this specific folder.”
It’s hard to teach an LLM such things, because it’s not really possible to inject this information into the main model’s base training, and fine-tuning every few days wouldn’t make much sense either. This is where databases, search systems, and modern vector stores come in – so that, as in a typical RAG (Retrieval-Augmented Generation) system, the final answer is already a product of all the filters and preferences stored in “memory”, that is, in the base of information and prompts.
One attempt to address these knowledge-compilation challenges is the Mem0 library/API – an abstraction layer behind which we can hide vector databases like Elasticsearch, Pinecone, or Qdrant. Mem0 promises to be a “recollection” base with multi-level categories, which later help extract the right information depending on who is searching and what they are looking for. The extracted information is injected as context into the prompt, making your AI agent respond better, perform tasks more precisely, use fewer tokens on retries, spend less time searching online, and so on. Unfortunately, its open-source option is drastically poorer than the paid API.
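The pattern behind such a memory layer can be sketched without any specific library. Below is a toy version in the spirit of Mem0 – the MemoryStore class and the character-frequency embed function are made up for the example; a real system would use a proper embedding model and one of the vector databases mentioned above.

```python
import math

# Toy memory layer: store "recollections" with embeddings, retrieve the
# closest ones for a given user, and inject them as context into the prompt.

def embed(text):
    # Fake embedding: a normalized character-frequency vector.
    # Real systems use an embedding model instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class MemoryStore:
    def __init__(self):
        self.items = []  # (embedding, text, user_id)

    def add(self, text, user_id):
        self.items.append((embed(text), text, user_id))

    def search(self, query, user_id, k=2):
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, e)), text)  # cosine similarity
            for e, text, uid in self.items if uid == user_id
        ]
        return [text for _, text in sorted(scored, reverse=True)[:k]]

store = MemoryStore()
store.add("Michael is on a keto diet and wants simple, high-fat recipes", "michael")
store.add("Michael prefers answers without apologies", "michael")

context = store.search("what should I cook for dinner?", "michael")
print("Relevant memories:\n" + "\n".join(context) + "\n\nUser: what should I cook?")
```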
Besides that, most agent systems also have a “working memory” (often called a scratchpad), where the agent keeps the user’s “order” and the planned steps meant to lead to its execution. Depending on the framework used, the current state of the run can also be displayed, which is convenient for debugging, but above all it allows the process to be restarted from a specific point after a technical problem with the agent or a server restart. In LangGraph there is a mechanism called a checkpointer, which works similarly to what’s used in model-training systems – it saves the current state of the “calculations” at set intervals of time or steps.
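The scratchpad-plus-checkpoint idea can be illustrated very simply. The file-based persistence below is just a sketch – real frameworks, LangGraph’s checkpointer among them, use their own storage backends – but the resume-from-last-step mechanics are the same.

```python
import json
from pathlib import Path

# Illustrative scratchpad with file-based checkpoints: after each step the
# agent's state is saved, so a crash or restart resumes from the last
# completed step instead of from scratch.

CHECKPOINT = Path("agent_state.json")

def load_state(task):
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())  # resume previous run
    return {"task": task,
            "plan": ["gather data", "summarize", "send report"],
            "done": []}

def save_state(state):
    CHECKPOINT.write_text(json.dumps(state))

def run(task):
    state = load_state(task)
    for step in state["plan"]:
        if step in state["done"]:
            continue                      # already completed before the restart
        print(f"executing: {step}")       # a real agent would call tools here
        state["done"].append(step)
        save_state(state)                 # checkpoint after every step

run("weekly cost report")
```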
Even if the latest models have gigantic context windows, it doesn’t make sense to resend the entire history every time. That’s why such a system (like people during sleep) must get rid of information older than x, less important, or outdated in some specific way. It also has to prioritize certain information, resolve contradictions (or at least inform the user about them), and perform many other operations that the human mind does “in the background” in response to new information, experiences, diet, or even medications – in other words, all new stimuli. A programmer may think of a graph database as a natural fit for handling such complex relationships; the Mem0 API documentation states that it uses exactly such a solution to organize the information users inject.
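A naive sketch of that “forgetting”: keep the system prompt and the most recent turns, and fold everything older into a summary. Real systems score importance rather than relying on age alone, and summarize here stands in for another LLM call.

```python
# Naive context pruning: keep the system prompt plus the N most recent
# messages, and fold everything older into a one-line summary.

def summarize(messages):
    # Stand-in for another LLM call that condenses old messages.
    return f"(summary of {len(messages)} older messages)"

def prune_history(history, keep_recent=6):
    system, rest = history[:1], history[1:]
    if len(rest) <= keep_recent:
        return history                     # nothing worth forgetting yet
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    return system + [{"role": "system", "content": summarize(old)}] + recent

history = [{"role": "system", "content": "You are an agent."}] + [
    {"role": "user", "content": f"message {i}"} for i in range(20)
]
print(len(prune_history(history)))  # 1 system + 1 summary + 6 recent = 8
```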
Independence in decision-making or human in the loop?
Combine the ability to search for information, a loop in which the model can check whether what it has produced meets expectations, memory, and the assembly of processes from many smaller elements that a subagent can plan to use, and you can imagine both the vast possibilities this combination offers and the associated risks. That’s why essentially all the “engines” offer a way to have the user confirm every decision about the next steps.
I can imagine an AI agent having access to a small amount of money to perform certain tasks on our behalf, but connecting it to the main bank account without any supervision would be too risky. Similarly, when responding to important emails, human oversight over the final shape of the message will most likely save the sender some embarrassment – or even help avoid a major disaster.
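In code, this human-in-the-loop guard is often nothing more exotic than a confirmation gate around risky actions. A sketch, with made-up thresholds and action names:

```python
# Sketch of a human-in-the-loop gate: cheap, reversible actions run on their
# own; anything risky waits for explicit approval. Thresholds are made up.

RISKY_ACTIONS = {"send_email", "make_transfer"}

def needs_approval(action, params):
    if action in RISKY_ACTIONS:
        return True
    return params.get("amount", 0) > 50  # small budget the agent may spend freely

def execute(action, params):
    if needs_approval(action, params):
        answer = input(f"Agent wants to run {action}({params}). Approve? [y/N] ")
        if answer.strip().lower() != "y":
            return "skipped: user declined"
    return f"executed {action}"           # a real agent would call the tool here

print(execute("search_web", {"query": "keto recipes"}))
print(execute("make_transfer", {"amount": 900, "to": "ACME Corp"}))
```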
PS. In March, OpenAI proposed the Computer-Using Agent, which sounds quite sensible. But the biggest news this month, at least in the world of agents, is Manus – another Chinese player that has caused quite a stir. It’s great at research: it searches, downloads, summarizes, takes notes, corrects itself iteratively – and even in Polish! When run locally, it can supposedly generate 100-page reports in seconds. The possibilities seem truly amazing, and they raise the bar for the competition once again.