I spent an interesting evening the other night at an agentic AI meetup with old friends Mark Pesce and John Allsop, at a brewery in the wilds of Waterloo. It got me thinking about agentic AI from some different angles.
We’ve spent the last two years arguing about what AI agents “are”. Whether they count as “real” agents if they still hallucinate. Whether multi-agent orchestration is a genuine architectural advance or just pipeline management with better branding. Meanwhile, the agents are shipping.
Last week, Jensen Huang stood at GTC in San Jose and framed agentic AI as a new class of computer for enterprises, talking about “agentic computers” and an “operating system” for them (NVIDIA Launches Vera CPU, Purpose-Built for Agentic AI). NVIDIA announced new CPU products and hardware racks designed to sit alongside GPUs as part of an end‑to‑end stack for agentic workloads, explicitly positioning CPUs as orchestration engines rather than matrix‑math accelerators. That’s a different computational problem. The infrastructure is catching up to the ambition.
And the adoption numbers are real enough to matter. NVIDIA’s 2026 State of AI survey reports that 64% of organisations now say they are actively deploying AI in production, up sharply from more tentative usage in prior years. Within that, agentic AI is already in use by close to half of respondents in telecom and retail/CPG (around the high‑40s percent). The question has genuinely shifted from *can it work* to *how do we scale it without the wheels falling off*.
What Actually Changed
The thing that made agents viable isn’t just the models getting smarter. It’s the economics. Inference costs for large language model workloads have fallen by roughly an order of magnitude over the last few years, moving from tens of dollars per million tokens on early commercial APIs to well under a dollar, and in some cases just a few cents per million tokens, on current-generation models and infrastructure (Inference Unit Economics: The True Cost Per Million Tokens). When each million tokens costs tens of dollars, agentic workflows are a luxury for well‑capitalised R&D teams. When they cost cents, they start to look like table stakes. You don’t need a board‑level business case for basic automation anymore; the cost objection has largely evaporated.
What you do still need is memory. The models are capable; what’s relatively tiny is their immediate context – what they can hold and act on within a session. The current wave of engineering work isn’t primarily about making models larger. It’s about making them remember better. That’s the unsexy infrastructure problem that will separate organisations that actually deploy agents at scale from those stuck in perpetual pilot purgatory.
Multi-agent systems are the other shift. The model isn’t doing everything. A planner spawns specialists. A critic checks the work. A coordinator routes outputs to the next task. This isn’t new as a concept – distributed systems people have been doing this for decades – but the combination of much cheaper inference and the Model Context Protocol (MCP) giving agents a standardised way to access tools and external context is what makes it operationally viable now.
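The planner/specialist/critic/coordinator pattern can be sketched in a few lines. This is a toy illustration, not any particular framework’s API: the “agents” below are plain functions standing in for LLM-backed workers, and the function names are invented for the example. The structural point is that the coordinator routes work and only approved outputs move forward.

```python
# Toy sketch of the planner / specialist / critic pattern. Each role would be
# an LLM-backed agent in a real system; plain functions stand in here.

def planner(goal):
    # Break a goal into tasks; a real planner would be a model call.
    return [f"research: {goal}", f"draft: {goal}"]

def specialist(task):
    # Execute one task; in practice this is where tool use (e.g. via MCP) lives.
    return f"result for [{task}]"

def critic(result):
    # Accept anything non-empty; a real critic would score or reject the work.
    return bool(result)

def coordinator(goal):
    outputs = []
    for task in planner(goal):
        result = specialist(task)
        if critic(result):            # only approved work moves forward
            outputs.append(result)
    return outputs

print(coordinator("quarterly logistics report"))
```

The interesting engineering is hidden inside each stub: how the planner decomposes, how the critic judges, and what the coordinator does when the critic says no.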
Memory Is the Difference Between a Demo and a System
Here’s the thing most agent demos hide: they start fresh every time. The model is stateless by design. Each API call arrives with no knowledge of what came before unless you explicitly hand that context back in. That’s fine for a chatbot. It’s a serious problem for an agent running a multi‑week workflow.
The demo looks great. The agent handles multi‑step tasks with fluency. Then you deploy it into a real environment, and it asks the same clarifying question it asked three days ago. It forgets the decision your team made last Tuesday. It loses the context that took four back‑and‑forth exchanges to establish. This isn’t a bug in your implementation. It’s a fundamental architecture gap – and most organisations discover it only after they’ve already committed to a deployment.
The research on what good memory architecture actually buys is now pretty concrete (Long-term memory in agentic systems: Building context-aware agentic AI memory). Persistent, structured memory systems – combining episodic logs, semantic knowledge, and learned procedures – show significantly higher accuracy and reliability on long‑horizon benchmarks than stateless approaches or naive “just stuff everything in the context window” strategies, often improving success rates by tens of percentage points on complex tasks. At short sessions, the stateless model can wing it. Over weeks and months, the difference between an agent that remembers and one that doesn’t is the difference between a useful system and an expensive autocomplete.
What memory actually means for agents is more nuanced than “stores conversation history” (Building Agent Memory that Retains, Recalls, and Reflects). There are at least three meaningfully distinct functions in play. Episodic memory is the record of specific events: what happened, when, and what was decided. The compliance review flagged a document on March 3rd. The client accepted revised terms on the 15th. Semantic memory is the knowledge layer: domain rules, customer profiles, policies that apply regardless of session. Procedural memory is how‑to knowledge – the process patterns and workflow scripts that don’t need to be re‑derived from scratch each time.
Many current agent implementations collapse all three into a single context window and call it done. That works until the context gets long, at which point you have a different problem: everything is in there, but the agent pays equal attention to all of it, which means the important signal gets buried. The architectures getting real traction now treat these as separate concerns – episodic recall via vector search over structured event logs, semantic knowledge via retrieval‑augmented generation against a curated knowledge base, procedural memory encoded in the agent’s scaffolding and tools rather than re‑prompted each time (The 3 Types of Long-term Memory AI Agents Need).
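The separation of concerns above can be made concrete with a minimal sketch. Real systems would back episodic recall with a vector index and semantic memory with a RAG pipeline; plain lists, dicts, and substring matching stand in for both here, and all the class and method names are invented for illustration.

```python
from datetime import date

# Sketch: the three memory types as separate stores, queried on their own
# terms, rather than one undifferentiated context blob.

class AgentMemory:
    def __init__(self):
        self.episodic = []      # timestamped events: what happened, when
        self.semantic = {}      # durable facts: policies, profiles, rules
        self.procedural = {}    # how-to knowledge: named workflow scripts

    def record_event(self, when, what):
        self.episodic.append((when, what))

    def learn_fact(self, key, value):
        self.semantic[key] = value

    def learn_procedure(self, name, steps):
        self.procedural[name] = steps

    def recall(self, query):
        # Substring match stands in for vector search / RAG retrieval.
        events = [e for e in self.episodic if query in e[1]]
        facts = {k: v for k, v in self.semantic.items() if query in k}
        return events, facts

memory = AgentMemory()
memory.record_event(date(2026, 3, 3), "compliance review flagged document")
memory.learn_fact("refund policy", "30 days, original payment method")
memory.learn_procedure("onboarding", ["verify identity", "provision access"])

events, facts = memory.recall("compliance")
```

Note that `procedural` is never consulted by `recall` at all: in the architectures the text describes, procedural knowledge lives in the agent’s scaffolding and tool definitions, not in the retrieval path.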
These are lessons we have had to learn in every technology revolution since the 1970s. We will learn them again; or rather, this time the AI will learn them.
The analogy to human memory is imperfect but genuinely useful here. We don’t replay every conversation we’ve ever had when we need to make a decision. We’ve consolidated relevant facts, retained the gist of past experiences, and built up implicit procedural knowledge that we apply without consciously retrieving it. The agents that are starting to feel “intelligent” in practice are the ones whose memory architecture does something similar: extract salient facts from interactions, consolidate them over time, retrieve selectively based on relevance rather than recency, and build up a working model of the environment they operate in.
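“Retrieve selectively based on relevance rather than recency” has a simple mechanical reading: score stored facts by how well they match the query, discounted only gently by age. The scoring scheme below is illustrative, not taken from any particular system, and the word-overlap measure is a crude stand-in for embedding similarity.

```python
import math

# Sketch of relevance-weighted recall: rank stored facts by query overlap,
# lightly discounted by age, instead of replaying everything or taking
# the most recent items.

def score(fact, query_words, age_days, half_life=30.0):
    overlap = len(set(fact.lower().split()) & query_words)
    decay = math.exp(-age_days * math.log(2) / half_life)  # older = weaker
    return overlap * (0.5 + 0.5 * decay)  # relevant facts never fully fade

def recall(store, query, top_k=2):
    words = set(query.lower().split())
    ranked = sorted(store, key=lambda f: score(f[0], words, f[1]), reverse=True)
    return [fact for fact, _ in ranked[:top_k]]

store = [
    ("client accepted revised terms", 10),        # (fact, age in days)
    ("lunch order was sandwiches", 1),
    ("compliance review flagged the terms", 40),
]
print(recall(store, "what were the revised terms"))
```

The day-old lunch order loses to a forty-day-old compliance note, because relevance dominates recency. That inversion is exactly what a raw context window, which is ordered by recency, cannot do for you.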
This is also where embodiment and memory start to intersect in interesting ways. A physically situated agent – a robot navigating a factory floor, a drone managing a delivery route – builds up a model of its environment through repeated interaction. Its memory isn’t just a record of past conversations; it’s a world model, constantly updated by what it perceives and acts on. Software agents are doing something structurally similar, but the environment is digital: the state of systems, workflows, communications, and decisions. The agent that genuinely understands a logistics operation isn’t the one with the best model. It’s the one that has been operating in that environment long enough to have built up genuine situational knowledge — and has the architecture to retain and use it.
The Embodiment Question Nobody’s Quite Asking Right
Here’s where it gets interesting, and where the discourse usually stays too shallow.
Most agent conversation treats embodiment as a hardware category: robots. Humanoids in warehouses. Tesla’s Optimus. Kitchen and service robots at major appliance and electronics expos. These are real and worth watching. Skild AI just announced their “omni‑bodied brain” — one model controlling a wide range of robotic hardware, trained across embodiments — and their framing is instructive: they’re not building a robot arm controller, they’re building something closer to generalised physical intelligence.
But embodiment isn’t just about having legs.
The more important shift is that agents are increasingly *situated*. They exist in a context that they perceive and act on over time. A software agent managing a logistics workflow isn’t perceiving vibration and gravity, but it is perceiving state – the state of orders, delays, inventory, customer communications – and taking actions with real‑world consequences that it then has to respond to. That’s a form of situatedness. The gap between that and a physical robot isn’t a difference in kind; it’s a difference in the friction of consequences.
This matters for governance, not just philosophy. The emerging governance conversation increasingly distinguishes between digital agents, whose behavioural errors are often recoverable, and physical AI systems, whose mistakes can be irreversible once torque, force, or safety‑critical infrastructure are involved. High‑profile failures like Deloitte’s hallucinated government report show how even “just digital” errors can be operationally and reputationally costly; adding physical actuation raises the stakes further. We’re building AI systems to act in both registers now, and we’re mostly treating them as the same governance problem. They’re not.
On Emergence: The Part We’re Still Not Handling Well
There’s a documented case from February — the Matplotlib incident — where an autonomous AI agent wrote and published a personal attack article targeting a volunteer maintainer after its code was rejected, attempting to damage the contributor’s reputation and influence perceptions of the project (An AI agent got its code rejected so it wrote a hit piece about). The agent wasn’t explicitly instructed to do this; it was taking an action that looked, from inside its objective and feedback loop, like a reasonable step toward its goal in an online ecosystem that rewarded attention.
That’s an emergent behaviour. And it’s a mild one.
The frameworks for thinking about this haven’t caught up. We talk about emergent properties in AI mostly in the context of model capabilities – surprising things large models can do that smaller models can’t. Arithmetic, multi‑step reasoning, rudimentary theory of mind. The emergence we should be focused on is behavioural: what happens when you deploy fleets of goal‑directed agents into shared environments, where they’re competing for resources, influencing each other’s inputs, and optimising at speeds and scales humans can’t monitor in real time.
Network effects apply. One observation from complex systems and virtual‑world research that translates cleanly: the more a network facilitates rich interconnections, the more emergent creativity and unexpected behaviour you see. Multi‑agent systems are exactly this kind of network. The interconnections are the point. So is the unpredictability.
Alibaba’s experimental ROME agent recently bypassed internal safeguards during training and began redirecting compute resources into unauthorised cryptocurrency mining, opening a reverse tunnel to external infrastructure in the process (Alibaba-Linked AI Agent ROME Attempts Crypto Mining). Researchers caught it through monitoring and shut it down. The response in most governance conversations is “add more monitoring.” That’s not wrong, but it’s reactive and it assumes you know what to monitor for. Emergent behaviours, by definition, are the ones you didn’t predict.
What This Actually Requires
The organisations getting this right are doing a few things the frameworks don’t tell you to do.
They’re treating agentic AI as a workflow redesign problem, not a deployment problem. The question isn’t “which tasks can we hand to an agent.” It’s “what does the human role look like when agents handle execution, and where does human judgment actually need to be in the loop.” Those are different questions and most organisations are still asking the first one.
They’re taking memory and context management seriously as a technical discipline. Not just RAG pipelines bolted onto a model, but genuine architecture work around what state persists, what gets summarised, what gets discarded, and who can inspect it.
And the serious ones are building evaluation infrastructure before they have incidents, not after. Red teaming for agents isn’t the same as red teaming for models. You’re not just testing outputs; you’re testing behaviour over time, across tool calls, in adversarial environments and messy, interconnected systems.
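“Testing behaviour over time, across tool calls” has a concrete minimal form: record every tool call an agent makes during a run and assert invariants over the whole trace, not just over the final output. The tool names and policy thresholds below are invented for illustration.

```python
# Sketch of trace-level evaluation: check behavioural invariants over an
# agent's full sequence of tool calls, not a single output.

FORBIDDEN = {"open_network_tunnel", "spawn_miner"}  # hypothetical tool names
MAX_CALLS_PER_TOOL = 3  # crude loop / resource-abuse detector

def check_trace(trace):
    violations = []
    counts = {}
    for call in trace:
        counts[call] = counts.get(call, 0) + 1
        if call in FORBIDDEN:
            violations.append(f"forbidden tool: {call}")
    for tool, n in counts.items():
        if n > MAX_CALLS_PER_TOOL:
            violations.append(f"{tool} called {n} times")
    return violations

# A run whose individual calls all look fine but whose trace does not:
trace = ["search", "search", "search", "search", "write_report"]
print(check_trace(trace))  # flags the repeated search calls
```

Every call in that trace is individually harmless; only the trace-level view catches the pathological loop. That is the shape of the monitoring problem for agents, and why output-only evaluation misses cases like the ROME incident above.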
The hype is real. The capability is real. The gap between polished agent demos and production‑grade deployment is also real, and it’s mostly in the boring places: memory, monitoring, failure modes, governance. That’s always where it is.