AI agents that do your work while you sleep sound great. The reality is far messier—‘it’s like a toddler that needs to be overseen’

2026-02-23 20:06:54

Summer Yue may work on safety and alignment on Meta’s superintelligence team, but even she admits she isn’t immune to overconfidence when it comes to autonomous AI agents.

AI agents work best when tasks are simple and low-stakes

Shyamal Anadkat, who previously worked as an applied AI engineer at OpenAI, said most of today’s successful agents still require frequent human check-ins or are limited to tightly bounded, well-defined tasks—though he emphasized that this will change as measurement and evaluation techniques improve.

“A system that’s 95% accurate on individual steps becomes chaotic over a 20-step autonomous workflow,” Anadkat said. “Long-horizon planning is still weak.” As a result, he explained, agents may perform well on short task chains but tend to fall apart when asked to manage complex, multi-day projects. Memory is another major limitation: “In many agents, memory is either nonexistent or fragile. You need systems that can maintain a coherent model of your work context, priorities, and constraints.”

That doesn’t mean the promise of AI agents is all smoke and mirrors, according to Yoav Shoham, a former principal scientist at Google, a professor emeritus at Stanford and co-founder of AI21 Labs. But it does mean there is the danger of people getting ahead of themselves. Today’s AI agents, he explained, work best when the task is low-risk, loosely defined, and cheap to get wrong.

“Developers like toys, and you have this toy that can do wonderful things,” he told Fortune. “As long as what they’re doing is fairly simple and fairly low stakes with high tolerance for error, that’s fine.” For example, if you wanted your agent to read 10,000 websites and do something interesting with the results to give you tidbits of information overnight that could be useful.

But for mission-critical enterprise workflows, the bar is much higher. Companies need systems that are verifiable, repeatable, and cost-effective—requirements that quickly erode the set-it-and-forget-it promise of fully autonomous, always-on agents. In highly structured domains like coding or math, deeper automation is already possible. But for most real-world business processes, Shoham says, the work required to make agents reliable often outweighs the benefit.

Bret Greenstein, chief AI officer at consulting firm West Monroe, pointed out that tools like OpenClaw feel like a tipping point similar to what happened with generative AI when ChatGPT launched in 2022—for the first time, it has made the idea of AI agents accessible. Still, it’s not a 24/7 “magic solution.”

“It can work for a long time, cranking away on things, but it’s like a toddler that needs to be overseen,” he said. Some tasks are reasonable to do while you are sleeping, like scanning LinkedIn messages or tracking news. “I’m not sure I would have it answering customer feedback while I’m sleeping,” he said.

Ability to delegate to an AI agent feels powerful

Still, there is little doubt that the ability to delegate real-world tasks to an AI agent is deeply compelling for users, Greenstein emphasized. He pointed to his own experience handing an AI agent the mundane task of getting his clothes picked up to be dry cleaned—and watching it quietly complete the job end to end.

The agent independently contacted the cleaner, worked out pickup logistics through email exchanges, coordinated timing, monitored a doorbell camera to confirm the pickup, and notified Greenstein once the task was complete. The episode illustrated how agents can operate across multiple systems and adapt when things don’t go as planned. But it also underscored why such tools still require strict guardrails and oversight—especially before they are deployed in enterprise settings.

“OpenClaw is set up so it shouldn’t feel safe for most people,” Greenstein said. “It doesn’t feel mature enough to be a trusted part of our lives yet.” For AI to be welcomed into everyday life or business operations, he added, it has to earn trust over time—much the way trust is established socially.

Even so, demand is already evident. Greenstein pointed to meetups and early industry gatherings dedicated to OpenClaw, a rapid emergence he described as unusual for such a young tool. “It shows the hunger people have for AI that’s actually useful,” he said—systems that move beyond answering questions and start taking action.

Aaron Levie, CEO of cloud-based content management and collaboration company Box, called what is happening now with AI agents “little glimmers” of what might happen in the future.

“Some glimmers end up not manifesting, some glimmers just become the standard,” he explained, pointing to two years ago when AI company Cognition introduced an early agent called Devin that would integrate with Slack for task delegation, bug fixes, data analysis, and code review. At the time, it was still seen as futuristic, but today, “no one is confused that this is a standard practice,” he said. “You can just Slack Claude Code to go work on stuff – what seemed like a totally crazy idea is now basically the standard of any modern engineering team.”

But while AI agents are becoming very good at automating specific, discrete tasks, they remain poor at handling the broader, context-heavy work that makes up most jobs, Levie emphasized. AI agents may fully automate a handful of tasks, but struggle with the rest – including navigating relationships and participating in meetings.

“When you hear an AI lab say we’re going to automate all knowledge work in 24 months, that’s usually a very narrow definition of jobs,” he said. “The definition of what an agent can do is not the same definition of what the job is that gets hired in the economy.”

The trust factor matters for when things can go wrong

Avinash Vootkuri, a staff data scientist at a top Fortune 500 retailer, said that most enterprise AI agents “absolutely require a babysitter” and, for now, can only work in enterprise settings with tightly bounded autonomy and extensive guardrails. “The stakes are massive,” he explained.

For example, he described building an agentic system for enterprise cybersecurity where AI agents don’t simply trigger alerts and wait for human review, but actively investigate them. Instead of flooding analysts with thousands of warnings, the agents gather evidence in real time—querying threat-intelligence databases, analyzing behavioral patterns, and filtering out false positives—before deciding whether a situation warrants escalation.

The system relies on tightly bounded autonomy and extensive guardrails, reducing human workload without removing oversight.

In cybersecurity, he explained, if the agent gets it wrong, the consequences are immediate and severe. “The AI either blocks legitimate customers (causing massive revenue loss) or it lets a sophisticated threat actor into the network,” he said. “It absolutely matters if things go wrong.”

According to Breanna Whitehead, who runs an AI operations consultancy where she builds AI-powered systems for executives and founders, the industry is in a “trust calibration phase.”

AI agents can do more than most people let them, but less than the hype suggests.

“The real skill isn’t building the agent — it’s designing the handoff,” she explained. “Most people either over-trust agents and end up cleaning up messes, or they micromanage every output and wonder why AI feels like more work instead of less.” The idea, she said, is to design clear handoff points, where something might be fully delegated, another thing might get a quick review, while another task stays just for humans to do.

For now, she said, agents are “genuinely excellent” what she called the middle layer of knowledge work — “the stuff that used to eat 2-3 hours of a smart person’s day, like synthesizing meeting notes into action items, drafting follow-up emails in someone’s voice, pulling together research briefs, organizing competing priorities into a clear plan.”

But anything that requires reading a room, navigating ambiguity, or making judgment calls that depend on relationships are not ready for AI agent prime time. “I had a client who wanted to fully automate their investor communications,” she said. “The AI could draft beautifully, but it couldn’t sense when a funder was losing interest and needed a different approach. The agent drafted the email, but the human had to decide whether to send it.”

For now, sleep may be elusive when working with AI agents

For now, working with AI agents may have less to do with sleeping while they work than with staying half-awake while they do. Tools like OpenClaw can run for hours at a time, but for many early users, that autonomy comes with a new kind of vigilance—checking logs, reviewing outputs, and stepping in before things go wrong.

That dynamic was captured in a recent viral post titled Token Anxiety, in which investor Nikunj Kothari described a friend leaving a party early—not because he was tired, but because he wanted to get back to his agents. “Nobody questions it anymore,” Kothari wrote. “Half the room is thinking the same thing. The other half are probably checking the progress of their agents. At a party.”

The dream of AI that works while you sleep may be real. But for now, it’s still keeping a lot of people awake.

**Join us at the Fortune Workplace Innovation Summit **May 19–20, 2026, in Atlanta. The next era of workplace innovation is here—and the old playbook is being rewritten. At this exclusive, high-energy event, the world’s most innovative leaders will convene to explore how AI, humanity, and strategy converge to redefine, again, the future of work. Register now.

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.