Anthropic wants its AI agent to control your computer

Demonstrations of AI agents may look stunning, but getting the technology to work reliably and without annoying (or expensive) bugs in real life can be challenging. Current models can answer questions and converse with near-human skill, and they are the backbone of chatbots such as OpenAI’s ChatGPT and Google’s Gemini. Given a simple command, they can also perform tasks on a computer by accessing the screen and input devices such as the keyboard and trackpad, or through low-level software interfaces.

Anthropic says Claude outperforms other AI agents on several key benchmarks, including SWE-bench, which measures the agent’s software development skills, and OSWorld, which measures the agent’s capacity to use a computer operating system. The claims have yet to be independently verified. Anthropic says Claude gets OSWorld tasks right 14.9 percent of the time. That’s well below humans, who typically achieve around 75 percent, but significantly higher than the current best agents—including OpenAI’s GPT-4—which succeed roughly 7.7 percent of the time.

Anthropic says several companies are already testing the agentic version of Claude. These include Canva, which uses it to automate design and editing tasks, and Replit, which uses the model for coding tasks. Other early adopters include The Browser Company, Asana, and Notion.

Ofir Press, a postdoctoral researcher at Princeton University who helped develop SWE-bench, says agentic AI tends to lack the ability to plan far ahead and often struggles to recover from mistakes. “To show that they are useful, we need to achieve strong performance on difficult and realistic metrics,” he says, such as reliably planning a wide range of trips for a user and booking all the necessary tickets.

Kaplan notes that Claude can already troubleshoot some problems surprisingly well. When it hit a terminal error while trying to start a web server, for example, the model knew how to revise its command to fix it. It also worked out that it needed to enable pop-ups when it ran into a dead end while browsing the web.

Many tech companies are now racing to develop AI agents as they chase market share and prominence. In fact, it may not be long before many users have agents at their fingertips. Microsoft, which has invested more than $13 billion in OpenAI, says it is testing agents that can use Windows computers. Amazon, which has invested heavily in Anthropic, is exploring how agents could recommend and ultimately buy goods for its customers.

A key challenge with agentic AI is that mistakes can be much more problematic than a garbled chatbot response. Anthropic has put certain restrictions on what Claude can do, for example by limiting its ability to use a person’s credit card to buy things.

If mistakes can be avoided well enough, says Press of Princeton University, users may learn to see AI and computers in a whole new way. “I’m very excited about this new era,” he says.
