Bottom line
Anthropic reports that Claude Opus 4.7, operating through Claude Code with a researcher approving commands, completed several robodog setup and control tasks about 20 times faster than the fastest human team from the earlier Project Fetch experiment. The same source says the model still struggled with precise physical “fetching.” Read it as a red-team warning signal, not a robotics victory lap.
What Anthropic says it tested
Project Fetch is Anthropic’s internal experiment that uses an off-the-shelf robotic quadruped (a “robodog”) to test how much frontier models can help with sophisticated physical-tool tasks. In the August 2025 phase, Anthropic employees who were not robotics experts worked through tasks such as using the manufacturer controller, connecting to the robot, reading sensors, and trying to get the robot to interact with a beach ball.
In the June 18, 2026 phase-two update, Anthropic says the test was revisited with newer models. The researcher’s role was limited: plug in a laptop running Claude Code, enter the initial prompt, approve commands, and approve movement to the next task. Anthropic reports that Claude Opus 4.7 was “about 20 times faster than the fastest human team” on tasks completed by participants less than a year earlier.
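The researcher's limited role described above is a human-in-the-loop approval gate. The following is a minimal sketch of that pattern only, not Anthropic's test harness and not how Claude Code implements permissions. Every name in it (`run_with_approval`, `ApprovalLog`, the sample command strings, the deny rule) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class ApprovalLog:
    """Record of which proposed commands were approved or rejected."""
    approved: List[str] = field(default_factory=list)
    rejected: List[str] = field(default_factory=list)


def run_with_approval(
    proposed: Iterable[str],
    approve: Callable[[str], bool],
    execute: Callable[[str], None],
) -> ApprovalLog:
    """Execute each proposed command only if the approver says yes."""
    log = ApprovalLog()
    for command in proposed:
        if approve(command):
            execute(command)
            log.approved.append(command)
        else:
            # Rejected commands are never passed to execute().
            log.rejected.append(command)
    return log


# Hypothetical stand-ins: a real setup would prompt a person and
# talk to real hardware. Here the "approver" is a simple rule and
# "execution" just records the command.
executed: List[str] = []


def cautious_approver(command: str) -> bool:
    # Assumption for illustration: block anything that moves the robot.
    return not command.startswith("walk")


demo_log = run_with_approval(
    proposed=["ping robot", "read lidar", "walk forward 5m"],
    approve=cautious_approver,
    execute=executed.append,
)
```

The design point is that the agent proposes and the human disposes: nothing reaches `execute` without passing through `approve`, and the log keeps both outcomes so a reviewer can audit what was attempted as well as what ran.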
Why the result matters
The important part is not the robot dog itself but the pattern: a general-purpose model can inspect an unfamiliar technical system, choose interfaces, write code, recover from some mistakes and make progress on hardware tasks. That is the same broad agentic pattern already visible in coding, cyber and lab-science settings, only now it touches the physical world.
For Managing Expectations, this belongs in the AI Papers Library because it is a clean example of frontier-AI progress that should be tracked with source discipline. The correct label is not “robots are solved.” The better label is: AI agents are improving at tool use, and physical tool use has different safety stakes than text-only chat.
What the source does not prove
- It does not prove general robotic competence.
- It does not prove autonomous real-world deployment is safe or ready.
- It does not prove that industrial robots, drones, vehicles, homes or hospitals can be delegated to frontier models without separate evaluation.
- It does not replace independent replication or standardized robotics benchmarks.
Anthropic itself includes the key caveat: the latest Claude models still struggled with the precise beach-ball retrieval part of the task. That matters. Physical action requires timing, contact, perception, friction, error recovery and situational constraints that are easy to hide inside a simple headline.
The practical read
Project Fetch Phase Two is best read as a bridge between three AI safety lanes:
- Agentic coding: models increasingly write and run code with less step-by-step human direction.
- Cyber and tool-use red teaming: model capability can change the time needed to exploit or operate technical systems.
- Physical-agent safety: once code controls hardware, “mostly works” can become a safety problem rather than just a software bug.
The useful public question is not whether a robot dog can fetch a beach ball. The useful question is what happens when capable agents are given access to more off-the-shelf tools, APIs, sensors, actuators and payment or logistics systems. That is where claims about both hype and risk need to be slowed down and checked against evidence.
Source trail
- Anthropic Research — Project Fetch: Phase two
- Anthropic Research — Agentic coding and persistent returns to expertise
- Anthropic Research — Measuring LLMs’ impact on N-day exploits
- Managing Expectations source note for this article
Managing Expectations framing
Possible does not mean solved. Faster does not mean safe. A company red-team post is not the same as an independent benchmark. But this is exactly the kind of source-visible signal the AI Papers Library is meant to preserve.