Adept previews ACT-1, a large-scale transformer trained to use digital tools via a browser

Adept has published a capability preview of ACT-1, which the company describes as a large-scale transformer trained to use digital tools. According to the Adept blog post, ACT-1 is currently connected to a Chrome extension that allows it to observe what is happening in a browser and take actions including clicking, typing, and scrolling.

The post describes the observation as “a custom ‘rendering’ of the browser viewport that’s meant to generalize across websites,” with the action space consisting of UI elements available on the page.
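
The post does not specify what this rendering or action space looks like in practice. As a purely illustrative sketch, one way to picture the idea is an observation that lists the UI elements on the page, with click and type actions restricted to those elements. Every name below (UIElement, Observation, Action, is_valid) is a hypothetical stand-in and not Adept's data format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical types for illustration only; the post does not describe
# ACT-1's actual observation or action representation.

@dataclass(frozen=True)
class UIElement:
    element_id: int
    role: str    # e.g. "button" or "textbox"
    label: str   # visible text or accessible name


@dataclass(frozen=True)
class Observation:
    url: str
    elements: Tuple[UIElement, ...]  # simplified stand-in for the viewport "rendering"


@dataclass(frozen=True)
class Action:
    kind: str                        # "click", "type", or "scroll"
    target_id: Optional[int] = None  # a scroll may have no target element
    text: str = ""                   # only meaningful for "type"


def is_valid(action: Action, obs: Observation) -> bool:
    """Accept scrolls, plus clicks and typing aimed at an element in the observation."""
    if action.kind == "scroll":
        return True
    if action.kind not in ("click", "type"):
        return False
    return any(e.element_id == action.target_id for e in obs.elements)
```

In this sketch, a click on an element that is not part of the current observation is rejected, which mirrors the idea of an action space made up of the UI elements available on the page.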

What the post shows ACT-1 doing

The post demonstrates several capabilities in videos. In one example, a user types a command and ACT-1 executes it by “repeatedly taking actions and observations over a long time horizon to fulfill a single goal.” A specific example involves Salesforce: the post says what might ordinarily require “10+ clicks in Salesforce can be now done with just a sentence.”
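
The quoted loop of repeated observations and actions toward a single goal can be pictured with a minimal sketch. It assumes a hypothetical policy function that returns the next action, or None when it considers the goal complete. The names run_episode, ToyBrowser, and scripted_policy, along with the action strings, are made up for illustration; the post does not publish ACT-1's interface or control loop.

```python
from typing import Callable, List, Optional

# Hypothetical signature: (goal, observation, history) -> next action or None.
Policy = Callable[[str, str, List[str]], Optional[str]]


def run_episode(goal: str,
                observe: Callable[[], str],
                act: Callable[[str], None],
                policy: Policy,
                max_steps: int = 20) -> List[str]:
    """Alternate observing and acting until the policy stops or the step budget runs out."""
    history: List[str] = []
    for _ in range(max_steps):
        observation = observe()
        action = policy(goal, observation, history)
        if action is None:  # the policy signals that the goal is fulfilled
            break
        act(action)
        history.append(action)
    return history


class ToyBrowser:
    """Stand-in for a browser: records actions and reports how many it has seen."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def observe(self) -> str:
        return f"{len(self.log)} actions taken"

    def act(self, action: str) -> None:
        self.log.append(action)


def scripted_policy(goal: str, observation: str, history: List[str]) -> Optional[str]:
    # A fixed script stands in for the model; a real policy would read the observation.
    script = ["click:New Lead", "type:Name=Ada", "click:Save"]
    return script[len(history)] if len(history) < len(script) else None
```

Calling run_episode("add a lead", browser.observe, browser.act, scripted_policy) with a ToyBrowser instance plays the three scripted actions and then stops. The max_steps cap is a design choice of this sketch, so that a policy that never signals completion cannot loop forever.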

The post also describes ACT-1 working within spreadsheet tools, where it “demonstrates real-world knowledge, infers what we mean from context, and can help us do things we may not even know how to do.” A further example involves tasks that require composing multiple tools, which the post says reflects how most computer tasks span more than one application.

The post also shows ACT-1 handling cases where it lacks information. It says “when the model doesn’t know something, it knows how to just look up the information online,” which it demonstrates in what it describes as a voice mode example.

On coachability, the post states: “ACT-1 doesn’t know how to do everything, but it’s highly coachable. With 1 piece of human feedback, it can correct mistakes.”

The post does not provide benchmark numbers, error rates, accuracy figures, or latency specifications. On speed, it notes: “There’s a lot of room to make it faster, both on the modeling side and on the software side — so we expect future systems will have latency that’s largely imperceptible to humans.” The videos, the post says, are sped up to make them easier to view.

How Adept frames the goal

The post frames general intelligence as “a system that can do anything a human can do in front of a computer.” It describes a foundation model for actions, one “trained to use every software tool, API, and webapp that exists,” as “a practical path to this ambitious goal,” and calls ACT-1 “our first step in this direction.”

The post includes a list of projected changes the company says it believes will occur “a few years from now”: most computer interaction will shift to natural language; beginners will be able to implement ideas without software training; documentation and manuals will be written for models rather than people; and advances across scientific fields will be accelerated by AI collaboration. These projections are presented as the company’s own beliefs, not as findings from research.

Safety acknowledgment

The post includes a brief safety section. The company acknowledges that its systems “have the potential to cause harm if misused or misaligned with user preferences.” It states the goal is to build a company with “large-scale human feedback at the center,” with models evaluated on how well they satisfy user preferences. To address misuse, the post says Adept plans to use “a combination of machine learning techniques and careful, staged deployment.”

The post refers to a follow-up technical post covering the model’s architecture and training details, which had not been published at the time of this blog entry.