Hugging Face has published a detailed guide to running Transformers.js inside a Chrome extension under Manifest V3 constraints. The reference implementation is the Transformers.js Gemma 4 Browser Assistant, available on the Chrome Web Store with source at github.com/nico-martin/gemma4-browser-extension. The post is written for developers who want to run local AI features in an extension — inference stays inside the browser, no external API calls required — and addresses the specific runtime constraints that make this non-obvious.

The core problem is that Manifest V3 fragments an extension into multiple isolated execution contexts: a background service worker, a side panel, and a content script. Each context has different capabilities, different lifetimes, and different security boundaries. Getting inference to work reliably requires making deliberate decisions about what runs where, and setting up a messaging contract that keeps those contexts coordinated.

Runtime separation: keep models in the background

The key design decision the guide emphasizes is to host all model inference in the background service worker and treat the side panel and content script as thin workers that request actions and render results. The background runs the control plane — agent lifecycle, model initialization, tool execution, and feature extraction. The side panel handles chat UI and streaming updates. The content script handles DOM extraction and highlight actions.

This separation has direct practical consequences. Because both models load in the background, there is a single model host for all tabs and sessions, which avoids duplicate memory usage and keeps the UI responsive. Model artifacts are cached under the extension’s own origin (chrome-extension://<extension-id>) rather than per-website origins, which means one shared cache covers the entire install across all tabs.

The extension uses two models with distinct roles. Text generation runs via onnx-community/gemma-4-E2B-it-ONNX at q4f16 quantization on WebGPU. Semantic similarity for the ask_website and find_history tools runs via onnx-community/all-MiniLM-L6-v2-ONNX at fp32. Splitting these responsibilities keeps the generation model focused on reasoning and tool decisions while offloading embedding work to a smaller, faster model.
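Under those settings, loading both pipelines with Transformers.js v3 is short. A sketch, with illustrative wrapper and variable names (the repo's actual loader code may differ):

```ts
import {
  pipeline,
  TextGenerationPipeline,
  FeatureExtractionPipeline,
} from "@huggingface/transformers";

let generator: TextGenerationPipeline | null = null;
let embedder: FeatureExtractionPipeline | null = null;

// Lazy, restart-safe loading: MV3 may kill the service worker at any time,
// and cached weights make re-initialization far cheaper than the first download.
async function loadModels(): Promise<void> {
  generator ??= (await pipeline(
    "text-generation",
    "onnx-community/gemma-4-E2B-it-ONNX",
    { dtype: "q4f16", device: "webgpu" },
  )) as TextGenerationPipeline;
  embedder ??= (await pipeline(
    "feature-extraction",
    "onnx-community/all-MiniLM-L6-v2-ONNX",
    { dtype: "fp32" },
  )) as FeatureExtractionPipeline;
}
```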

One constraint the guide flags explicitly: MV3 service workers can be suspended and restarted by Chrome at any time. Model runtime state must be treated as recoverable and re-initialized on restart. The model lifecycle is made explicit through two message types — CHECK_MODELS inspects what is already cached and estimates remaining download size, and INITIALIZE_MODELS downloads and initializes while emitting DOWNLOAD_PROGRESS back to the UI.
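A hypothetical sketch of that lifecycle handling, assuming initModels wraps the loader above and forwards Transformers.js's progress_callback option, and assuming "transformers-cache" as the library's default Cache API bucket (worth verifying against the repo):

```ts
// background.ts — hypothetical sketch of the model-lifecycle messages.
declare function initModels(onProgress: (p: unknown) => void): Promise<void>;

chrome.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
  if (msg.type === "CHECK_MODELS") {
    // List cached model files to report what is already downloaded; the
    // remaining-size estimate can be derived from known per-file sizes.
    caches
      .open("transformers-cache")
      .then((cache) => cache.keys())
      .then((requests) => sendResponse({ cached: requests.map((r) => r.url) }));
    return true; // keep the channel open for the async response
  }
  if (msg.type === "INITIALIZE_MODELS") {
    initModels((progress) =>
      // Stream download progress back to the side panel UI.
      chrome.runtime.sendMessage({ type: "DOWNLOAD_PROGRESS", progress }),
    ).then(() => sendResponse({ ok: true }));
    return true;
  }
});
```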

Messaging contract

All inter-runtime communication runs through typed message enums defined in src/shared/types.ts. The guide documents these in three groups. Side panel to background messages cover model lifecycle (CHECK_MODELS, INITIALIZE_MODELS), agent operations (AGENT_INITIALIZE, AGENT_GENERATE_TEXT, AGENT_GET_MESSAGES, AGENT_CLEAR), and feature extraction. Background to side panel messages cover download progress and message list updates. Background to content script messages cover page data extraction, element highlighting, and clearing highlights.
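A condensed sketch of what src/shared/types.ts might contain. The first two groups use names quoted in the guide; the content-script message names are assumptions:

```ts
export enum SidePanelToBackground {
  CHECK_MODELS = "CHECK_MODELS",
  INITIALIZE_MODELS = "INITIALIZE_MODELS",
  AGENT_INITIALIZE = "AGENT_INITIALIZE",
  AGENT_GENERATE_TEXT = "AGENT_GENERATE_TEXT",
  AGENT_GET_MESSAGES = "AGENT_GET_MESSAGES",
  AGENT_CLEAR = "AGENT_CLEAR",
  // ...plus a feature-extraction request, whose exact name the guide omits
}

export enum BackgroundToSidePanel {
  DOWNLOAD_PROGRESS = "DOWNLOAD_PROGRESS",
  MESSAGES_UPDATE = "MESSAGES_UPDATE",
}

export enum BackgroundToContentScript {
  EXTRACT_PAGE_DATA = "EXTRACT_PAGE_DATA", // hypothetical name
  HIGHLIGHT_ELEMENTS = "HIGHLIGHT_ELEMENTS", // hypothetical name
  CLEAR_HIGHLIGHTS = "CLEAR_HIGHLIGHTS", // hypothetical name
}
```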

The orchestration rule the guide states is simple: the background is the single coordinator; the side panel and content script are specialized workers that request actions and render results. A typical flow is: side panel sends AGENT_GENERATE_TEXT, background appends to Agent.chatMessages and runs model and tool steps, background emits MESSAGES_UPDATE, side panel re-renders from the updated message list. Conversation history lives in the background, not in the UI.
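In code, that round trip is short. A hedged sketch with illustrative handler names; the two listeners would live in the side panel and background bundles respectively:

```ts
// Illustrative declarations; the real Agent class lives in the background bundle.
declare const agent: {
  chatMessages: { role: string; content: string }[];
  runTurn(): Promise<void>; // model + tool steps
};
declare function renderChat(messages: { role: string; content: string }[]): void;

// Side panel: request a generation, then re-render whenever the background
// broadcasts the canonical message list.
function sendPrompt(prompt: string): void {
  chrome.runtime.sendMessage({ type: "AGENT_GENERATE_TEXT", prompt });
}

chrome.runtime.onMessage.addListener((msg) => {
  if (msg.type === "MESSAGES_UPDATE") renderChat(msg.messages);
});

// Background: the single coordinator appends to the canonical history,
// runs the model and any tool steps, then pushes the update to the UI.
chrome.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
  if (msg.type === "AGENT_GENERATE_TEXT") {
    agent.chatMessages.push({ role: "user", content: msg.prompt });
    agent.runTurn().then(() => {
      chrome.runtime.sendMessage({
        type: "MESSAGES_UPDATE",
        messages: agent.chatMessages,
      });
      sendResponse({ done: true });
    });
    return true; // async response
  }
});
```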

Tool calling and the agent loop

The guide explains how Transformers.js handles tool calling before describing the extension's implementation. You pass messages and a tool schema — name, description, and parameters — and the library formats the prompt using the model's chat template. With Gemma-4-style templates, the model emits a special tool-call token block when it decides to invoke a tool. The raw output looks like <tool_call>call:getWeather{location:"Bern"}</tool_call>, which must be parsed into deterministic tool executions.
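On the input side, a minimal sketch of passing a tool schema, assuming Transformers.js's apply_chat_template accepts the same tools option as the Python library (the getWeather tool mirrors the example output above):

```ts
import { AutoTokenizer } from "@huggingface/transformers";

// Tool schema: name, description, and JSON-schema-style parameters.
const tools = [
  {
    name: "getWeather",
    description: "Get the current weather for a location.",
    parameters: {
      type: "object",
      properties: { location: { type: "string" } },
      required: ["location"],
    },
  },
];

const tokenizer = await AutoTokenizer.from_pretrained(
  "onnx-community/gemma-4-E2B-it-ONNX",
);

// The chat template injects the schema into the prompt; the model then
// decides per turn whether to answer directly or emit a tool-call block.
const prompt = tokenizer.apply_chat_template(
  [{ role: "user", content: "What's the weather in Bern?" }],
  { tools, tokenize: false, add_generation_prompt: true },
);
```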

The extension wraps this in two layers: webMcp in src/background/agent/webMcp.tsx normalizes extension tools into a compatible schema, and extractToolCalls parses model output into structured calls. This normalization is necessary because the raw model output format is not directly executable.
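The parsing side can be as simple as a regex over the delimited block. A hypothetical sketch of what extractToolCalls might do, assuming the output format shown above (the real implementation lives in src/background/agent/):

```ts
interface ToolCall {
  name: string;
  args: Record<string, string>;
}

// Hypothetical parser: pull call:name{key:"value", ...} out of each
// tool-call block and turn it into a structured, executable call.
function extractToolCalls(output: string): ToolCall[] {
  const calls: ToolCall[] = [];
  const blockRe = /<tool_call>call:(\w+)\{([^}]*)\}<\/tool_call>/g;
  for (const [, name, rawArgs] of output.matchAll(blockRe)) {
    const args: Record<string, string> = {};
    for (const [, key, value] of rawArgs.matchAll(/(\w+):"([^"]*)"/g)) {
      args[key] = value;
    }
    calls.push({ name, args });
  }
  return calls;
}
```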

On permissions: the guide treats these as architecture decisions, not an afterthought. The manifest requests sidePanel, storage, scripting, tabs, and host_permissions for http(s)://*/*. The broad host permissions are required because content extraction and highlighting are designed to work on arbitrary websites. The guide notes that permissions define user trust and Chrome Web Store review risk, and recommends requesting only what features actually need, with clear disclosure that inference runs locally inside the extension runtime.
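A condensed sketch of the corresponding manifest entries; the permissions and host_permissions values are as described, while the background and side_panel keys are standard MV3 boilerplate rather than quotes from the guide:

```json
{
  "manifest_version": 3,
  "permissions": ["sidePanel", "storage", "scripting", "tabs"],
  "host_permissions": ["http://*/*", "https://*/*"],
  "background": { "service_worker": "background.js", "type": "module" },
  "side_panel": { "default_path": "sidepanel.html" }
}
```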

The overall pattern described is applicable beyond this specific extension. The architectural principles — single model host in the background, typed messaging contract, explicit model lifecycle management, and local-first inference — generalize to any MV3 extension that wants to run on-device AI without external API dependencies.