8. rysh-chrome-plugin — Browser Control & Chat
rysh-chrome-plugin (~3.4K LOC of React/TypeScript in src/; ~5K including the dead root-level .js files — see §8.1) is a Chrome MV3 side-panel extension. It connects to rysh-server's NATS over a WebSocket and does two things: (1) lets the user chat with a server-side AI agent, and (2) lets that agent control the user's browser — navigate, click, type, scroll, screenshot, extract content, manage tabs.
8.1 What runs where
graph TB
subgraph ext["Chrome extension"]
SW["Background service worker
(background.js — legacy JS)
• sidePanel behavior
• GET_PAGE_CONTEXT injection
• auth check / sign-out
(NO NATS)"]
Panel["Side-panel React app
(popup.html → src/main.tsx)
• NATS WebSocket
• chat UI + executor"]
Inject["Injected MAIN-world scripts
(chrome.scripting, per-action)
• browser-executor
• selector-resolver"]
AuthTab["auth.html / auth.js
(standalone onboarding)"]
end
Panel <-->|NATS over WS| Server["rysh-server
browser_pane_proxy"]
Panel -->|chrome.scripting.executeScript| Inject
SW -->|opens| AuthTab
Panel -->|runtime.sendMessage GET_PAGE_CONTEXT| SW
- Background service worker (background.js): sets sidePanel.setPanelBehavior({openPanelOnActionClick}), opens the auth tab when unauthenticated, and handles a chrome.runtime.onMessage bus (CHECK_AUTH, SIGN_OUT, OPEN_AUTH, GET_PAGE_CONTEXT). It does not touch NATS. GET_PAGE_CONTEXT injects a MAIN-world script that returns page metadata plus body text capped at 30000 chars (falling back to tab metadata on protected pages):
func: () => {
const MAX_BODY = 30000;
return {
title: document.title, url: location.href,
selected: getSelection()?.toString() ?? '',
description: document.querySelector('meta[name="description"]')?.content ?? '',
body: (document.body?.innerText || '').substring(0, MAX_BODY).trim(),
};
}
- Side-panel React app: the live UI; hosts the persistent NATS WebSocket, the chat, and the BrowserActionExecutor.
- No persistent content script: all page interaction is dynamic chrome.scripting.executeScript injection into the MAIN world. injectSelectorResolver installs window.__rysh_resolve_selector once per page.
Manifest (manifest_version: 3, name "Rysh AI", version 0.1.0): permissions: [storage, activeTab, scripting, tabs, sidePanel]; host_permissions: ["https://*.rysh.ai/*", "<all_urls>"]; side_panel.default_path: "popup.html"; background: {service_worker: "background.js", type: "module"}; web_accessible_resources: [auth.html]. There is no default_popup on the action — the panel opens via setPanelBehavior({openPanelOnActionClick: true}).
Legacy vs modern code
| Files | Status |
|---|---|
| background.js, authService.js, storage.js, auth.html/auth.js/auth.css | ACTIVE: service worker + standalone auth tab (copied to dist/ by Vite) |
| src/ React/TS app | ACTIVE: the live UI |
| api.js, nats-client.js, popup.js | DEAD LEGACY: not copied, not referenced; superseded by src/services/*. They even use an old subject namespace (agentic vs the current llm_prompt_execution). |
8.2 The NATS client (src/services/nats-client.ts)
A thin WebSocket wrapper speaking the Rysh envelope protocol (NOT raw NATS — the server's browser_pane_proxy bridges WS↔NATS).
Wire frame: JSON { subject, data: NATSEnvelope } where the TS envelope uses short keys and a base64-JSON payload:
interface NATSEnvelope { t: string; r: string; p: string; } // p = base64(JSON(payload))
static encode(typeTag: string, payload: Record<string, unknown>): NATSEnvelope {
const bytes = new TextEncoder().encode(JSON.stringify(payload)); // UTF-8 bytes
let binary = ''; bytes.forEach(b => { binary += String.fromCharCode(b); });
return { t: typeTag, r: '', p: btoa(binary) }; // UTF-8-safe base64
}
The UTF-8 TextEncoder→btoa dance keeps Unicode page text safe: a plain btoa(JSON.stringify(payload)) throws InvalidCharacterError as soon as the string contains a code unit above 0xFF. Auto-reconnect uses _baseReconnectDelay = 1000ms doubling to a 30000ms cap, up to _maxReconnectAttempts = 20; handlers (keyed by exact subject) survive reconnect. publish does not throw when the socket isn't OPEN: it logs a warning and drops the message.
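As a rough illustration of the two mechanisms in this paragraph, the sketch below pairs the encoder with a matching decoder and models the reconnect schedule as a pure function. `encodePayload`, `decodePayload`, and `reconnectDelay` are hypothetical names, and the 0-based attempt counter is an assumption; the real client may count attempts differently.

```typescript
// Sketch only: helper names are hypothetical, not taken from nats-client.ts.
interface NATSEnvelope { t: string; r: string; p: string; }

// Same steps as encode(): JSON -> UTF-8 bytes -> binary string -> base64.
function encodePayload(typeTag: string, payload: Record<string, unknown>): NATSEnvelope {
  const bytes = new TextEncoder().encode(JSON.stringify(payload));
  let binary = '';
  bytes.forEach((b) => { binary += String.fromCharCode(b); });
  return { t: typeTag, r: '', p: btoa(binary) };
}

// Reverse direction: base64 -> binary string -> UTF-8 bytes -> JSON.
function decodePayload(env: NATSEnvelope): Record<string, unknown> {
  const binary = atob(env.p);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(bytes)) as Record<string, unknown>;
}

// Delay before reconnect attempt `attempt` (assumed 0-based):
// 1000ms doubling per attempt, capped at 30000ms, null once attempts run out.
function reconnectDelay(
  attempt: number,
  base = 1000,
  cap = 30000,
  maxAttempts = 20,
): number | null {
  if (attempt >= maxAttempts) return null;
  return Math.min(base * 2 ** attempt, cap);
}
```

Under these assumptions the delays run 1000, 2000, 4000, 8000, 16000 and then stay at 30000 until the attempt budget is exhausted.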
Bootstrap (api.ts): POST {serverURL}/api/browser-panes with Authorization: Bearer {token} → { pane_id, ws_url }; the WS URL swaps https→wss and appends ?token=… (token auth via query param).
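The URL rewrite can be sketched as a small pure function. `toWebSocketURL` is a hypothetical name, the http→ws branch is an assumption (the text only mentions https→wss), and the host and path in the usage note are placeholders.

```typescript
// Sketch only: derive the token-bearing WebSocket URL from the ws_url
// returned by POST /api/browser-panes.
function toWebSocketURL(wsUrl: string, token: string): string {
  // https:// -> wss:// (and, as an assumption, http:// -> ws://).
  const swapped = wsUrl.replace(/^http/, 'ws');
  const url = new URL(swapped);
  // Token auth travels as a query parameter; searchParams handles escaping.
  url.searchParams.set('token', token);
  return url.toString();
}
```

For example, `toWebSocketURL('https://rysh.example/ws/panes/p1', 'abc123')` yields `wss://rysh.example/ws/panes/p1?token=abc123`.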
Subjects (all prefixed rysh.pane.{paneID}.)
| Publishes (extension → server) | Tag |
|---|---|
| .llm_prompt_execution.inbox | MsgAgenticPrompt |
| .llm_prompt_execution.inbox | MsgAgenticCancel |
| .approval.response | MsgApprovalResponse |
| .browser.response | MsgBrowserActionResponse |

| Subscribes (server → extension) | Purpose |
|---|---|
| .llm_prompt_execution.output | streaming output {type, content, metadata} |
| .llm_prompt_execution.status | {phase, iteration, max_iterations} |
| .approval.request | {request_id, type, description, diff, choices} |
| .output.{shell,rysh,chat} | per-mode terminal text |
| .share.command.inbound | a remote CLI sent a command |
| .browser.request | the agent's browser-control requests |
These map field-for-field to rysh-shared/msg (MsgBrowserActionRequest/Response, etc.).
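One way to keep the subject strings from the tables above in a single place is a typed suffix map. This is a sketch: `PANE_SUFFIXES`, `paneSubject`, and `terminalOutputSubjects` are hypothetical names, and the key names are invented for readability.

```typescript
// Sketch only: suffixes copied from the subject tables; names are illustrative.
const PANE_SUFFIXES = {
  promptInbox: 'llm_prompt_execution.inbox',
  promptOutput: 'llm_prompt_execution.output',
  promptStatus: 'llm_prompt_execution.status',
  approvalRequest: 'approval.request',
  approvalResponse: 'approval.response',
  shareInbound: 'share.command.inbound',
  browserRequest: 'browser.request',
  browserResponse: 'browser.response',
} as const;

type PaneSubjectKey = keyof typeof PANE_SUFFIXES;

// Every subject lives under rysh.pane.{paneID}.
function paneSubject(paneID: string, key: PaneSubjectKey): string {
  return `rysh.pane.${paneID}.${PANE_SUFFIXES[key]}`;
}

// .output.{shell,rysh,chat} expands to one subject per terminal mode.
function terminalOutputSubjects(paneID: string): string[] {
  return ['shell', 'rysh', 'chat'].map((mode) => `rysh.pane.${paneID}.output.${mode}`);
}
```

Because handlers are keyed by exact subject, building every subject through one function avoids mismatches between publish and subscribe sites.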
8.3 Browser action execution (src/services/browser-executor.ts)
BrowserActionExecutor.execute(req) dispatches via a handler map. Element-targeting actions first inject the selector resolver (unless on a protected chrome:// or about: page). Active-tab resolution queries lastFocusedWindow first (the side panel's own window is not the page window).
sequenceDiagram
participant Agent as Server agent (browser_action tool)
participant Ext as Extension (executor)
participant Page as Target tab (MAIN world)
Agent->>Ext: .browser.request {request_id, action, params}
Ext->>Ext: injectSelectorResolver (if element action)
Ext->>Page: chrome.scripting.executeScript (action fn)
Page-->>Ext: result / error
Ext->>Agent: .browser.response {request_id, success, result|error|screenshot}
The 23 actions:
| Group | Actions |
|---|---|
| Navigation | navigate, back, forward, reload |
| Element interaction | click, type, select, check, hover, press_key, drag_drop |
| Scroll / wait | scroll (default 500px, smooth), wait (readyState or MutationObserver, default 10s) |
| Content extraction | get_text, get_html (each capped at 50000 chars), get_elements (default limit 50), get_value |
| Tabs | get_tabs, switch_tab (by tab_id, index, or url_pattern; focuses the window), new_tab, close_tab |
| Screenshot | screenshot — captureVisibleTab as JPEG q40 (deliberately, to stay under NATS's ~1MB limit) |
| JS execution | execute_js — eval in MAIN world (approval gating is server-side, not enforced in the extension) |
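The handler-map dispatch behind this action table might look roughly like the following. This is a minimal sketch, not the executor's actual code: `ActionDispatcher` and `register` are hypothetical names, the error message format is invented, and the selector-resolver injection step is omitted.

```typescript
// Sketch only: a map from action name to async handler, with failures
// folded into the { request_id, success, ... } result shape.
interface BrowserActionRequest {
  request_id: string;
  action: string;
  params?: Record<string, unknown>;
}

interface BrowserActionResult {
  request_id: string;
  success: boolean;
  result?: unknown;
  error?: string;
  screenshot?: string;
}

type ActionHandler = (params: Record<string, unknown>) => Promise<unknown>;

class ActionDispatcher {
  private handlers = new Map<string, ActionHandler>();

  register(action: string, handler: ActionHandler): this {
    this.handlers.set(action, handler);
    return this;
  }

  async execute(req: BrowserActionRequest): Promise<BrowserActionResult> {
    const handler = this.handlers.get(req.action);
    if (!handler) {
      return { request_id: req.request_id, success: false, error: `unknown action: ${req.action}` };
    }
    try {
      const result = await handler(req.params ?? {});
      return { request_id: req.request_id, success: true, result };
    } catch (e) {
      // A thrown handler becomes a failed result rather than a lost response.
      const message = e instanceof Error ? e.message : String(e);
      return { request_id: req.request_id, success: false, error: message };
    }
  }
}
```

Returning a result for every request, including unknown actions, matters here because the server-side tool call is waiting on a .browser.response with the same request_id.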
The type action uses the React-compatible native value setter so controlled inputs update (clear defaults to true):
const nativeSetter = Object.getOwnPropertyDescriptor(window.HTMLInputElement.prototype, 'value')?.set
|| Object.getOwnPropertyDescriptor(window.HTMLTextAreaElement.prototype, 'value')?.set;
nativeSetter ? nativeSetter.call(el, newValue) : (el.value = newValue);
el.dispatchEvent(new Event('input', { bubbles: true }));
el.dispatchEvent(new Event('change', { bubbles: true }));
screenshot and execute_js cores:
// screenshot — JPEG q40 keeps the base64 payload under NATS's 1MB max (a PNG would be 2-3MB)
const dataUrl = await chrome.tabs.captureVisibleTab(tab.windowId!, { format: 'jpeg', quality: 40 });
return { screenshot: dataUrl.replace(/^data:image\/jpeg;base64,/, ''), ... };
// execute_js — raw eval in the page's MAIN world (a notable security surface)
func: (code: string) => { try { const r = eval(code); return { result: r }; } catch (e:any) { return { error: e.message }; } }
waitForTabLoad listens on chrome.tabs.onUpdated for complete (+300ms settle, 15s cap). BrowserActionResult = { request_id, success, result?, error?, screenshot? }.
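The JPEG q40 choice is easier to reason about with the base64 arithmetic written out. The sketch below assumes that the screenshot response travels through the envelope encoding from §8.2, so the image is base64-encoded twice (once by captureVisibleTab, once for the p field), and that the limit in question is the NATS server default max_payload of 1 MiB. The function names and the 256-byte JSON overhead are illustrative guesses.

```typescript
// Sketch only: estimate the envelope size for a screenshot of a given raw size.
const NATS_DEFAULT_MAX_PAYLOAD = 1_048_576; // NATS server default max_payload (1 MiB)

// Base64 emits 4 output characters for every 3 input bytes, rounded up.
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// First encoding: JPEG bytes -> base64 string inside the JSON payload.
// Second encoding: JSON payload -> base64 string in the envelope's p field.
function estimateEnvelopeBytes(rawJpegBytes: number, jsonOverhead = 256): number {
  return base64Length(base64Length(rawJpegBytes) + jsonOverhead);
}

function fitsNatsPayload(
  rawJpegBytes: number,
  jsonOverhead = 256,
  limit = NATS_DEFAULT_MAX_PAYLOAD,
): boolean {
  return estimateEnvelopeBytes(rawJpegBytes, jsonOverhead) <= limit;
}
```

Under these assumptions the raw image budget is roughly 1 MiB divided by (4/3)², a little under 600 KB, which a 2-3MB PNG cannot meet.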
Selector resolver (src/services/selector-resolver.ts)
Injected once per page; resolves (selector, text?, index) by strategy prefix:
| Prefix | Strategy |
|---|---|
| xpath: or // | document.evaluate |
| text: | direct text-node substring match (prefers shortest/most specific) |
| aria: | [aria-label] / [aria-labelledby] |
| role: | [role=…] |
| testid: | [data-testid=…] |
| (default) | CSS querySelectorAll |
Then optionally filters by text, prefers visible elements, and returns pool[index].
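The prefix routing and the final pick can be sketched without touching the DOM. `parseSelector` and `pickCandidate` are hypothetical names. Trimming the value after the prefix and falling back to the full pool when nothing is visible are assumptions about how "prefers visible elements" behaves.

```typescript
// Sketch only: classify a selector string by its strategy prefix.
type Strategy = 'xpath' | 'text' | 'aria' | 'role' | 'testid' | 'css';

interface ParsedSelector { strategy: Strategy; value: string; }

function parseSelector(selector: string): ParsedSelector {
  const s = selector.trim();
  // A bare XPath starting with // is accepted without the xpath: prefix.
  if (s.startsWith('//')) return { strategy: 'xpath', value: s };
  for (const prefix of ['xpath', 'text', 'aria', 'role', 'testid'] as const) {
    if (s.startsWith(prefix + ':')) {
      return { strategy: prefix, value: s.slice(prefix.length + 1).trim() };
    }
  }
  // Anything else is treated as a CSS selector.
  return { strategy: 'css', value: s };
}

// Prefer visible candidates; if none are visible, fall back to all matches.
function pickCandidate<T>(all: T[], isVisible: (el: T) => boolean, index = 0): T | undefined {
  const visible = all.filter(isVisible);
  const pool = visible.length > 0 ? visible : all;
  return pool[index];
}
```

For instance, `parseSelector('text: Sign in')` gives `{ strategy: 'text', value: 'Sign in' }`, while `parseSelector('button.primary')` falls through to the CSS default.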
8.4 Auth flow
API-key based against rysh-server (not Firebase, despite mirroring its API shape).
graph LR
Setup["SetupScreen: Authorize API Key"] -->|OPEN_AUTH| BG["background opens auth.html"]
BG --> Form["auth.js: server URL + API key"]
Form -->|"probe {url}/health (5s)"| OK{reachable?}
OK -->|yes| Store["chrome.storage.local:
auth_token, auth_user,
auth_time, server_url"]
Store --> App["App.tsx: onAuthStateChanged →
loadServerURL → ensurePane → connected"]
The key is "verified" only by probing /health (reachability, not key validation). All requests send Authorization: Bearer {auth_token}; the WS URL carries ?token=…. Sign-out clears storage and deletes the pane.
Workspace badge (fetchServerInfo, api.ts): after auth, the extension calls GET {serverURL}/api/server-info (Bearer token) to resolve the NATS workspace name, caches it in chrome.storage.local under server_workspace, and renders it as a Header badge — used to tell CLI users which [upstream] workspace= / RYSH_UPSTREAM_WORKSPACE to set.
Settings panel (Header.tsx): a Settings modal (openSettings/saveSettings) lets the user change the server_url at runtime and re-fetch the workspace, independent of the initial auth.js onboarding.
8.5 State & UI
State (src/store.ts, Zustand): input mode (default 'prompt' — "AI mode (extension use-case)", cycleInputMode rotates MODE_CYCLE via modulo), chat messages, streaming accumulator (finalizeStreaming commits the buffer as one assistant Message with crypto.randomUUID()), terminal output buffers, connection (connected, paneID), loading/status, pendingApproval, per-mode input history (newest-first, .slice(0, 200), historyIdx default -1), share state, and browserAction (drives the indicator). escCount/escTimer hold double-ESC state (the 400ms window lives in useKeyboard.ts).
Types (src/types.ts):
export type InputMode = 'shell' | 'prompt' | 'rysh' | 'chat';
export const MODE_PROMPT: Record<InputMode,string> = { shell:'>', prompt:'<', rysh:'##', chat:'@' };
export const MODE_LABELS: Record<InputMode,string> = { shell:'SHELL', prompt:'AI', rysh:'RYSH', chat:'CHAT' };
export const MODE_PLACEHOLDER: Record<InputMode,string> = { /* per-mode input placeholder text */ };
export const MODE_CYCLE: InputMode[] = ['shell','prompt','rysh','chat'];
interface Message { id; role:'user'|'assistant'|'tool'; content; timestamp:Date; mode:InputMode; streaming?; sender? }
interface PendingApproval { requestID; type; description; diff:DiffPayload|null; choices:{label,description}[] }
interface OutputEvent { type: 'text'|'tool_call'|'tool_result'|'diff'|'error'|'shell'|'rysh'|'chat'|'user_prompt'|'browser_action'; content; metadata? }
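Two of the store behaviours described under State, the modulo mode rotation and the capped newest-first history, reduce to small pure functions. The sketch below is illustrative: `nextMode`, `pushHistory`, and `HISTORY_LIMIT` are hypothetical names, and the real store implements these as Zustand actions.

```typescript
// Sketch only: pure versions of cycleInputMode and the history update.
type InputMode = 'shell' | 'prompt' | 'rysh' | 'chat';

const MODE_CYCLE: InputMode[] = ['shell', 'prompt', 'rysh', 'chat'];
const HISTORY_LIMIT = 200;

// Rotate to the next mode, wrapping from the last entry back to the first.
function nextMode(current: InputMode): InputMode {
  const i = MODE_CYCLE.indexOf(current);
  return MODE_CYCLE[(i + 1) % MODE_CYCLE.length];
}

// Newest entry goes first; anything beyond the cap falls off the end.
function pushHistory(history: readonly string[], entry: string): string[] {
  return [entry, ...history].slice(0, HISTORY_LIMIT);
}
```

Starting from the default 'prompt' mode, repeated calls to `nextMode` visit 'rysh', 'chat', 'shell', and then 'prompt' again.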
Components:
| Component | Role |
|---|---|
| App.tsx | resolves auth → ChatScreen or SetupScreen |
| ChatScreen.tsx | the central event router (wires apiService.onOutput/onStatus/onApproval into the store) |
| Header.tsx | logo, status, mode indicator, workspace badge, connection dot; Share / New-chat / Menu (use page context, settings, sign out) |
| PaneInput.tsx | mode-aware input; captures page context in prompt mode; send→cancel while loading |
| PaneOutput.tsx | terminal output (ANSI→HTML) for shell/rysh; message bubbles + streaming for prompt/chat; scroll-lock |
| MessageBubble.tsx | user/assistant/tool bubbles; markdown rendering; ↗ remote: {sender} for shared prompts |
| ApprovalDialog.tsx | modal for .approval.request; diff + choices; Yes / Yes-Always / No / No+reason |
| BrowserActionIndicator.tsx | pulsing "Browser: {action}" banner |
| ModeIndicator, ErrorBanner, LoadingIndicator, DebugOverlay | status/UX helpers (DebugOverlay defaults open; a removable aid) |
useKeyboard.ts: double-ESC within 400ms cycles the input mode. utils/ansi.ts: ANSI→HTML + a mini markdown renderer.
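The double-ESC rule can be modelled as a tiny state machine. Note that this sketch compares timestamps instead of using the escCount/escTimer pair the store holds, so it is a swapped-in technique rather than the hook's actual logic; `onEscape` and `EscState` are hypothetical names.

```typescript
// Sketch only: detect two ESC presses within a 400ms window.
const DOUBLE_ESC_WINDOW_MS = 400;

interface EscState { count: number; lastPressAt: number; }

function onEscape(state: EscState, now: number): { state: EscState; triggered: boolean } {
  const withinWindow = state.count === 1 && now - state.lastPressAt <= DOUBLE_ESC_WINDOW_MS;
  if (withinWindow) {
    // Second press arrived in time: fire and reset.
    return { state: { count: 0, lastPressAt: now }, triggered: true };
  }
  // First press, or the previous one expired: start a new window.
  return { state: { count: 1, lastPressAt: now }, triggered: false };
}
```

A caller would invoke the mode cycle whenever `triggered` is true and otherwise just store the returned state.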
Client-side ## commands (submitInput in api.ts): shell/rysh-mode input starting with ## is handled locally rather than sent to the agent — ##share pane|panegroup|tab, ##share list, ##share status, ##unshare (a stub: "not yet implemented"), and ##help. Non-## shell input returns "No shell available in browser panes" (there is no PTY in a browser pane).
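A parser for these local commands could look like the sketch below. It is an assumption-laden illustration: `parseLocalCommand` and the `LocalCommand` shape are hypothetical, and how submitInput treats a bare ##share or an unrecognized ## command is not stated in the text, so both map to `unknown` here.

```typescript
// Sketch only: classify ## input before deciding whether to handle it locally.
type ShareTarget = 'pane' | 'panegroup' | 'tab';

type LocalCommand =
  | { kind: 'share'; target: ShareTarget }
  | { kind: 'share_list' }
  | { kind: 'share_status' }
  | { kind: 'unshare' }
  | { kind: 'help' }
  | { kind: 'unknown'; raw: string };

// Returns null for input that is not a ## command at all.
function parseLocalCommand(input: string): LocalCommand | null {
  const trimmed = input.trim();
  if (!trimmed.startsWith('##')) return null;
  const [cmd, arg] = trimmed.slice(2).trim().split(/\s+/);
  if (cmd === 'help') return { kind: 'help' };
  if (cmd === 'unshare') return { kind: 'unshare' };
  if (cmd === 'share') {
    if (arg === 'list') return { kind: 'share_list' };
    if (arg === 'status') return { kind: 'share_status' };
    if (arg === 'pane' || arg === 'panegroup' || arg === 'tab') {
      return { kind: 'share', target: arg };
    }
  }
  return { kind: 'unknown', raw: trimmed };
}
```

A `null` result in shell mode would then lead to the "No shell available in browser panes" reply, since there is no PTY to forward the input to.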
8.6 Key end-to-end flows
AI prompt with page context: PaneInput → GET_PAGE_CONTEXT (background injects into active tab) → POST /api/browser-panes/{id}/context + prepend <browser_page>…</browser_page> → MsgAgenticPrompt → server agent runs → streams back on .output (+ .status) → streaming bubble.
Agent-controls-browser loop (headline feature): server LLM calls browser_action → .browser.request → executor runs chrome.*/MAIN-world script → .browser.response → tool result returns to the LLM → loop continues.
Pane share (remote CLI drives the browser): Header Share → POST /api/browser-panes/{id}/share → {share_id, subscribe_cmd} → CLI runs ##upstream subscribe {id} → command arrives on .share.command.inbound → extension captures page context and republishes as MsgAgenticPrompt.
8.7 Caveats
- api.js, nats-client.js, popup.js are dead code (the React TS port supersedes them and uses a different subject namespace).
- The WS token is passed as a URL query param.
- Screenshots are forced to JPEG q40 to stay under NATS's ~1MB limit.
- execute_js runs raw eval in the page MAIN world; approval gating is server-side, not enforced in the extension.
- Auth "verifies" the key only by probing /health.
- DebugOverlay defaults to open and is explicitly flagged as a removable debugging aid.