8. rysh-chrome-plugin — Browser Control & Chat

rysh-chrome-plugin (~3.4K LOC of React/TypeScript in src/; ~5K including the dead root-level .js files — see §8.1) is a Chrome MV3 side-panel extension. It connects to rysh-server's NATS over a WebSocket and does two things: (1) lets the user chat with a server-side AI agent, and (2) lets that agent control the user's browser — navigate, click, type, scroll, screenshot, extract content, manage tabs.


8.1 What runs where

graph TB
    subgraph ext["Chrome extension"]
        SW["Background service worker<br/>(background.js, legacy JS)<br/>• sidePanel behavior<br/>• GET_PAGE_CONTEXT injection<br/>• auth check / sign-out<br/>(NO NATS)"]
        Panel["Side-panel React app<br/>(popup.html → src/main.tsx)<br/>• NATS WebSocket<br/>• chat UI + executor"]
        Inject["Injected MAIN-world scripts<br/>(chrome.scripting, per-action)<br/>• browser-executor<br/>• selector-resolver"]
        AuthTab["auth.html / auth.js<br/>(standalone onboarding)"]
    end
    Panel <-->|NATS over WS| Server["rysh-server<br/>browser_pane_proxy"]
    Panel -->|chrome.scripting.executeScript| Inject
    SW -->|opens| AuthTab
    Panel -->|runtime.sendMessage GET_PAGE_CONTEXT| SW

  • Background service worker (background.js): calls sidePanel.setPanelBehavior({openPanelOnActionClick: true}), opens the auth tab when the user is unauthenticated, and handles a chrome.runtime.onMessage bus (CHECK_AUTH, SIGN_OUT, OPEN_AUTH, GET_PAGE_CONTEXT). It does not touch NATS. GET_PAGE_CONTEXT injects a MAIN-world script that returns page metadata plus the body text, with the body capped at 30000 chars (falling back to tab metadata on protected pages):
func: () => {
  const MAX_BODY = 30000;
  return {
    title: document.title, url: location.href,
    selected: getSelection()?.toString() ?? '',
    description: document.querySelector('meta[name="description"]')?.content ?? '',
    body: (document.body?.innerText || '').substring(0, MAX_BODY).trim(),
  };
}
  • Side-panel React app: the live UI; hosts the persistent NATS WebSocket, the chat, and the BrowserActionExecutor.
  • No persistent content script: all page interaction is dynamic chrome.scripting.executeScript injection into the MAIN world. injectSelectorResolver installs window.__rysh_resolve_selector once per page.

Manifest (manifest_version: 3, name "Rysh AI", version 0.1.0): permissions: [storage, activeTab, scripting, tabs, sidePanel]; host_permissions: ["https://*.rysh.ai/*", "<all_urls>"]; side_panel.default_path: "popup.html"; background: {service_worker: "background.js", type: "module"}; web_accessible_resources: [auth.html]. There is no default_popup on the action — the panel opens via setPanelBehavior({openPanelOnActionClick: true}).

Legacy vs modern code

| Files | Status |
| --- | --- |
| background.js, authService.js, storage.js, auth.html/auth.js/auth.css | ACTIVE: service worker + standalone auth tab (copied to dist/ by Vite) |
| src/ React/TS app | ACTIVE: the live UI |
| api.js, nats-client.js, popup.js | DEAD LEGACY: not copied, not referenced; superseded by src/services/*. They even use an old subject namespace (agentic vs the current llm_prompt_execution). |

8.2 The NATS client (src/services/nats-client.ts)

A thin WebSocket wrapper speaking the Rysh envelope protocol (NOT raw NATS — the server's browser_pane_proxy bridges WS↔NATS).

Wire frame: JSON { subject, data: NATSEnvelope } where the TS envelope uses short keys and a base64-JSON payload:

interface NATSEnvelope { t: string; r: string; p: string; } // p = base64(JSON(payload))

static encode(typeTag: string, payload: Record<string, unknown>): NATSEnvelope {
  const bytes = new TextEncoder().encode(JSON.stringify(payload)); // UTF-8 bytes
  let binary = ''; bytes.forEach(b => { binary += String.fromCharCode(b); });
  return { t: typeTag, r: '', p: btoa(binary) };                   // UTF-8-safe base64
}
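
The receiving side has to undo both steps of encode. The following is a minimal sketch of such a decoder, not the extension's actual code: the decode function and the prompt payload field are hypothetical, and encode is repeated only so the example runs on its own.

```typescript
interface NATSEnvelope { t: string; r: string; p: string; }

// Repeated from the snippet above so this example is self-contained.
function encode(typeTag: string, payload: Record<string, unknown>): NATSEnvelope {
  const bytes = new TextEncoder().encode(JSON.stringify(payload));
  let binary = '';
  bytes.forEach((b) => { binary += String.fromCharCode(b); });
  return { t: typeTag, r: '', p: btoa(binary) };
}

// Hypothetical inverse: base64 -> binary string -> UTF-8 bytes -> JSON.
function decode(env: NATSEnvelope): Record<string, unknown> {
  const binary = atob(env.p);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i); // every char is <= 0xFF after atob
  }
  return JSON.parse(new TextDecoder().decode(bytes)) as Record<string, unknown>;
}

// Round trip with non-Latin-1 text; "prompt" is an illustrative field name.
const roundTrip = decode(encode('MsgAgenticPrompt', { prompt: 'résumé ✓ 日本語' }));
```

Decoding through TextDecoder rather than using the atob result directly matters for the same reason as on the encode side: the binary string holds UTF-8 bytes, not characters.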

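The TextEncoder → btoa step keeps encode safe for Unicode page text: btoa only accepts code points up to U+00FF, so a plain btoa(JSON.stringify(payload)) throws an InvalidCharacterError as soon as the payload contains anything outside Latin-1. Auto-reconnect starts at _baseReconnectDelay = 1000ms and doubles up to a 30000ms cap, for at most _maxReconnectAttempts = 20 attempts; handlers (keyed by exact subject) survive a reconnect. publish does not throw if the socket isn't OPEN: it logs a warning and drops the message.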
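
The reconnect schedule follows directly from those three constants. A minimal sketch, assuming a 0-based attempt counter whose first retry waits the base delay (the source may count differently); reconnectDelay and the constant names are illustrative, not taken from nats-client.ts:

```typescript
const BASE_RECONNECT_DELAY_MS = 1000;  // _baseReconnectDelay
const MAX_RECONNECT_DELAY_MS = 30000;  // cap
const MAX_RECONNECT_ATTEMPTS = 20;     // _maxReconnectAttempts

// Returns the wait before the given attempt, or null once attempts are exhausted.
// Assumption: attempt is 0-based and the first retry waits the base delay.
function reconnectDelay(attempt: number): number | null {
  if (attempt >= MAX_RECONNECT_ATTEMPTS) return null;
  return Math.min(BASE_RECONNECT_DELAY_MS * 2 ** attempt, MAX_RECONNECT_DELAY_MS);
}

// First seven delays: 1000, 2000, 4000, 8000, 16000, 30000, 30000
const schedule = Array.from({ length: 7 }, (_, i) => reconnectDelay(i));
```

Under this reading the cap is reached on the sixth attempt, so most of the 20 attempts wait the full 30 seconds.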

Bootstrap (api.ts): POST {serverURL}/api/browser-panes with Authorization: Bearer {token} → { pane_id, ws_url }; the WS URL swaps https→wss and appends ?token=… (token auth via query param).
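
The URL rewrite in the bootstrap step can be sketched as a small pure function. This assumes ws_url comes back as an absolute http(s) URL; toWebSocketURL and the example host and path are hypothetical.

```typescript
// Hypothetical helper: turn the ws_url from the bootstrap response into the
// URL handed to `new WebSocket(...)`.
function toWebSocketURL(wsUrl: string, token: string): string {
  // https:// -> wss:// (and http:// -> ws:// for local development)
  const url = new URL(wsUrl.replace(/^http/, 'ws'));
  // searchParams handles percent-encoding of the token.
  url.searchParams.set('token', token);
  return url.toString();
}

// Illustrative values only; the real path is whatever the server returns.
const socketURL = toWebSocketURL('https://rysh.example/ws/pane-1', 'tok123');
```

Because the token ends up in the URL, it can appear in proxy and server access logs, which is why this is listed under the caveats in §8.7.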

Subjects (all prefixed with rysh.pane.{paneID})

| Publishes (extension → server) | Tag |
| --- | --- |
| .llm_prompt_execution.inbox | MsgAgenticPrompt |
| .llm_prompt_execution.inbox | MsgAgenticCancel |
| .approval.response | MsgApprovalResponse |
| .browser.response | MsgBrowserActionResponse |

| Subscribes (server → extension) | Purpose |
| --- | --- |
| .llm_prompt_execution.output | streaming output {type, content, metadata} |
| .llm_prompt_execution.status | {phase, iteration, max_iterations} |
| .approval.request | {request_id, type, description, diff, choices} |
| .output.{shell,rysh,chat} | per-mode terminal text |
| .share.command.inbound | a remote CLI sent a command |
| .browser.request | the agent's browser-control requests |

These map field-for-field to rysh-shared/msg (MsgBrowserActionRequest/Response, etc.).
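
To make the subject scheme and exact-subject handler lookup concrete, here is a minimal sketch of a router over incoming wire frames. SubjectRouter, paneSubject, and the sample envelope are illustrative names and values, not the client's real API.

```typescript
interface NATSEnvelope { t: string; r: string; p: string; }
interface WireFrame { subject: string; data: NATSEnvelope; }
type Handler = (env: NATSEnvelope) => void;

// Builds a full subject from a pane ID and a suffix such as ".browser.request".
function paneSubject(paneID: string, suffix: string): string {
  return `rysh.pane.${paneID}${suffix}`;
}

// Handlers are keyed by exact subject: no wildcard matching is attempted.
class SubjectRouter {
  private handlers = new Map<string, Handler[]>();

  subscribe(subject: string, handler: Handler): void {
    const list = this.handlers.get(subject) ?? [];
    list.push(handler);
    this.handlers.set(subject, list);
  }

  // Takes the raw WebSocket text frame; returns false if nobody is subscribed.
  dispatch(rawFrame: string): boolean {
    const frame = JSON.parse(rawFrame) as WireFrame;
    const list = this.handlers.get(frame.subject);
    if (!list) return false;
    list.forEach((h) => h(frame.data));
    return true;
  }
}

const router = new SubjectRouter();
const seen: string[] = [];
router.subscribe(paneSubject('p1', '.browser.request'), (env) => seen.push(env.t));
```

Keeping the handler map separate from the socket is what would let subscriptions survive a reconnect: a new WebSocket only needs to feed its frames into the same router.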


8.3 Browser action execution (src/services/browser-executor.ts)

BrowserActionExecutor.execute(req) dispatches via a handler map. Element-targeting actions first inject the selector resolver (unless the tab is on a protected chrome:// or about: page). Active-tab resolution queries lastFocusedWindow first, because the side panel's own window is not the page window.

sequenceDiagram
    participant Agent as Server agent (browser_action tool)
    participant Ext as Extension (executor)
    participant Page as Target tab (MAIN world)
    Agent->>Ext: .browser.request {request_id, action, params}
    Ext->>Ext: injectSelectorResolver (if element action)
    Ext->>Page: chrome.scripting.executeScript (action fn)
    Page-->>Ext: result / error
    Ext->>Agent: .browser.response {request_id, success, result|error|screenshot}
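
The dispatch step in the diagram can be sketched as a map lookup wrapped in error handling, so every request yields exactly one response. The stub handlers, the params shape, and the error strings below are assumptions for illustration; the real handlers call chrome.* APIs.

```typescript
interface BrowserActionRequest {
  request_id: string;
  action: string;
  params: Record<string, unknown>;
}
interface BrowserActionResult {
  request_id: string;
  success: boolean;
  result?: unknown;
  error?: string;
  screenshot?: string;
}
type ActionHandler = (params: Record<string, unknown>) => Promise<unknown>;

// Stubs standing in for the real chrome.tabs / chrome.scripting calls.
const handlers = new Map<string, ActionHandler>([
  ['navigate', async (p) => ({ url: String(p.url) })],
  ['click', async (p) => {
    if (typeof p.selector !== 'string') throw new Error('selector is required');
    return { clicked: p.selector };
  }],
]);

async function execute(req: BrowserActionRequest): Promise<BrowserActionResult> {
  const handler = handlers.get(req.action);
  if (!handler) {
    return { request_id: req.request_id, success: false, error: `unknown action: ${req.action}` };
  }
  try {
    return { request_id: req.request_id, success: true, result: await handler(req.params) };
  } catch (e) {
    // A thrown handler error becomes a failed response, never an unanswered request.
    const message = e instanceof Error ? e.message : String(e);
    return { request_id: req.request_id, success: false, error: message };
  }
}
```

Echoing request_id in every branch is what lets the server-side tool call match a response to its request.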

The 23 actions:

| Group | Actions |
| --- | --- |
| Navigation | navigate, back, forward, reload |
| Element interaction | click, type, select, check, hover, press_key, drag_drop |
| Scroll / wait | scroll (default 500px, smooth), wait (readyState or MutationObserver, default 10s) |
| Content extraction | get_text, get_html (each capped at 50000 chars), get_elements (default limit 50), get_value |
| Tabs | get_tabs, switch_tab (by tab_id, index, or url_pattern; focuses the window), new_tab, close_tab |
| Screenshot | screenshot → captureVisibleTab as JPEG q40 (deliberately, to stay under NATS's ~1MB limit) |
| JS execution | execute_js → eval in MAIN world (approval gating is server-side, not enforced in the extension) |
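
As an illustration of how switch_tab might choose its target from the three accepted parameters, here is a minimal sketch over a plain tab list. The precedence order (tab_id, then index, then url_pattern) and substring matching for url_pattern are assumptions; pickTab and TabInfo are hypothetical names.

```typescript
interface TabInfo { id: number; index: number; url: string; windowId: number; }
interface SwitchTabParams { tab_id?: number; index?: number; url_pattern?: string; }

// Assumed precedence: an explicit tab_id wins, then index, then url_pattern.
function pickTab(tabs: TabInfo[], p: SwitchTabParams): TabInfo | undefined {
  if (p.tab_id !== undefined) return tabs.find((t) => t.id === p.tab_id);
  if (p.index !== undefined) return tabs.find((t) => t.index === p.index);
  if (p.url_pattern !== undefined) {
    const pattern = p.url_pattern;
    // Assumption: url_pattern is matched as a plain substring of the tab URL.
    return tabs.find((t) => t.url.includes(pattern));
  }
  return undefined;
}

const tabs: TabInfo[] = [
  { id: 11, index: 0, url: 'https://example.com/', windowId: 1 },
  { id: 12, index: 1, url: 'https://docs.example.com/guide', windowId: 1 },
];
```

The returned windowId is what the executor would need in order to focus the window as well as activate the tab.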

The type action uses the React-compatible native value setter so controlled inputs update (clear defaults to true):

const nativeSetter = Object.getOwnPropertyDescriptor(window.HTMLInputElement.prototype, 'value')?.set
                  || Object.getOwnPropertyDescriptor(window.HTMLTextAreaElement.prototype, 'value')?.set;
nativeSetter ? nativeSetter.call(el, newValue) : (el.value = newValue);
el.dispatchEvent(new Event('input',  { bubbles: true }));
el.dispatchEvent(new Event('change', { bubbles: true }));

screenshot and execute_js cores:

// screenshot — JPEG q40 keeps the base64 payload under NATS's 1MB max (a PNG would be 2-3MB)
const dataUrl = await chrome.tabs.captureVisibleTab(tab.windowId!, { format: 'jpeg', quality: 40 });
return { screenshot: dataUrl.replace(/^data:image\/jpeg;base64,/, ''), ... };

// execute_js — raw eval in the page's MAIN world (a notable security surface)
func: (code: string) => {
  try { return { result: eval(code) }; }
  catch (e: any) { return { error: e.message }; }
}
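
The size budget behind the JPEG q40 choice can be worked out with a little arithmetic. The sketch below assumes the response travels in the §8.2 envelope, so the already-base64 screenshot is base64-encoded a second time inside p, and it assumes a 1 MiB message limit (the NATS default max_payload). The helper names and the 512-byte allowance for other JSON fields are hypothetical.

```typescript
const NATS_MAX_PAYLOAD_BYTES = 1024 * 1024; // assumed limit: NATS default max_payload

function stripJpegDataUrlPrefix(dataUrl: string): string {
  return dataUrl.replace(/^data:image\/jpeg;base64,/, '');
}

// base64 emits 4 output characters for every 3 input bytes, rounded up.
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// Size of the envelope's p field for a screenshot of rawJpegBytes:
// base64(JSON({ ..., screenshot: base64(jpeg) })).
function envelopePayloadSize(rawJpegBytes: number, jsonOverheadBytes = 512): number {
  return base64Length(base64Length(rawJpegBytes) + jsonOverheadBytes);
}

function fitsUnderNatsLimit(rawJpegBytes: number): boolean {
  return envelopePayloadSize(rawJpegBytes) <= NATS_MAX_PAYLOAD_BYTES;
}
```

Under these assumptions the two encoding passes inflate the image by about 16/9, so the raw JPEG has to stay a little under 576 KiB, and a multi-megabyte PNG cannot fit.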

waitForTabLoad listens on chrome.tabs.onUpdated for complete (+300ms settle, 15s cap). BrowserActionResult = { request_id, success, result?, error?, screenshot? }.

Selector resolver (src/services/selector-resolver.ts)

Injected once per page; resolves (selector, text?, index) by strategy prefix:

| Prefix | Strategy |
| --- | --- |
| xpath: or // | document.evaluate |
| text: | direct text-node substring match (prefers shortest/most specific) |
| aria: | [aria-label] / [aria-labelledby] |
| role: | [role=…] |
| testid: | [data-testid=…] |
| (default) | CSS querySelectorAll |

Then optionally filters by text, prefers visible elements, and returns pool[index].
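
The prefix table amounts to a small parsing step before any DOM lookup. A minimal sketch of that step alone, with no DOM access; parseSelector and the Strategy type are illustrative names, and trimming the value after the prefix is an assumption.

```typescript
type Strategy = 'xpath' | 'text' | 'aria' | 'role' | 'testid' | 'css';
interface ParsedSelector { strategy: Strategy; value: string; }

const PREFIXED: Strategy[] = ['xpath', 'text', 'aria', 'role', 'testid'];

function parseSelector(selector: string): ParsedSelector {
  // A bare "//..." is treated as XPath and passed through unchanged.
  if (selector.startsWith('//')) return { strategy: 'xpath', value: selector };
  for (const strategy of PREFIXED) {
    const prefix = `${strategy}:`;
    if (selector.startsWith(prefix)) {
      return { strategy, value: selector.slice(prefix.length).trim() };
    }
  }
  // Anything else falls through to CSS querySelectorAll.
  return { strategy: 'css', value: selector };
}
```

CSS selectors that contain a colon, such as a:hover, still fall through to the default because only the five known prefixes are checked.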


8.4 Auth flow

API-key based against rysh-server (not Firebase, despite mirroring its API shape).

graph LR
    Setup["SetupScreen: Authorize API Key"] -->|OPEN_AUTH| BG["background opens auth.html"]
    BG --> Form["auth.js: server URL + API key"]
    Form -->|"probe {url}/health (5s)"| OK{reachable?}
    OK -->|yes| Store["chrome.storage.local:<br/>auth_token, auth_user,<br/>auth_time, server_url"]
    Store --> App["App.tsx: onAuthStateChanged →<br/>loadServerURL → ensurePane → connected"]

The key is "verified" only by probing /health (reachability, not key validation). All requests send Authorization: Bearer {auth_token}; the WS URL carries ?token=…. Sign-out clears storage and deletes the pane.

Workspace badge (fetchServerInfo, api.ts): after auth, the extension calls GET {serverURL}/api/server-info (Bearer token) to resolve the NATS workspace name, caches it in chrome.storage.local under server_workspace, and renders it as a Header badge — used to tell CLI users which [upstream] workspace= / RYSH_UPSTREAM_WORKSPACE to set.

Settings panel (Header.tsx): a Settings modal (openSettings/saveSettings) lets the user change the server_url at runtime and re-fetch the workspace, independent of the initial auth.js onboarding.


8.5 State & UI

State (src/store.ts, Zustand): input mode (default 'prompt' — "AI mode (extension use-case)", cycleInputMode rotates MODE_CYCLE via modulo), chat messages, streaming accumulator (finalizeStreaming commits the buffer as one assistant Message with crypto.randomUUID()), terminal output buffers, connection (connected, paneID), loading/status, pendingApproval, per-mode input history (newest-first, .slice(0, 200), historyIdx default -1), share state, and browserAction (drives the indicator). escCount/escTimer hold double-ESC state (the 400ms window lives in useKeyboard.ts).

Types (src/types.ts):

export type InputMode = 'shell' | 'prompt' | 'rysh' | 'chat';
export const MODE_PROMPT: Record<InputMode,string>  = { shell:'>', prompt:'<', rysh:'##', chat:'@' };
export const MODE_LABELS: Record<InputMode,string>  = { shell:'SHELL', prompt:'AI', rysh:'RYSH', chat:'CHAT' };
export const MODE_PLACEHOLDER: Record<InputMode,string> = { /* per-mode input placeholder text */ };
export const MODE_CYCLE: InputMode[] = ['shell','prompt','rysh','chat'];

interface Message       { id; role:'user'|'assistant'|'tool'; content; timestamp:Date; mode:InputMode; streaming?; sender? }
interface PendingApproval { requestID; type; description; diff:DiffPayload|null; choices:{label,description}[] }
interface OutputEvent   { type: 'text'|'tool_call'|'tool_result'|'diff'|'error'|'shell'|'rysh'|'chat'|'user_prompt'|'browser_action'; content; metadata? }
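
Two of the store behaviours described above, the modulo mode rotation and the capped newest-first history, reduce to small pure functions. This is a sketch of the logic only, without Zustand; nextMode and pushHistory are hypothetical names.

```typescript
type InputMode = 'shell' | 'prompt' | 'rysh' | 'chat';
const MODE_CYCLE: InputMode[] = ['shell', 'prompt', 'rysh', 'chat'];
const HISTORY_LIMIT = 200;

// What cycleInputMode is described as doing: step through MODE_CYCLE with wrap-around.
function nextMode(current: InputMode): InputMode {
  const i = MODE_CYCLE.indexOf(current);
  return MODE_CYCLE[(i + 1) % MODE_CYCLE.length];
}

// Newest-first history: prepend, then keep the first HISTORY_LIMIT entries.
function pushHistory(history: string[], entry: string): string[] {
  return [entry, ...history].slice(0, HISTORY_LIMIT);
}

const afterDefault = nextMode('prompt'); // 'rysh'
const wrapped = nextMode('chat');        // 'shell'
```

Returning a new array from pushHistory rather than mutating in place matches how store updates are usually written so that subscribers see a changed reference.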

Components:

| Component | Role |
| --- | --- |
| App.tsx | resolves auth → ChatScreen or SetupScreen |
| ChatScreen.tsx | the central event router (wires apiService.onOutput/onStatus/onApproval into the store) |
| Header.tsx | logo, status, mode indicator, workspace badge, connection dot; Share / New-chat / Menu (use page context, settings, sign out) |
| PaneInput.tsx | mode-aware input; captures page context in prompt mode; send→cancel while loading |
| PaneOutput.tsx | terminal output (ANSI→HTML) for shell/rysh; message bubbles + streaming for prompt/chat; scroll-lock |
| MessageBubble.tsx | user/assistant/tool bubbles; markdown rendering; ↗ remote: {sender} for shared prompts |
| ApprovalDialog.tsx | modal for .approval.request; diff + choices; Yes / Yes-Always / No / No+reason |
| BrowserActionIndicator.tsx | pulsing "Browser: {action}" banner |
| ModeIndicator, ErrorBanner, LoadingIndicator, DebugOverlay | status/UX helpers (DebugOverlay defaults to open; a removable aid) |

useKeyboard.ts: double-ESC within 400ms cycles the input mode. utils/ansi.ts: ANSI→HTML + a mini markdown renderer.
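
The double-ESC rule can be illustrated with a timestamp-based detector. Note that this sketch swaps in a different mechanism from the one described: the store keeps escCount/escTimer, whereas this version only compares timestamps, which makes it easy to test without timers. createDoubleEscDetector is a hypothetical name.

```typescript
const DOUBLE_ESC_WINDOW_MS = 400;

// Returns a function to call on every ESC keydown with the current time in ms.
// It reports true (and fires onDouble) when two presses land within the window.
function createDoubleEscDetector(onDouble: () => void): (now: number) => boolean {
  let lastEscAt: number | null = null;
  return (now: number): boolean => {
    if (lastEscAt !== null && now - lastEscAt <= DOUBLE_ESC_WINDOW_MS) {
      lastEscAt = null; // reset so a third press starts a fresh sequence
      onDouble();
      return true;
    }
    lastEscAt = now;
    return false;
  };
}

let modeCycles = 0;
const onEsc = createDoubleEscDetector(() => { modeCycles += 1; });
```

In a key handler this would be called as onEsc(Date.now()) or with the event's timestamp, with onDouble wired to the store's cycleInputMode.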

Client-side ## commands (submitInput in api.ts): shell/rysh-mode input starting with ## is handled locally rather than sent to the agent — ##share pane|panegroup|tab, ##share list, ##share status, ##unshare (a stub: "not yet implemented"), and ##help. Non-## shell input returns "No shell available in browser panes" (there is no PTY in a browser pane).
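
The local command set is small enough to sketch as a parser that returns a tagged result. This only illustrates the routing decision described above; parseLocalCommand and the LocalCommand shape are hypothetical, and the handling of malformed input (the unknown case) is an assumption.

```typescript
type ShareTarget = 'pane' | 'panegroup' | 'tab';
type LocalCommand =
  | { kind: 'share'; target: ShareTarget }
  | { kind: 'share_list' }
  | { kind: 'share_status' }
  | { kind: 'unshare' }   // a stub in the extension: "not yet implemented"
  | { kind: 'help' }
  | { kind: 'unknown'; raw: string };

// Returns null for input that is not a ## command at all.
function parseLocalCommand(input: string): LocalCommand | null {
  const trimmed = input.trim();
  if (!trimmed.startsWith('##')) return null;
  const [name, arg] = trimmed.slice(2).trim().split(/\s+/);
  if (name === 'help') return { kind: 'help' };
  if (name === 'unshare') return { kind: 'unshare' };
  if (name === 'share') {
    if (arg === 'list') return { kind: 'share_list' };
    if (arg === 'status') return { kind: 'share_status' };
    if (arg === 'pane' || arg === 'panegroup' || arg === 'tab') {
      return { kind: 'share', target: arg };
    }
  }
  return { kind: 'unknown', raw: trimmed };
}
```

A null result corresponds to the non-## path, which in shell mode produces the "No shell available in browser panes" message.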


8.6 Key end-to-end flows

AI prompt with page context: PaneInput → GET_PAGE_CONTEXT (background injects into active tab) → POST /api/browser-panes/{id}/context + prepend <browser_page>…</browser_page> → MsgAgenticPrompt → server agent runs → streams back on .output (+ .status) → streaming bubble.

Agent-controls-browser loop (headline feature): server LLM calls browser_action → .browser.request → executor runs chrome.*/MAIN-world script → .browser.response → tool result returns to the LLM → loop continues.

Pane share (remote CLI drives the browser): Header Share → POST /api/browser-panes/{id}/share → {share_id, subscribe_cmd} → CLI runs ##upstream subscribe {id} → command arrives on .share.command.inbound → extension captures page context and republishes as MsgAgenticPrompt.


8.7 Caveats

  • api.js, nats-client.js, popup.js are dead code (the React TS port supersedes them and uses a different subject namespace).
  • The WS token is passed as a URL query param.
  • Screenshots are forced to JPEG q40 to stay under NATS's ~1MB limit.
  • execute_js runs raw eval in the page MAIN world; approval gating is server-side, not enforced in the extension.
  • Auth "verifies" the key only by probing /health.
  • DebugOverlay defaults to open and is explicitly flagged as a removable debugging aid.