SOP-OPS-01: Operations Self-Service Actions v1.2

Every operationally recoverable alert gets a safe portal action. Infrastructure and architecture alerts are explicitly classified as founder-only.

Document ID: SOP-OPS-01-v1.2
Version: 1.2
Status: ✅ Specification Ready (for current Ops Portal architecture)
Last Updated: 2026-04-12
Owner: Protocol Raw Operations
Replaces: SOP-OPS-01-v1.1
Review Date: 2026-07-12

Auth Note: Auth pattern follows established Ops Portal perimeter model (Cloudflare Access + browser-side service_role client). Actor email resolved and logged server-side for audit traceability. Fine-grained role gating deferred to Phase B when multiple operators are added.


Key Changes in v1.2

  • ✅ Auth section reworded: honest about existing perimeter model, not claiming upgraded trust
  • ✅ Dashboard contract completed: support tile added, SLO metric added to order pipeline
  • ✅ Lab entry idempotency fixed: unique on batch_id regardless of entry method
  • ✅ Lab entry hard-blocks PASS when pathogen detected (no longer a warning)
  • ✅ Worker Controls allowlist aligned with spec table (webhook processor + lifecycle worker added)
  • ✅ Proof job recovery panel restored (dropped in v1.1 by document drift)
  • ✅ All 5 TBD compliance RPCs completed with insert functions
  • ✅ "All RPC wrappers complete" claim is now true
  • ✅ Status line reflects scope: "for current Ops Portal architecture"

Key Changes in v1.1

  • ✅ Governing principle rewritten: honest about founder-only alerts
  • ✅ Auth model: Cloudflare Access JWT → agent_id, no user-supplied actor fields
  • ✅ Every action has an explicit execution path (what consumer picks up the state change)
  • ✅ Manual lab entry redesigned: lab verdict first, idempotency guard, full micro inputs
  • ✅ Bulk retry actions replaced with safe controls (preview, select, cap, confirm)
  • ✅ Missing-event recovery added (replay order, run worker now)
  • ✅ Dashboard function rewritten to match all tiles, cron health from registry
  • ✅ Broken SQL fixed (webhook retry, outbox source/function mismatch)
  • ✅ Rejected batch disposition form added
  • ✅ UI language guidelines added (human-first, technical in expandable detail)

Purpose

The Operations Lead (joining Month 7-9) must be able to run day-to-day operations, resolve operationally recoverable alerts, and handle exceptions entirely from the Operations Portal.

Two classes of alert exist:

| Class | Examples | Resolution |
| --- | --- | --- |
| Operator-recoverable | Failed webhook, stuck email, quarantined lab result, dead-lettered refund, inventory discrepancy, fraud-flagged referral | Portal action. No founder involvement. |
| Founder-only (infrastructure) | Canary system health failure, Edge Function crash/redeploy needed, database schema issue, Supabase Storage failure, repeated cron job failures after operator retry | Ops Lead escalates to founder with context from portal diagnostics. |

The Ops Lead should never need to access the Supabase dashboard, run SQL, use a terminal, or deploy code. The founder's involvement is reserved for architecture changes, code deployments, and infrastructure failures.


Scope

Applies to: Operations Portal (ops.protocolraw.co.uk), Portal v4.1+

New Modules:

| Module | Tab | Purpose |
| --- | --- | --- |
| System Health Dashboard | System Health (new) | Morning check, all-system visibility |
| Exception Queue | System Health | Item-level actions for recoverable failures |
| Worker Controls | System Health | Run-now buttons for stalled background workers |
| Cron Job Status | System Health | Visibility into scheduled job health |
| Canned Diagnostics | System Health | Pre-built read-only views |
| Compliance Recording | Compliance (new) | Temperature audits, evidence packs, spot checks, sample logs, batch disposition |

Related SOPs: MON-01, LAB-01, INV-01, REF-01, ORD-01, ORD-02, ORD-03, SUB-00, CS-04, DLV-01, COLD-01, PACK-01, PROOF-00, PROOF-01


Part 1: Authentication Model

Existing Portal Pattern

The Ops Portal has used the same trust model since v2.0: Cloudflare Access (Google OAuth) as the perimeter gate, with browser-side Supabase service_role client for database calls. Every existing feature (batch creation, refund processing, support workstation, live chat) operates within this boundary.

SOP-OPS-01 inherits this pattern. It does not introduce a new auth architecture.

What this SOP improves over the existing portal pattern: Actor identity is now resolved server-side and logged on every mutation. Previously, portal actions either didn't log who performed them or used ad-hoc identifiers. The new modules ensure every action is attributable to a specific agent via the agents table.

What this does not do: This is not cryptographic caller verification. The browser sends the agent email extracted from the Cloudflare Access JWT, and the database resolves it. A caller who somehow bypasses Cloudflare Access could spoof the email. The compensating control is that Cloudflare Access is the sole entry point, and only the founder can add/remove authorised emails.

Phase B enhancement (multi-operator): When multiple operators are added, introduce role-based action gating (e.g., ops_lead can retry webhooks but not adjust inventory without ops_senior role). Not needed with a single operator.

  1. Browser (ops.protocolraw.co.uk)
  2. Cloudflare Access (Google OAuth, validates email)
  3. Portal JavaScript (reads CF Access JWT, extracts email)
  4. Supabase RPC call (service_role key, agent email passed as function argument)
  5. Public RPC wrapper resolves email → agent_id via fn_resolve_ops_agent_v1
  6. Domain function receives trusted agent_id UUID
  7. All audit logging uses resolved agent_id

Agent Resolution Function

CREATE OR REPLACE FUNCTION raw_ops.fn_resolve_ops_agent_v1(p_email TEXT)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
  v_agent RECORD;
BEGIN
  SELECT id, display_name, role INTO v_agent
  FROM agents
  WHERE email = lower(trim(p_email)) AND active = true;

  IF NOT FOUND THEN
    RAISE EXCEPTION 'Agent not found or inactive: %', p_email
      USING ERRCODE = 'P0002';
  END IF;

  RETURN jsonb_build_object(
    'agent_id', v_agent.id,
    'display_name', v_agent.display_name,
    'role', v_agent.role
  );
END;
$$;

Portal-Side: Extracting Agent Email

The Ops Portal already loads behind Cloudflare Access. The CF Access JWT is available as a cookie (CF_Authorization). The portal JavaScript decodes the JWT to extract the authenticated email:

// In system-health.js module init
function getAgentEmail() {
  const token = document.cookie
    .split('; ')
    .find(c => c.startsWith('CF_Authorization='))
    ?.split('=')[1];
  if (!token) return null;
  try {
    // CF Access JWTs are base64url-encoded; normalise before atob()
    const b64 = token.split('.')[1].replace(/-/g, '+').replace(/_/g, '/');
    const payload = JSON.parse(atob(b64));
    return payload.email;
  } catch { return null; }
}

RPC Wrapper Pattern (v1.2)

Every mutation RPC accepts the agent email, resolves it server-side, and uses the resolved agent_id for audit logging. The email comes from the Cloudflare Access JWT, not from user input.

-- Example: all v1.2 mutation RPCs follow this pattern
CREATE OR REPLACE FUNCTION public.ops_retry_lab_quarantine(p_failure_id UUID, p_agent_email TEXT)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
  v_agent JSONB;
BEGIN
  -- Resolve agent (throws if not found/inactive)
  v_agent := raw_ops.fn_resolve_ops_agent_v1(p_agent_email);

  -- Delegate to domain function with trusted agent_id
  RETURN raw_ops.fn_retry_lab_quarantine_v1(p_failure_id, (v_agent->>'agent_id')::UUID);
END;
$$;

GRANT EXECUTE ON FUNCTION public.ops_retry_lab_quarantine(UUID, TEXT) TO service_role;
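On the portal side, invoking a wrapper might look like the following sketch. The helper name and `client` parameter are hypothetical; `client` stands in for any object exposing an `rpc(name, args)` method, such as the portal's existing supabase-js instance. The RPC and argument names match the wrapper above.

```javascript
// Hypothetical portal-side helper: pair the CF Access email with a v1.2
// mutation RPC. Refuses to call the RPC if no agent email was decoded.
async function retryLabQuarantine(client, failureId, agentEmail) {
  if (!agentEmail) {
    // No CF_Authorization cookie decoded — refuse rather than send null
    return { success: false, error: 'Not signed in via Cloudflare Access' };
  }
  const { data, error } = await client.rpc('ops_retry_lab_quarantine', {
    p_failure_id: failureId,
    p_agent_email: agentEmail,
  });
  if (error) return { success: false, error: error.message };
  return data; // JSONB result from the wrapper, e.g. { success: true, ... }
}
```

Because the agent is resolved server-side, the portal never needs to know the agent_id; it only forwards the JWT email.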

Key Constraints

  1. No function accepts p_entered_by TEXT or similar free-text actor fields. All actor identity is resolved from agent email via fn_resolve_ops_agent_v1.
  2. All domain functions accept p_agent_id UUID, not email or display name.
  3. The agents table is the source of truth. Daniel's record is created before he starts. Only the founder can add/remove agents.
  4. Read-only functions (dashboard, diagnostics, cron health) do not require agent resolution.

Part 2: System Health Dashboard

Purpose

Single screen the Ops Lead checks every morning. Shows the health of every automated system with green/amber/red indicators. Clicking any non-green tile opens the relevant Exception Queue panel.

Health Tiles

| # | Tile | Green | Amber | Red |
| --- | --- | --- | --- | --- |
| 1 | Order Pipeline | 0 failed webhooks, SLO met | 1-5 failed in 24h | >5 failed OR SLO breach |
| 2 | Email Delivery | 0 stuck, <10 pending | 1-10 stuck OR 10-50 pending | >10 stuck OR >50 pending |
| 3 | Inventory | All SKUs above reorder | Any SKU in LOW band | Any SKU CRITICAL |
| 4 | Batch Testing | 0 batches >3 days in QA_HOLD | 1+ batch >3 days | 1+ batch >5 days |
| 5 | Lab Processing | 0 quarantined emails | 1-2 quarantined | 3+ quarantined |
| 6 | Refunds | 0 failed or stuck | 1-3 retrying | Any stuck >1h or dead-lettered |
| 7 | Support | 0 unassigned >1h | 1-3 waiting | >3 waiting >1h |
| 8 | Subscriptions | 0 failed Seal webhooks | 1-5 failed | >5 failed |
| 9 | Payments | No dunning stalls | Day 7+ without email | Day 10+ (imminent pause) |
| 10 | Deliveries | <2% exception rate (7d) | 2-5% exceptions | >5% exceptions |
| 11 | Referrals | 0 fraud-flagged | 1-3 flagged | >3 flagged |
| 12 | Proof Pages | Last check passed | Last check >1h ago | Last check failed |
| 13 | Background Jobs | All critical jobs healthy | 1+ stale (missed expected interval) | 1+ overdue (missed 2x expected interval), failed, missing, or never run |
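The tile colour rules are pure threshold logic over the dashboard JSON. As an illustrative sketch (function names are hypothetical; thresholds and field names come from the table above and the dashboard function below), two tiles might be computed like this:

```javascript
// Sketch: Order Pipeline (tile 1) and Lab Processing (tile 5) colour logic,
// using the thresholds in the table above.
function orderPipelineStatus({ failed_24h, slo_breach }) {
  if (failed_24h > 5 || slo_breach) return 'red';  // >5 failed OR SLO breach
  if (failed_24h >= 1) return 'amber';             // 1-5 failed in 24h
  return 'green';                                  // 0 failed, SLO met
}

function labProcessingStatus({ pending_review }) {
  if (pending_review >= 3) return 'red';           // 3+ quarantined
  if (pending_review >= 1) return 'amber';         // 1-2 quarantined
  return 'green';                                  // 0 quarantined emails
}
```

Keeping this mapping client-side keeps the dashboard function a plain data source; changing a threshold is a portal change, not a database migration.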

Dashboard Data Function

CREATE OR REPLACE FUNCTION raw_ops.fn_get_system_health_dashboard_v1()
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
  v_result JSONB;
BEGIN
  SELECT jsonb_build_object(
    'generated_at', now(),

    'order_pipeline', (
      SELECT jsonb_build_object(
        'failed_24h', COUNT(*) FILTER (WHERE processing_status = 'failed'
          AND received_at > now() - interval '24 hours'),
        'pending', COUNT(*) FILTER (WHERE processing_status = 'pending'),
        'slo_breach', EXISTS (
          SELECT 1 FROM webhook_inbox
          WHERE processing_status = 'pending'
            AND received_at < now() - interval '5 minutes'
        ),
        'oldest_pending_mins', COALESCE(
          EXTRACT(EPOCH FROM (now() - MIN(received_at)
            FILTER (WHERE processing_status = 'pending')))::int / 60, 0)
      ) FROM webhook_inbox
    ),

    'email_delivery', (
      SELECT jsonb_build_object(
        'stuck', COUNT(*) FILTER (WHERE status = 'pending'
          AND created_at < now() - interval '30 minutes'),
        'pending_total', COUNT(*) FILTER (WHERE status = 'pending'),
        'failed_today', COUNT(*) FILTER (WHERE status = 'failed'
          AND created_at >= CURRENT_DATE)
      ) FROM email_outbox
    ),

    'inventory', (
      SELECT COALESCE(jsonb_agg(jsonb_build_object(
        'sku', sku, 'total_units', total_units,
        'level', CASE
          WHEN total_units <= 20 THEN 'critical'
          WHEN total_units <= 40 THEN 'low'
          ELSE 'ok' END
      )), '[]'::jsonb)
      FROM (
        SELECT p.sku, SUM(FLOOR(i.qty_available / p.kg))::int as total_units
        FROM inventory i
        JOIN batches b ON i.batch_id = b.id
        JOIN products p ON b.product_id = p.id
        WHERE b.status = 'RELEASED' AND i.qty_available > 0
        GROUP BY p.sku
      ) inv
    ),

    'batch_testing', (
      SELECT jsonb_build_object(
        'qa_hold_count', COUNT(*),
        'oldest_qa_hold_days', COALESCE(EXTRACT(DAY FROM (now() - MIN(created_at)))::int, 0),
        'over_3_days', COUNT(*) FILTER (WHERE created_at < now() - interval '3 days'),
        'over_5_days', COUNT(*) FILTER (WHERE created_at < now() - interval '5 days')
      ) FROM batches WHERE status = 'QA_HOLD'
    ),

    'lab_processing', (
      SELECT jsonb_build_object(
        'pending_review', COUNT(*) FILTER (WHERE status = 'pending_review'),
        'oldest_hours', COALESCE(
          EXTRACT(EPOCH FROM (now() - MIN(created_at)
            FILTER (WHERE status = 'pending_review')))::int / 3600, 0)
      ) FROM lab_email_failures
    ),

    'refunds', (
      SELECT jsonb_build_object(
        'dead_lettered', COUNT(*) FILTER (WHERE dead_lettered_at IS NOT NULL),
        'failed_retrying', COUNT(*) FILTER (
          WHERE status = 'failed' AND dead_lettered_at IS NULL),
        'pending', COUNT(*) FILTER (WHERE status = 'pending'),
        'oldest_pending_mins', COALESCE(
          EXTRACT(EPOCH FROM (now() - MIN(created_at)
            FILTER (WHERE status = 'pending')))::int / 60, 0)
      ) FROM refunds
    ),

    'subscriptions', (
      SELECT jsonb_build_object(
        'failed_webhooks', COUNT(*) FILTER (WHERE processing_status = 'failed'),
        'pending_webhooks', COUNT(*) FILTER (WHERE processing_status = 'pending')
      ) FROM seal_webhook_inbox
    ),

    'payments', (
      SELECT jsonb_build_object(
        'dunning_active', COUNT(*),
        'day7_plus', COUNT(*) FILTER (
          WHERE dunning_started_at < now() - interval '7 days'),
        'day10_plus', COUNT(*) FILTER (
          WHERE dunning_started_at < now() - interval '10 days')
      ) FROM subscriptions
      WHERE payment_status IN ('failing', 'failed') AND status = 'ACTIVE'
    ),

    'deliveries', (
      SELECT jsonb_build_object(
        'total_7d', COUNT(*),
        'exceptions_7d', COUNT(*) FILTER (
          WHERE status IN ('EXCEPTION', 'FAILED', 'RETURNED')),
        'exception_rate_pct', CASE WHEN COUNT(*) > 0
          THEN ROUND(COUNT(*) FILTER (
            WHERE status IN ('EXCEPTION', 'FAILED', 'RETURNED'))
            * 100.0 / COUNT(*), 1)
          ELSE 0 END
      ) FROM shipments
      WHERE created_at > now() - interval '7 days'
    ),

    'support', (
      SELECT jsonb_build_object(
        'unassigned_total', COUNT(*) FILTER (
          WHERE assigned_to IS NULL AND status NOT IN ('resolved', 'closed')),
        'unassigned_over_1h', COUNT(*) FILTER (
          WHERE assigned_to IS NULL AND status NOT IN ('resolved', 'closed')
          AND created_at < now() - interval '1 hour'),
        'open_total', COUNT(*) FILTER (
          WHERE status NOT IN ('resolved', 'closed'))
      ) FROM support_tickets
    ),

    'referrals', (
      SELECT jsonb_build_object(
        'fraud_flagged', COUNT(*) FILTER (WHERE status = 'fraud_flagged'),
        'oldest_flagged_days', COALESCE(
          EXTRACT(DAY FROM (now() - MIN(created_at)
            FILTER (WHERE status = 'fraud_flagged')))::int, 0)
      ) FROM referrals
    ),

    'proof_pages', (
      SELECT jsonb_build_object(
        'last_check_at', MAX(run_at),
        'last_status', (
          SELECT status FROM monitoring_runs
          WHERE check_name = 'proof_portal_health'
          ORDER BY run_at DESC LIMIT 1),
        'minutes_since_check', COALESCE(
          EXTRACT(EPOCH FROM (now() - MAX(run_at)))::int / 60, 999)
      ) FROM monitoring_runs
      WHERE check_name = 'proof_portal_health'
    ),

    'background_jobs', raw_ops.fn_get_cron_health_summary_v1()

  ) INTO v_result;

  RETURN v_result;
END;
$$;

Cron Health from Registry (not pattern matching)

v1.0 used LIKE '%* * * *%' on cron schedule strings. That is unreliable. v1.1 uses an explicit registry of critical jobs with expected maximum silence thresholds.

CREATE TABLE IF NOT EXISTS raw_ops.critical_cron_registry (
  job_name TEXT PRIMARY KEY,
  display_name TEXT NOT NULL,
  max_silence_minutes INTEGER NOT NULL,
  category TEXT NOT NULL DEFAULT 'operations',
  created_at TIMESTAMPTZ DEFAULT now()
);

INSERT INTO raw_ops.critical_cron_registry (job_name, display_name, max_silence_minutes, category)
VALUES
  ('process-outbox-every-minute', 'Email sender', 5, 'email'),
  ('process-lifecycle-events', 'Lifecycle events', 5, 'email'),
  ('monitor-outbox-health', 'Outbox health check', 10, 'monitoring'),
  ('monitor-order-ingestion-slo', 'Order SLO check', 10, 'monitoring'),
  ('monitor-heartbeat', 'Monitoring heartbeat', 20, 'monitoring'),
  ('monitor-webhook-health', 'Webhook health check', 15, 'monitoring'),
  ('monitor-stock-levels', 'Stock level check', 20, 'inventory'),
  ('monitor-lab-ingestion', 'Lab ingestion check', 20, 'lab'),
  ('dunning-processor', 'Payment dunning', 30, 'subscriptions'),
  ('canary-system-health', 'Infrastructure canary', 15, 'infrastructure'),
  ('process-refund-retries', 'Refund retry processor', 5, 'refunds'),
  ('process-proof-jobs', 'Proof page generator', 5, 'proof')
ON CONFLICT (job_name) DO NOTHING;

CREATE OR REPLACE FUNCTION raw_ops.fn_get_cron_health_summary_v1()
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
BEGIN
  RETURN (
    SELECT
      jsonb_build_object(
      'total_critical', COUNT(*),
      'healthy', COUNT(*) FILTER (WHERE health = 'healthy'),
      'stale', COUNT(*) FILTER (WHERE health = 'stale'),
      'critical', COUNT(*) FILTER (WHERE health IN ('failed', 'overdue', 'missing', 'never_run')),
      'disabled', COUNT(*) FILTER (WHERE health = 'disabled'),
      'jobs', jsonb_agg(job_detail ORDER BY health DESC, display_name)
    )
    FROM (
      SELECT
        r.job_name,
        r.display_name,
        r.max_silence_minutes,
        r.category,
        j.active,
        lr.start_time as last_run,
        lr.status as last_status,
        CASE
          WHEN j.jobid IS NULL THEN 'missing'
          WHEN j.active = false THEN 'disabled'
          WHEN lr.status = 'failed' THEN 'failed'
          WHEN lr.start_time IS NULL THEN 'never_run'
          WHEN lr.start_time < now() - (r.max_silence_minutes * 2 || ' minutes')::interval THEN 'overdue'
          WHEN lr.start_time < now() - (r.max_silence_minutes || ' minutes')::interval THEN 'stale'
          ELSE 'healthy'
        END as health,
        jsonb_build_object(
          'job_name', r.job_name,
          'display_name', r.display_name,
          'category', r.category,
          'active', COALESCE(j.active, false),
          'last_run', lr.start_time,
          'last_status', lr.status,
          'last_error', CASE WHEN lr.status = 'failed'
            THEN LEFT(lr.return_message, 200) ELSE NULL END,
          'max_silence_minutes', r.max_silence_minutes,
          'health', CASE
            WHEN j.jobid IS NULL THEN 'missing'
            WHEN j.active = false THEN 'disabled'
            WHEN lr.status = 'failed' THEN 'failed'
            WHEN lr.start_time IS NULL THEN 'never_run'
            WHEN lr.start_time < now() - (r.max_silence_minutes * 2 || ' minutes')::interval THEN 'overdue'
            WHEN lr.start_time < now() - (r.max_silence_minutes || ' minutes')::interval THEN 'stale'
            ELSE 'healthy'
          END
        ) as job_detail
      FROM raw_ops.critical_cron_registry r
      LEFT JOIN cron.job j ON j.jobname = r.job_name
      LEFT JOIN LATERAL (
        SELECT * FROM cron.job_run_details jrd
        WHERE jrd.jobid = j.jobid
        ORDER BY jrd.start_time DESC LIMIT 1
      ) lr ON true
    ) health_check
  );
END;
$$;
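The CASE expression above is a strict precedence chain: missing beats disabled beats failed, and so on down to healthy. A portal-side mirror of that chain (illustrative only; the database function is authoritative, and the field names here are assumptions) could look like:

```javascript
// Sketch: the same health precedence as the SQL CASE, for client-side display.
// `lastRun` is a millisecond timestamp or null; `exists` means the job name
// was found in cron.job.
function jobHealth({ exists, active, lastStatus, lastRun }, maxSilenceMinutes, nowMs = Date.now()) {
  if (!exists) return 'missing';
  if (!active) return 'disabled';
  if (lastStatus === 'failed') return 'failed';
  if (!lastRun) return 'never_run';
  const silence = (nowMs - lastRun) / 60000;        // minutes since last start
  if (silence > maxSilenceMinutes * 2) return 'overdue';
  if (silence > maxSilenceMinutes) return 'stale';
  return 'healthy';
}
```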

UI Specification

Auto-refresh: Every 60 seconds. Manual refresh button.

Tile interaction: Clicking any amber/red tile scrolls to and expands the relevant Exception Queue panel.

Tile layout: 4 columns desktop, 2 tablet, 1 mobile. Sort: red first, amber second, green last.

Colour tokens: Green #2D5144 on #E8F5E9, Amber #F57F17 on #FFF8E1, Red #C62828 on #FFEBEE.
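The red-first sort reduces to a rank lookup; a minimal sketch (tile shape assumed, not defined elsewhere in this SOP):

```javascript
// Sketch: order dashboard tiles red → amber → green. Array.prototype.sort is
// stable, so tiles keep their table order within each colour band.
const STATUS_RANK = { red: 0, amber: 1, green: 2 };
const sortTiles = (tiles) =>
  [...tiles].sort((a, b) => STATUS_RANK[a.status] - STATUS_RANK[b.status]);
```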


Part 3: Exception Queue

Design Principles

  1. Every action has an explicit execution path. "What picks it up, when, and what happens if it doesn't" is documented for every state-flip.
  2. No bulk "retry all" buttons. Operator selects specific items or uses capped batches with confirmation.
  3. Human-first labels. UUIDs hidden by default. Error types shown as plain English. Technical detail in expandable panel.
  4. Confirmation dialogs on every mutation with downstream impact description.

3.1 Lab Quarantine Viewer

Source: raw_ops.lab_email_failures WHERE status = 'pending_review'

Columns (human labels):

| Display Label | Source |
| --- | --- |
| Received | created_at (formatted: "2 hours ago") |
| Email Subject | email_subject |
| Problem | Human label from error_type (see table below) |
| Batch (if known) | Extracted from email subject or batch_code field |

Error type → human label:

| error_type | Display | Guidance Text |
| --- | --- | --- |
| batch_not_found | Batch not created yet | Create the batch in the Batches tab first, then retry. |
| batch_not_in_qa_hold | Batch already processed | This batch was already released or rejected. Check batch status. |
| ai_parsing_failed | Could not read the PDF | Open the PDF manually and enter results using the form below. |
| incomplete_parse | PDF partially readable | Some data was missing. Open the PDF and enter results manually. |
| no_pdf_attachment | No PDF in email | This email matched the lab filter but had no PDF attached. Dismiss if not a lab result. |
| storage_upload_failed | System error | Escalate to founder. This is an infrastructure issue. |
| db_write_failed | System error | Escalate to founder. This is an infrastructure issue. |

Actions:

"View PDF" — Opens PDF from Supabase Storage. No mutation.

"Retry" — Re-queues for processing. Enabled only for batch_not_found, ai_parsing_failed, and incomplete_parse.

Execution path: Sets status to retry_queued. The lab-email-ingestion Edge Function's 30-minute fallback poll picks up retry_queued items and reprocesses them. If the Edge Function is not running, the lab_ingestion_health monitor will fire a warning within 15 minutes.

CREATE OR REPLACE FUNCTION raw_ops.fn_retry_lab_quarantine_v1(
  p_failure_id UUID,
  p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
  v_failure RECORD;
BEGIN
  SELECT * INTO v_failure
  FROM lab_email_failures
  WHERE id = p_failure_id AND status = 'pending_review';

  IF v_failure IS NULL THEN
    RETURN jsonb_build_object('success', false, 'error', 'Item not found or not pending review');
  END IF;

  IF v_failure.error_type IN ('storage_upload_failed', 'db_write_failed') THEN
    RETURN jsonb_build_object('success', false,
      'error', 'This is an infrastructure error. Please escalate to the founder.');
  END IF;

  -- Only allow retry for error types where retry has a reasonable chance of success
  IF v_failure.error_type NOT IN ('batch_not_found', 'ai_parsing_failed', 'incomplete_parse') THEN
    RETURN jsonb_build_object('success', false,
      'error', 'This error type (' || v_failure.error_type || ') cannot be resolved by retry. '
        || 'Use "Enter Results Manually" or "Dismiss" instead.');
  END IF;

  UPDATE lab_email_failures
  SET status = 'retry_queued', updated_at = now()
  WHERE id = p_failure_id;

  INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
  VALUES ('LAB_QUARANTINE', p_failure_id, 'INFO', 'Quarantined lab email queued for retry',
    jsonb_build_object('agent_id', p_agent_id, 'error_type', v_failure.error_type));

  RETURN jsonb_build_object('success', true, 'failure_id', p_failure_id);
END;
$$;

"Enter Results Manually" — Opens manual lab result form. See Part 4.

"Dismiss" — For non-lab emails. Requires a reason.

CREATE OR REPLACE FUNCTION raw_ops.fn_dismiss_lab_quarantine_v1(
  p_failure_id UUID,
  p_reason TEXT,
  p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
BEGIN
  UPDATE lab_email_failures
  SET status = 'dismissed', resolved_at = now(),
      resolved_by = p_agent_id::text,
      resolution_notes = p_reason
  WHERE id = p_failure_id AND status = 'pending_review';

  IF NOT FOUND THEN
    RETURN jsonb_build_object('success', false, 'error', 'Item not found or not pending');
  END IF;

  INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
  VALUES ('LAB_QUARANTINE', p_failure_id, 'INFO', 'Quarantined email dismissed',
    jsonb_build_object('agent_id', p_agent_id, 'reason', p_reason));

  RETURN jsonb_build_object('success', true);
END;
$$;

3.2 Dead-Lettered Refunds

Source: raw_ops.refunds WHERE dead_lettered_at IS NOT NULL

Columns: Customer (email), Order (order number), Amount, Type, Failure Reason, When

Expandable detail: Shopify error message, retry count, refund ID (for Shopify Admin lookup).

Actions:

"Send again" — Resets refund to pending for automatic retry.

Execution path: Sets status to pending, clears dead-letter fields. The on_refund_status_change database trigger fires, which calls process-refund Edge Function via pg_net within seconds. If the Edge Function is down, the refund stays in pending and the refund_health monitor fires within 15 minutes.

CREATE OR REPLACE FUNCTION raw_ops.fn_retry_dead_lettered_refund_v1(
  p_refund_id UUID,
  p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
  v_refund RECORD;
BEGIN
  SELECT * INTO v_refund FROM refunds WHERE id = p_refund_id;

  IF v_refund IS NULL OR v_refund.dead_lettered_at IS NULL THEN
    RETURN jsonb_build_object('success', false, 'error', 'Refund not found or not dead-lettered');
  END IF;

  UPDATE refunds
  SET status = 'pending',
      dead_lettered_at = NULL,
      dead_letter_reason = NULL,
      retry_count = 0,
      next_retry_at = NULL,
      shopify_error = NULL
  WHERE id = p_refund_id;

  INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
  VALUES ('REFUND', p_refund_id, 'INFO', 'Dead-lettered refund retried via portal',
    jsonb_build_object('agent_id', p_agent_id, 'amount_gbp', v_refund.amount_gbp));

  RETURN jsonb_build_object('success', true, 'refund_id', p_refund_id);
END;
$$;

"Record manual refund" — For cases where the Ops Lead processed the refund directly in Shopify Admin.

Form: Shopify refund ID (from Shopify Admin), confirmation checkbox ("I have confirmed this refund was processed in Shopify").

CREATE OR REPLACE FUNCTION raw_ops.fn_resolve_refund_manually_v1(
  p_refund_id UUID,
  p_shopify_refund_id TEXT,
  p_notes TEXT,
  p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
BEGIN
  UPDATE refunds
  SET status = 'processed',
      dead_lettered_at = NULL,
      dead_letter_reason = NULL,
      shopify_refund_id = p_shopify_refund_id,
      internal_notes = COALESCE(internal_notes, '') ||
        E'\n[Manual: ' || p_notes || ' | agent: ' || p_agent_id || ' | ' || now()::text || ']'
  WHERE id = p_refund_id;

  IF NOT FOUND THEN
    RETURN jsonb_build_object('success', false, 'error', 'Refund not found');
  END IF;

  INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
  VALUES ('REFUND', p_refund_id, 'INFO', 'Refund resolved manually via portal',
    jsonb_build_object('agent_id', p_agent_id, 'shopify_refund_id', p_shopify_refund_id));

  RETURN jsonb_build_object('success', true);
END;
$$;

3.3 Failed Webhooks (Safe Retry)

Replaces v1.0's "Retry All Failed" with safe operator controls.

Source: webhook_inbox WHERE processing_status = 'failed' and seal_webhook_inbox WHERE processing_status = 'failed'

Two sub-panels: Shopify Webhooks, Seal Webhooks. Same UI pattern for both.

Columns: Received (relative time), Type (webhook topic, human-labelled), Error (first 100 chars), Attempts

UI Controls:

  1. Summary bar: "12 failed webhooks. 8 are the same error: Connection timeout."
  2. Error grouping: Group by error message. Show count per group.
  3. Select: Checkbox per item. "Select all in this group" shortcut.
  4. "Retry selected" button: Max 25 per action. Confirmation: "Retry [N] failed webhooks? If the underlying issue is still present, these will fail again."
  5. "Retry next 10" button: Takes the 10 oldest failed items only.

CREATE OR REPLACE FUNCTION raw_ops.fn_retry_selected_webhooks_v1(
  p_webhook_ids UUID[],
  p_agent_id UUID,
  p_source TEXT DEFAULT 'shopify'  -- 'shopify' or 'seal'
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
  v_count INTEGER;
  v_capped_ids UUID[];
BEGIN
  -- Validate source
  IF p_source NOT IN ('shopify', 'seal') THEN
    RETURN jsonb_build_object('success', false,
      'error', 'Invalid source: ' || p_source || '. Must be shopify or seal.');
  END IF;

  -- Hard cap at 25
  v_capped_ids := p_webhook_ids[1:25];

  IF p_source = 'shopify' THEN
    UPDATE webhook_inbox
    SET processing_status = 'pending', retry_count = retry_count + 1
    WHERE id = ANY(v_capped_ids) AND processing_status = 'failed' AND retry_count < 10;
  ELSIF p_source = 'seal' THEN
    UPDATE seal_webhook_inbox
    SET processing_status = 'pending', retry_count = retry_count + 1
    WHERE id = ANY(v_capped_ids) AND processing_status = 'failed' AND retry_count < 10;
  END IF;

  GET DIAGNOSTICS v_count = ROW_COUNT;

  INSERT INTO ops_events (entity_type, kind, message, meta)
  VALUES ('WEBHOOK', 'INFO', v_count || ' ' || p_source || ' webhooks queued for retry',
    jsonb_build_object('agent_id', p_agent_id, 'count', v_count,
      'source', p_source, 'requested', array_length(p_webhook_ids, 1)));

  RETURN jsonb_build_object('success', true, 'retried', v_count,
    'capped', array_length(p_webhook_ids, 1) > 25);
END;
$$;

Execution path: Setting processing_status = 'pending' causes the webhook handler (triggered by the existing webhook processing cron or real-time trigger) to reprocess them. The webhook_health monitor checks every 5 minutes.
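The summary bar and error grouping in the UI controls above are pure client-side aggregation over the failed rows; a sketch (function and field names are assumptions, matching the Error column's 100-character truncation):

```javascript
// Sketch: group failed webhooks by error text and build the summary line,
// e.g. "12 failed webhooks. 8 are the same error: Connection timeout."
function summariseFailedWebhooks(items) {
  const groups = new Map();
  for (const item of items) {
    const key = (item.error || 'Unknown error').slice(0, 100);
    groups.set(key, (groups.get(key) || 0) + 1);
  }
  let topError = null, topCount = 0;
  for (const [err, count] of groups) {
    if (count > topCount) { topError = err; topCount = count; }
  }
  return {
    total: items.length,
    groups,
    summary: `${items.length} failed webhooks. ${topCount} are the same error: ${topError}.`,
  };
}
```

Grouping by truncated error text keeps "8 of these are the same timeout" visible at a glance, which is what makes capped, selected retries practical instead of a blind retry-all.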


3.4 Email Delivery Recovery

Source: email_outbox WHERE status = 'failed'

Note: v1.0 had a mismatch between source (stuck pending emails) and function (retried failed emails). v1.1 fixes this.

Panel shows two metrics:

  • Failed emails (status = 'failed'): Can be retried.
  • Stuck emails (status = 'pending', age > 30 minutes): Cannot be retried (they're already pending). Indicates the email worker itself is stalled. Guidance: "Check Background Jobs tile. If the email sender job is stale or failed, use 'Run email sender now' in Worker Controls."

Action: "Retry failed emails" (max 25)

CREATE OR REPLACE FUNCTION raw_ops.fn_retry_failed_outbox_v1(
  p_limit INTEGER DEFAULT 25,
  p_agent_id UUID DEFAULT NULL
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
  v_count INTEGER;
BEGIN
  WITH retried AS (
    UPDATE email_outbox
    SET status = 'pending', updated_at = now(),
        retry_count = COALESCE(retry_count, 0) + 1
    WHERE id IN (
      SELECT id FROM email_outbox
      WHERE status = 'failed' AND COALESCE(retry_count, 0) < 5
      ORDER BY created_at ASC
      LIMIT p_limit
    )
    RETURNING id
  )
  SELECT COUNT(*) INTO v_count FROM retried;

  INSERT INTO ops_events (entity_type, kind, message, meta)
  VALUES ('EMAIL_OUTBOX', 'INFO', v_count || ' failed emails queued for retry',
    jsonb_build_object('agent_id', p_agent_id, 'count', v_count));

  RETURN jsonb_build_object('success', true, 'retried', v_count);
END;
$$;

Execution path: Emails set to pending are picked up by the process-outbox cron job (runs every minute). If that job is stalled, the outbox health monitor fires within 2 minutes.


3.5 Inventory Adjustment Form

Unchanged from v1.0 except: uses agent_id from auth, not free-text.

Calls existing raw_ops.execute_manual_inventory_adjustment() with agent_id resolved from JWT.

Confirmation dialog: "You are about to adjust batch [CODE] by [±X] kg. Current stock: [Y] kg. After adjustment: [Z] kg. Reason: [reason]. This cannot be undone. Continue?"
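The dialog text can be assembled client-side before calling the adjustment RPC; a sketch with a hypothetical helper name:

```javascript
// Sketch: render the confirmation text for a manual inventory adjustment.
// A negative deltaKg already carries its minus sign; positives get a "+".
function adjustmentConfirmation(batchCode, currentKg, deltaKg, reason) {
  const sign = deltaKg >= 0 ? '+' : '';
  return `You are about to adjust batch ${batchCode} by ${sign}${deltaKg} kg. ` +
    `Current stock: ${currentKg} kg. After adjustment: ${currentKg + deltaKg} kg. ` +
    `Reason: ${reason}. This cannot be undone. Continue?`;
}
```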


3.6 Referral Fraud Review Panel

Unchanged from v1.0 except: uses agent_id from auth.

Calls existing raw_ops.manual_override_referral() with agent_id.


3.7 Dunning Pipeline Viewer

Unchanged from v1.0 except: uses agent_id from auth, and Day 10+ rows show warning: "This subscription will auto-pause tomorrow. Consider calling the customer directly."


3.8 Stuck Proof Jobs

Source: raw_ops.proof_jobs WHERE state IN ('queued', 'failed') AND created_at < now() - interval '5 minutes'

When: Batch created but no QR code or proof page generated. The proof_system_health monitor fires.

Columns: Batch Code, Created, State, Attempts, Last Error

Action: "Retry" — Resets proof job for reprocessing.

Execution path: Sets state to queued and resets attempts. The process-proof-jobs Edge Function (invoked by pg_cron every minute) picks up queued jobs. If the proof worker itself is stalled, the Ops Lead should also use "Run proof generator now" in Worker Controls.

CREATE OR REPLACE FUNCTION raw_ops.fn_retry_stuck_proof_job_v1(
  p_batch_id UUID,
  p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
  v_job_id UUID;
BEGIN
  -- Reset existing stuck job
  UPDATE proof_jobs
  SET state = 'queued', attempts = 0, last_error = NULL
  WHERE batch_id = p_batch_id AND state IN ('queued', 'failed')
  RETURNING id INTO v_job_id;

  IF v_job_id IS NULL THEN
    -- Create new job if none exists
    INSERT INTO proof_jobs (batch_id, job_key, state)
    VALUES (p_batch_id, 'proof:batch:' || p_batch_id, 'queued')
    RETURNING id INTO v_job_id;
  END IF;

  INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
  VALUES ('PROOF_JOB', v_job_id, 'INFO', 'Proof job retried via portal',
    jsonb_build_object('agent_id', p_agent_id, 'batch_id', p_batch_id));

  RETURN jsonb_build_object('success', true, 'proof_job_id', v_job_id);
END;
$$;

Part 4: Manual Lab Result Entry (Redesigned)

Problem with v1.0

v1.0 captured a single Enterobacteriaceae value and declared PASS if it was below 5000 cfu/g. The actual lab certificate contains an official verdict and sub-sample data. A single-value check can produce a false PASS.

v1.2 Design: Lab Verdict First

The Ops Lead enters what the lab certificate says. The system records it. The lab's official verdict governs batch release, not a portal-side calculation.

Form Fields:

| Field | Type | Required | Notes |
|---|---|---|---|
| Batch | Dropdown (QA_HOLD batches only) | Yes | |
| Lab Name | Text (default: i2 Analytical) | Yes | |
| Certificate Date | Date picker | Yes | |
| Certificate Reference | Text | No | |
| Lab Verdict | Dropdown: PASS / FAIL | Yes | As stated on the certificate |
| Salmonella (25g) | Dropdown: Not Detected / Detected | Yes | Stored as 'absent' / 'detected' |
| Listeria (25g) | Dropdown: Not Detected / Detected | Yes | Stored as 'absent' / 'detected' |
| Enterobacteriaceae | Number (cfu/g, official value) | Yes | |
| Lab Certificate PDF | File upload | Yes | Stored in Supabase Storage. Required for both PASS and FAIL. |
| Linked Quarantine Item | Auto-populated if opened from quarantine | Auto | |

Safety Guards:

  1. Lab verdict is the governing field. However, a PASS verdict is hard-blocked if pathogens are detected. If the operator selects PASS but enters Salmonella = Detected or Listeria = Detected, the form refuses submission: "Cannot submit PASS with a detected pathogen. Check the certificate. If the lab certificate genuinely says PASS despite these values, escalate to the founder." This is a safety invariant, not a UX choice.

  2. One result per batch. Idempotency guard prevents double-submit and prevents manual entry when an automated result already exists:

-- One lab result per batch, period. Prevents both double-manual and manual-after-automated.
CREATE UNIQUE INDEX IF NOT EXISTS idx_lab_results_one_per_batch
  ON raw_ops.lab_results (batch_id);

  3. Batch must be in QA_HOLD. Function rejects if batch is already released/rejected.

  4. PDF required for both PASS and FAIL. Failed-batch evidence is as important as pass evidence for compliance.
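On the portal side, the PASS-with-pathogen hard-block can be mirrored as a pre-submission check so the operator gets immediate feedback. This is a minimal sketch; the field names (`labVerdict`, `salmonella`, `listeria`) are illustrative, and the server-side function remains the authoritative guard:

```typescript
// Hypothetical client-side mirror of the server's PASS+pathogen hard-block.
// The SQL function fn_manual_lab_result_entry_v1 enforces the same invariant
// server-side; this only saves the operator a round trip.
type PathogenResult = "absent" | "detected";

interface LabEntry {
  labVerdict: "PASS" | "FAIL";
  salmonella: PathogenResult;
  listeria: PathogenResult;
}

// Returns an error message to show the operator, or null if submission may proceed.
function validateLabEntry(entry: LabEntry): string | null {
  if (
    entry.labVerdict === "PASS" &&
    (entry.salmonella === "detected" || entry.listeria === "detected")
  ) {
    return (
      "Cannot submit PASS with a detected pathogen. Check the certificate. " +
      "If the lab certificate genuinely says PASS despite these values, " +
      "escalate to the founder."
    );
  }
  return null;
}
```

Because the server enforces the same rule, a bypassed or stale client still cannot record an unsafe PASS.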

CREATE OR REPLACE FUNCTION raw_ops.fn_manual_lab_result_entry_v1(
  p_batch_code TEXT,
  p_lab_name TEXT,
  p_reported_at DATE,
  p_certificate_ref TEXT,
  p_lab_verdict TEXT,          -- 'PASS' or 'FAIL' (from certificate)
  p_salmonella TEXT,           -- 'absent' or 'detected'
  p_listeria TEXT,             -- 'absent' or 'detected'
  p_enterobacteriaceae NUMERIC,
  p_certificate_pdf_path TEXT,
  p_agent_id UUID,
  p_quarantine_id UUID DEFAULT NULL
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
  v_batch RECORD;
  v_lab_result_id UUID;
BEGIN
  -- Validate batch exists and is in QA_HOLD
  SELECT * INTO v_batch FROM batches WHERE batch_code = p_batch_code;
  IF v_batch IS NULL THEN
    RETURN jsonb_build_object('success', false, 'error', 'Batch not found');
  END IF;
  IF v_batch.status != 'QA_HOLD' THEN
    RETURN jsonb_build_object('success', false,
      'error', 'Batch is ' || v_batch.status || ', not QA_HOLD. Cannot enter results.');
  END IF;

  -- Hard-block: PASS with detected pathogen is not allowed
  IF p_lab_verdict = 'PASS' AND (p_salmonella = 'detected' OR p_listeria = 'detected') THEN
    RETURN jsonb_build_object('success', false,
      'error', 'Cannot record PASS with a detected pathogen. '
        || 'Check the certificate. If the lab certificate genuinely says PASS despite these values, '
        || 'escalate to the founder for manual review.');
  END IF;

  -- Insert lab result. The on_lab_result_insert trigger handles batch release/rejection.
  -- Lab verdict governs the outcome, not portal-side calculation.
  -- source_email_id is intentionally omitted (NULL for manual entries); idempotency
  -- is enforced by the unique index on batch_id (one result per batch, any entry method).
  INSERT INTO lab_results (
    batch_id, lab_name, reported_at, certificate_reference,
    salmonella_result, listeria_result, enterobacteriaceae_cfu_g,
    outcome, entered_by, entry_method, certificate_pdf_path
  ) VALUES (
    v_batch.id, p_lab_name, p_reported_at, p_certificate_ref,
    p_salmonella, p_listeria, p_enterobacteriaceae,
    p_lab_verdict, p_agent_id::text, 'manual_portal', p_certificate_pdf_path
  )
  RETURNING id INTO v_lab_result_id;

  -- Mark quarantine item as resolved if linked
  IF p_quarantine_id IS NOT NULL THEN
    UPDATE lab_email_failures
    SET status = 'resolved_manual', resolved_at = now(),
        resolved_by = p_agent_id::text
    WHERE id = p_quarantine_id;
  END IF;

  INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
  VALUES ('LAB_RESULT', v_lab_result_id, 'INFO', 'Lab result entered manually via portal',
    jsonb_build_object(
      'agent_id', p_agent_id, 'batch_code', p_batch_code,
      'lab_verdict', p_lab_verdict
    ));

  RETURN jsonb_build_object(
    'success', true,
    'lab_result_id', v_lab_result_id,
    'batch_code', p_batch_code,
    'lab_verdict', p_lab_verdict,
    'batch_new_status', CASE WHEN p_lab_verdict = 'PASS' THEN 'RELEASED' ELSE 'REJECTED' END
  );
EXCEPTION
  WHEN unique_violation THEN
    RETURN jsonb_build_object('success', false,
      'error', 'A lab result has already been entered for this batch. Check the Batches tab.');
END;
$$;

Execution path: The INSERT INTO lab_results fires the existing on_lab_result_insert trigger, which handles batch status change to RELEASED or REJECTED, order allocation (if PASS), proof portal update, and partner notification. This is the same path as automated lab processing. The manual form is just a different entry point into the same pipeline.


Part 5: Worker Controls

Purpose

When a background worker is stalled, the Ops Lead needs a "run it now" button, not just visibility that it's broken.

Available Controls

| Button Label | What It Does | When To Use |
|---|---|---|
| Run email sender now | Invokes process-outbox Edge Function | Emails stuck as pending; email worker cron stale |
| Run proof generator now | Invokes process-proof-jobs Edge Function | Proof jobs stuck as queued |
| Run webhook processor now | Invokes shopify-webhook reprocessor | Webhooks stuck as pending after retry |
| Run dunning processor now | Invokes process-dunning Edge Function | Dunning emails not sending on schedule |
| Run lifecycle processor now | Invokes send-lifecycle-events Edge Function | Lifecycle events not sending on schedule |

Implementation

Each button calls an Edge Function via the Ops Portal's authenticated Supabase client. The Edge Functions already exist; the portal just needs a way to invoke them manually.

CREATE OR REPLACE FUNCTION raw_ops.fn_trigger_worker_run_v1(
  p_worker TEXT,
  p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
  v_allowed_workers TEXT[] := ARRAY[
    'process-outbox', 'process-proof-jobs', 'process-dunning',
    'shopify-webhook', 'send-lifecycle-events'
  ];
BEGIN
  IF NOT (p_worker = ANY(v_allowed_workers)) THEN
    RETURN jsonb_build_object('success', false,
      'error', 'Worker not in allowed list: ' || p_worker);
  END IF;

  -- Use pg_net to invoke the Edge Function (same pattern as fn_invoke_monitor)
  PERFORM raw_ops.fn_invoke_edge_function(p_worker, '{}'::jsonb);

  INSERT INTO ops_events (entity_type, kind, message, meta)
  VALUES ('WORKER', 'INFO', 'Worker manually triggered: ' || p_worker,
    jsonb_build_object('agent_id', p_agent_id, 'worker', p_worker));

  RETURN jsonb_build_object('success', true, 'worker', p_worker, 'invoked_at', now());
END;
$$;

Confirmation dialog: "This will run the [worker name] immediately. Use this when the background job appears stalled. Continue?"

Cooldown: Button disabled for 60 seconds after invocation to prevent double-firing.
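The 60-second cooldown can be kept as simple per-worker state in the portal. A minimal sketch with an injectable clock so the behaviour is testable (class and method names are illustrative):

```typescript
// Per-worker cooldown gate for the "run now" buttons. The clock is injected
// so tests can advance time deterministically; production uses Date.now.
const COOLDOWN_MS = 60_000;

class WorkerButtonCooldown {
  private lastInvoked = new Map<string, number>();

  constructor(private now: () => number = Date.now) {}

  // Returns true if the worker may be invoked now, and records the invocation.
  // Returns false if the worker is still inside its cooldown window.
  tryInvoke(worker: string): boolean {
    const t = this.now();
    const last = this.lastInvoked.get(worker);
    if (last !== undefined && t - last < COOLDOWN_MS) return false;
    this.lastInvoked.set(worker, t);
    return true;
  }
}
```

The cooldown is per worker, so a stalled email sender does not lock the proof generator button.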


Part 6: Missing-Event Recovery

Problem

v1.0 handled failed rows. It did not handle events that never arrived.

6.1 Pull Missing Order from Shopify

When: A customer says their order went through but it's not in the system. Or the order_ingestion_slo monitor fires but webhook_inbox shows no failed webhooks (meaning the webhook never arrived).

Form: Single field: Shopify Order Number (e.g., #1042)

Implementation: Calls a new Edge Function that fetches the order from Shopify Admin API and replays it through the standard ingestion pipeline.

Portal → ops_replay_shopify_order RPC → pg_net → replay-shopify-order Edge Function
Edge Function:
  1. GET /admin/api/2024-01/orders/{order_id}.json (by order number)
  2. Transform to webhook payload format
  3. Call fn_process_shopify_order_v3() (same function used by live webhooks)
  4. Return result

CREATE OR REPLACE FUNCTION raw_ops.fn_request_order_replay_v1(
  p_order_number TEXT,
  p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
BEGIN
  -- Check if order already exists
  IF EXISTS (SELECT 1 FROM orders WHERE order_number_label = p_order_number) THEN
    RETURN jsonb_build_object('success', false,
      'error', 'Order ' || p_order_number || ' already exists in the system.');
  END IF;

  -- Invoke the replay Edge Function via pg_net
  PERFORM raw_ops.fn_invoke_edge_function('replay-shopify-order',
    jsonb_build_object('order_number', p_order_number));

  INSERT INTO ops_events (entity_type, kind, message, meta)
  VALUES ('ORDER', 'INFO', 'Order replay requested: ' || p_order_number,
    jsonb_build_object('agent_id', p_agent_id, 'order_number', p_order_number));

  RETURN jsonb_build_object('success', true,
    'message', 'Order replay requested. Check the order list in 30 seconds.');
END;
$$;

Note: The replay-shopify-order Edge Function is a new function that needs to be built. It is a thin wrapper: fetch from Shopify Admin API, transform, call the existing processing function.


Part 7: Compliance Recording

7.1-7.4: Unchanged from v1.0

Temperature Audit Form, Proof Portal Spot Check, HACCP Evidence Pack Logger, Lab Sample Submission Log. All unchanged except: all use agent_id from auth, not free-text.

7.5 Rejected Batch Disposition (New)

When: A batch fails lab testing (status = REJECTED). Someone must record what happened to the physical stock.

Form Fields:

| Field | Type | Required |
|---|---|---|
| Batch Code | Dropdown (REJECTED batches without disposition) | Yes |
| Disposition Action | Dropdown: Destroyed / Returned to Supplier / Held for Re-test | Yes |
| Quantity Disposed (kg) | Number | Yes |
| Disposal Method | Text (e.g., "Commercial waste contractor") | Yes if Destroyed |
| Witnessed By | Text | Yes if Destroyed |
| Date | Date | Yes |
| Notes | Text | No |
| Evidence (photos/docs) | File upload | No |

CREATE TABLE IF NOT EXISTS raw_ops.batch_dispositions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  batch_id UUID NOT NULL REFERENCES raw_ops.batches(id),
  batch_code TEXT NOT NULL,
  disposition_action TEXT NOT NULL
    CHECK (disposition_action IN ('destroyed', 'returned_to_supplier', 'held_for_retest')),
  quantity_kg NUMERIC NOT NULL,
  disposal_method TEXT,
  witnessed_by TEXT,
  disposition_date DATE NOT NULL,
  evidence_paths TEXT[],
  notes TEXT,
  recorded_by UUID NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE UNIQUE INDEX idx_batch_disposition_one_per_batch
  ON raw_ops.batch_dispositions(batch_id);

7.6 Compliance Insert Functions

All compliance forms use the same pattern: validate inputs, insert record, log to ops_events, return success.

-- Temperature Audit
CREATE OR REPLACE FUNCTION raw_ops.fn_record_temperature_audit_v1(
  p_audit_date DATE, p_shipment_id UUID, p_logger_serial TEXT,
  p_max_temp NUMERIC, p_min_temp NUMERIC, p_duration_above_minus5 INTEGER,
  p_evidence_paths TEXT[], p_notes TEXT, p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql SECURITY DEFINER SET search_path = raw_ops, pg_temp AS $$
DECLARE v_pass BOOLEAN; v_id UUID;
BEGIN
  v_pass := (p_max_temp <= -5.0) OR (p_max_temp <= 0.0 AND COALESCE(p_duration_above_minus5, 0) < 30);
  INSERT INTO temperature_audit_records (audit_date, shipment_id, logger_serial,
    max_temp_celsius, min_temp_celsius, duration_above_minus5_mins, pass,
    evidence_paths, notes, recorded_by)
  VALUES (p_audit_date, p_shipment_id, p_logger_serial, p_max_temp, p_min_temp,
    p_duration_above_minus5, v_pass, p_evidence_paths, p_notes, p_agent_id::text)
  RETURNING id INTO v_id;
  INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
  VALUES ('TEMP_AUDIT', v_id, 'INFO', 'Temperature audit recorded',
    jsonb_build_object('agent_id', p_agent_id, 'pass', v_pass, 'shipment_id', p_shipment_id));
  RETURN jsonb_build_object('success', true, 'id', v_id, 'pass', v_pass);
END; $$;
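The pass rule computed into `v_pass` above can also be expressed as a small pure function, which is useful if the portal wants to preview the outcome before submission. A sketch (the function name is illustrative; the SQL remains the system of record):

```typescript
// Mirror of the v_pass rule in fn_record_temperature_audit_v1:
// pass if the max temperature stayed at or below -5 °C, or rose above -5 °C
// but stayed at or below 0 °C for less than 30 minutes.
function temperatureAuditPasses(
  maxTempC: number,
  minutesAboveMinus5: number | null
): boolean {
  return (
    maxTempC <= -5.0 ||
    (maxTempC <= 0.0 && (minutesAboveMinus5 ?? 0) < 30)
  );
}
```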

-- Proof Portal Spot Check
CREATE OR REPLACE FUNCTION raw_ops.fn_record_proof_spot_check_v1(
  p_check_date DATE, p_batch_code TEXT, p_qr_ok BOOLEAN, p_page_ok BOOLEAN,
  p_results_ok BOOLEAN, p_pdf_ok BOOLEAN, p_notes TEXT, p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql SECURITY DEFINER SET search_path = raw_ops, pg_temp AS $$
DECLARE v_id UUID;
BEGIN
  INSERT INTO proof_portal_spot_checks (check_date, batch_code, qr_scan_ok,
    page_loads_ok, results_display_ok, pdf_download_ok, notes, checked_by)
  VALUES (p_check_date, p_batch_code, p_qr_ok, p_page_ok, p_results_ok, p_pdf_ok,
    p_notes, p_agent_id::text)
  RETURNING id INTO v_id;
  INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
  VALUES ('PROOF_SPOT_CHECK', v_id, 'INFO', 'Proof portal spot check recorded',
    jsonb_build_object('agent_id', p_agent_id, 'batch_code', p_batch_code,
      'all_pass', p_qr_ok AND p_page_ok AND p_results_ok AND p_pdf_ok));
  RETURN jsonb_build_object('success', true, 'id', v_id);
END; $$;

-- HACCP Evidence Pack
CREATE OR REPLACE FUNCTION raw_ops.fn_record_evidence_pack_v1(
  p_ship_date DATE, p_ship_day_type TEXT,
  p_pick_list BOOLEAN, p_qc_record BOOLEAN, p_packout_sheet BOOLEAN,
  p_photo_proofs BOOLEAN, p_logger_serials BOOLEAN, p_courier_manifest BOOLEAN,
  p_file_paths TEXT[], p_notes TEXT, p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql SECURITY DEFINER SET search_path = raw_ops, pg_temp AS $$
DECLARE v_id UUID;
BEGIN
  -- Normalize to lowercase to match CHECK constraint
  INSERT INTO evidence_packs (ship_date, ship_day_type, pick_list_filed, qc_record_filed,
    packout_sheet_filed, photo_proofs_filed, logger_serials_recorded, courier_manifest_filed,
    file_paths, notes, recorded_by)
  VALUES (p_ship_date, lower(trim(p_ship_day_type)), p_pick_list, p_qc_record, p_packout_sheet,
    p_photo_proofs, p_logger_serials, p_courier_manifest, p_file_paths, p_notes, p_agent_id::text)
  RETURNING id INTO v_id;
  INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
  VALUES ('EVIDENCE_PACK', v_id, 'INFO', 'Evidence pack recorded',
    jsonb_build_object('agent_id', p_agent_id, 'ship_date', p_ship_date));
  RETURN jsonb_build_object('success', true, 'id', v_id);
END; $$;

-- Lab Sample Submission
CREATE OR REPLACE FUNCTION raw_ops.fn_record_sample_submission_v1(
  p_batch_code TEXT, p_sample_sent_date DATE, p_lab_name TEXT,
  p_tracking_ref TEXT, p_trays_sampled INTEGER, p_notes TEXT, p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql SECURITY DEFINER SET search_path = raw_ops, pg_temp AS $$
DECLARE v_batch_id UUID; v_id UUID; v_batch_status TEXT;
BEGIN
  SELECT id, status INTO v_batch_id, v_batch_status FROM batches WHERE batch_code = p_batch_code;
  IF v_batch_id IS NULL THEN
    RETURN jsonb_build_object('success', false, 'error', 'Batch not found');
  END IF;
  IF v_batch_status != 'QA_HOLD' THEN
    RETURN jsonb_build_object('success', false,
      'error', 'Batch is ' || v_batch_status || ', not in testing. Samples can only be logged for batches awaiting lab results (QA_HOLD).');
  END IF;
  INSERT INTO lab_sample_submissions (batch_id, batch_code, sample_sent_date,
    lab_name, tracking_reference, trays_sampled, notes, recorded_by)
  VALUES (v_batch_id, p_batch_code, p_sample_sent_date, p_lab_name,
    p_tracking_ref, p_trays_sampled, p_notes, p_agent_id::text)
  RETURNING id INTO v_id;
  INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
  VALUES ('LAB_SAMPLE', v_id, 'INFO', 'Lab sample submission recorded',
    jsonb_build_object('agent_id', p_agent_id, 'batch_code', p_batch_code));
  RETURN jsonb_build_object('success', true, 'id', v_id);
END; $$;

-- Batch Disposition
CREATE OR REPLACE FUNCTION raw_ops.fn_record_batch_disposition_v1(
  p_batch_code TEXT, p_disposition_action TEXT, p_quantity_kg NUMERIC,
  p_disposal_method TEXT, p_witnessed_by TEXT, p_disposition_date DATE,
  p_evidence_paths TEXT[], p_notes TEXT, p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql SECURITY DEFINER SET search_path = raw_ops, pg_temp AS $$
DECLARE v_batch RECORD; v_id UUID;
BEGIN
  SELECT * INTO v_batch FROM batches WHERE batch_code = p_batch_code;
  IF v_batch IS NULL THEN
    RETURN jsonb_build_object('success', false, 'error', 'Batch not found');
  END IF;
  IF v_batch.status != 'REJECTED' THEN
    RETURN jsonb_build_object('success', false,
      'error', 'Only rejected batches can have a disposition record');
  END IF;
  -- Enforce witness and disposal method for destroyed batches
  IF p_disposition_action = 'destroyed' THEN
    IF p_disposal_method IS NULL OR trim(p_disposal_method) = '' THEN
      RETURN jsonb_build_object('success', false,
        'error', 'Disposal method is required when batch is destroyed');
    END IF;
    IF p_witnessed_by IS NULL OR trim(p_witnessed_by) = '' THEN
      RETURN jsonb_build_object('success', false,
        'error', 'Witness name is required when batch is destroyed');
    END IF;
  END IF;
  INSERT INTO batch_dispositions (batch_id, batch_code, disposition_action, quantity_kg,
    disposal_method, witnessed_by, disposition_date, evidence_paths, notes, recorded_by)
  VALUES (v_batch.id, p_batch_code, p_disposition_action, p_quantity_kg,
    p_disposal_method, p_witnessed_by, p_disposition_date, p_evidence_paths, p_notes, p_agent_id)
  RETURNING id INTO v_id;
  INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
  VALUES ('BATCH_DISPOSITION', v_id, 'INFO', 'Batch disposition recorded',
    jsonb_build_object('agent_id', p_agent_id, 'batch_code', p_batch_code,
      'action', p_disposition_action, 'quantity_kg', p_quantity_kg));
  RETURN jsonb_build_object('success', true, 'id', v_id);
END; $$;
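The destroyed-batch guard enforced above can be mirrored in the portal form so the operator sees the missing-field errors before submitting. A minimal sketch (field names are illustrative; `fn_record_batch_disposition_v1` remains the authoritative check):

```typescript
// Hypothetical client-side mirror of the destroyed-batch guard: a destroyed
// batch requires both a disposal method and a witness name.
interface DispositionForm {
  action: "destroyed" | "returned_to_supplier" | "held_for_retest";
  disposalMethod?: string;
  witnessedBy?: string;
}

// Returns the first validation error to show the operator, or null if OK.
function validateDisposition(form: DispositionForm): string | null {
  if (form.action === "destroyed") {
    if (!form.disposalMethod?.trim()) {
      return "Disposal method is required when batch is destroyed";
    }
    if (!form.witnessedBy?.trim()) {
      return "Witness name is required when batch is destroyed";
    }
  }
  return null;
}
```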

Part 8: Canned Diagnostics

Unchanged from v1.0 structurally. 10 pre-built read-only views. All wrapped in fn_diagnostic_*_v1() functions returning JSONB arrays, rendered as sortable/filterable tables.


Part 9: UI Language Guidelines

Principle

The Ops Lead is a compliance-trained operator, not an engineer. Every label, message, and error should be written for someone who has never seen a database.

Rules

| Instead of | Write |
|---|---|
| UUID | (hide entirely, show order number or customer email) |
| dead-lettered | stuck (needs attention) |
| retry_count | attempts |
| webhook_topic | event type (with human label: "New Order", "Subscription Updated") |
| processing_status = 'failed' | failed |
| QA_HOLD | Testing in progress |
| RELEASED | Cleared for sale |
| REJECTED | Failed testing |
| DEPLETED | Out of stock |
| fn_retry_* | (never shown, behind button label) |
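In the portal, this table naturally becomes a single lookup with a raw-value fallback, so an unmapped status is still visible rather than blank. A sketch (names are illustrative):

```typescript
// Operator-facing labels for internal statuses, per the table above.
// Unmapped values fall through unchanged so nothing renders as blank.
const OPERATOR_LABELS: Record<string, string> = {
  QA_HOLD: "Testing in progress",
  RELEASED: "Cleared for sale",
  REJECTED: "Failed testing",
  DEPLETED: "Out of stock",
  "dead-lettered": "stuck (needs attention)",
};

function operatorLabel(internal: string): string {
  return OPERATOR_LABELS[internal] ?? internal;
}
```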

Error Messages to Operator

| System Error | Show to Operator |
|---|---|
| P0001: Batch not found | That batch code wasn't found. Check for typos. |
| unique_violation | This has already been done. Check the relevant tab. |
| P0002: Agent not found | Your account isn't set up yet. Ask the founder to add you. |
| Network error | Connection lost. Check your internet and try again. |
| 500 from Edge Function | Something went wrong on our end. Try once more. If it fails again, escalate to the founder. |

Expandable Technical Detail

Every exception item has a "Show details" toggle that reveals: IDs, raw error messages, timestamps, retry counts. This is for when the Ops Lead is on a call with the founder describing an issue. It is collapsed by default.


Part 10: Alert-to-Resolution Mapping

Complete mapping of every MON-01 alert to portal action.

| Monitor | Severity | Operator Action | Founder-Only? |
|---|---|---|---|
| order_ingestion_slo | warning/critical | Check Order Pipeline tile → Retry failed webhooks → Pull missing order | If persistent after retry |
| outbox_health | warning/critical | Check Email Delivery tile → Retry failed emails → Run email sender now | If worker repeatedly crashes |
| stock_levels | critical | Inventory Adjustment Form. Call co-packer (Jamie) for reorder | No |
| lab_sla | warning | Check Lab Processing tile → Lab Quarantine Viewer. Check Lab Sample Submission Log | No |
| working_capital | warning | View Batch Testing tile. Review with founder if structural | Yes (if systemic) |
| daily_snapshot | info | Review System Health Dashboard. No action unless red | No |
| allocation_health | warning | Check Inventory tile. Inventory Adjustment if needed | No |
| webhook_health | warning | Retry selected failed webhooks | No |
| webhook_daily_digest | info | Review only | No |
| monitoring_heartbeat | critical | Check Background Jobs. Escalate if heartbeat itself is down | Yes |
| lab_ingestion_health | warning/critical | Lab Quarantine Viewer | If storage_upload_failed |
| courier_watchdog_health | warning | Review in Support Workstation. Call DPD if systemic | No |
| 3pl_export_health | warning | Check Background Jobs. Manual export in Fulfillment tab | No |
| support_health | warning | Assign and resolve tickets in Support Workstation | No |
| dispatch_processing_health | warning | Check Background Jobs + Fulfillment tab | No |
| ad_spend_freshness | warning | Check Make.com dashboard (view-only). Informational | No |
| cost_tracking_health | warning | Check Make.com dashboard (view-only). Informational | No |
| proof_system_health | warning | Stuck Proof Jobs panel (3.8). Run proof generator now (Worker Controls) | If repeated failures |
| proof_portal_health | warning | Run Proof Portal Spot Check. Report to founder if page down | Yes (if site down) |
| lifecycle_events_health | warning | Check Background Jobs. Run lifecycle processor now (Worker Controls) | If repeatedly crashing |
| abandonment_health | warning | Check Background Jobs | No |
| health_data_completeness | info | Informational only | No |
| inventory_planning_health | warning | Review inventory diagnostics. Contact co-packer | No |
| referral_system_health | warning | Referral Fraud Review Panel | No |
| monitor_coverage | info | Informational. Escalate if new monitor needed | Yes |
| canary_system_health | critical | Escalate to founder immediately | Always |
| refund_health | warning | Dead-Lettered Refunds panel | No |

Part 11: Implementation Plan

Build Order

| Phase | What | Days | Gate |
|---|---|---|---|
| 0 | Auth model | 1 | Agent resolution function, RPC wrapper pattern, portal-side JWT extraction. Test with one sample action. ChatGPT review. |
| 1a | Dashboard data function + cron registry | 1 | All 13 tiles returning correct data. ChatGPT review of tile logic. |
| 1b | Dashboard UI | 1.5 | Tiles rendering, auto-refresh, click-to-expand working. |
| 1c | Lab Quarantine (most complex exception) | 2 | List, retry, dismiss, manual entry with idempotency guard. Full test with mock data. |
| 1d | Refund + Webhook + Outbox exception panels | 2 | Safe retry with caps, confirmation dialogs, execution path verified. |
| 1e | Remaining exceptions (inventory, referral, dunning) | 1.5 | Forms + actions working. |
| 1f | Worker Controls + Order Replay | 1.5 | Run-now buttons, replay Edge Function, cooldown. |
| 1g | Canned Diagnostics | 1 | 10 views rendering as sortable tables. |
| 2 | Compliance Recording Forms | 2 | All 5 forms (temp audit, spot check, evidence pack, sample log, batch disposition). |
| Total | | ~13.5 days | |

Review Gates

Each phase ends with:

1. Claude Code implementation complete
2. Functional test with mock data
3. ChatGPT review of the implemented module (persona: Daniel)
4. Fix any gaps found
5. Move to next phase


Part 12: RPC Registry (Complete)

| Public RPC | Delegates To | Auth | Purpose |
|---|---|---|---|
| ops_get_system_health | fn_get_system_health_dashboard_v1 | Read-only | Dashboard data |
| ops_get_cron_health | fn_get_cron_health_summary_v1 | Read-only | Cron job status |
| ops_diagnostic_* (10 functions) | fn_diagnostic_*_v1 | Read-only | Canned diagnostics |
| ops_retry_lab_quarantine | fn_retry_lab_quarantine_v1 | Agent resolved | Retry quarantined lab email |
| ops_dismiss_lab_quarantine | fn_dismiss_lab_quarantine_v1 | Agent resolved | Dismiss non-lab email |
| ops_manual_lab_result | fn_manual_lab_result_entry_v1 | Agent resolved | Enter lab results manually |
| ops_retry_dead_lettered_refund | fn_retry_dead_lettered_refund_v1 | Agent resolved | Retry stuck refund |
| ops_resolve_refund_manually | fn_resolve_refund_manually_v1 | Agent resolved | Mark refund processed in Shopify |
| ops_retry_selected_webhooks | fn_retry_selected_webhooks_v1 | Agent resolved | Retry selected failed webhooks |
| ops_retry_failed_outbox | fn_retry_failed_outbox_v1 | Agent resolved | Retry failed emails |
| ops_mark_dunning_email_sent | fn_ops_mark_dunning_email_sent_v1 | Agent resolved | Mark dunning email sent |
| ops_retry_proof_job | fn_retry_stuck_proof_job_v1 | Agent resolved | Retry stuck proof job |
| ops_trigger_worker | fn_trigger_worker_run_v1 | Agent resolved | Run background worker now |
| ops_request_order_replay | fn_request_order_replay_v1 | Agent resolved | Pull missing order from Shopify |
| ops_execute_inventory_adjustment | execute_manual_inventory_adjustment | Agent resolved | Manual stock correction |
| ops_override_referral | manual_override_referral | Agent resolved | Approve/reject flagged referral |
| ops_record_temp_audit | fn_record_temperature_audit_v1 | Agent resolved | Temperature audit entry |
| ops_record_spot_check | fn_record_proof_spot_check_v1 | Agent resolved | Proof portal spot check |
| ops_record_evidence_pack | fn_record_evidence_pack_v1 | Agent resolved | HACCP evidence pack |
| ops_record_sample_submission | fn_record_sample_submission_v1 | Agent resolved | Lab sample log |
| ops_record_batch_disposition | fn_record_batch_disposition_v1 | Agent resolved | Rejected batch disposal |

Part 13: ChatGPT Review Prompt (Updated)

You are Daniel, joining Protocol Raw as Operations Lead at Month 7.

Background: Agent Onboarding Supervisor at RIA Money Transfers.
Strong process execution and compliance skills. Zero coding ability.
Never used a database, terminal, or IDE.

Your tools:
- Operations Portal (ops.protocolraw.co.uk)
- Slack (#ops-alerts, #ops-urgent, #daily-ops)
- Metabase dashboards (read-only)
- Customer.io dashboard (email campaign monitoring)
- Make.com dashboard (view-only)
- Google Drive (HACCP documentation)
- Phone/email (supplier, lab, 3PL coordination)

Physical operations responsibilities:
- 3PL/DPD coordination
- Lab sample logistics (composite samples to UKAS lab)
- HACCP documentation and FSA audit readiness
- Temperature logger audit programme (weekly, 10 shipments)
- Packaging material supply chain coordination
- Co-packer schedule alignment (with Jamie)
- Proof portal QC spot checks (monthly, 5 batches)
- Evidence pack filing per ship day
- Rejected batch disposal coordination and recording

This is v1.2, with targeted fixes after your review of v1.1 found:
- Auth transport was internally inconsistent (header vs argument)
- Server-side retry eligibility not enforced for lab quarantine
- Invalid webhook source fell through silently
- Cron health severity model did not match tile rule
- Batch disposition did not enforce witness/disposal for destroyed batches
- ship_day_type case mismatch between UI and schema
- Document drift (v1.1 labels remaining in a v1.2 document)

Review v1.2 and confirm whether v1.0 and v1.1 issues are resolved.
Then identify any remaining gaps:

1. Any scenario where you're stuck without calling the founder
   (that isn't explicitly classified as founder-only)
2. Any action that assumes technical knowledge
3. Any alert without a portal resolution path
4. Any physical workflow without a recording form
5. Any information shown without agency to act on it
6. Any confusing terminology
7. Any compliance obligation without evidence capture
8. Any execution path that is still implicit or hopeful
9. Any auth bypass or governance hole

Be thorough. Be adversarial. Find every gap.

Version History

| Version | Date | Changes | Author |
|---|---|---|---|
| 1.0 | 2026-04-12 | Initial specification | Anton + Claude |
| 1.1 | 2026-04-12 | Complete rewrite: auth model, execution paths, lab entry safety, missing-event recovery, bulk retry safety, honest governing principle, batch disposition, UI language guidelines | Anton + Claude |
| 1.2 | 2026-04-12 | Targeted fixes from ChatGPT review: auth section reworded honestly, dashboard contract completed (support tile, SLO metric), lab idempotency scope fixed, PASS+pathogen hard-blocked, worker allowlist aligned, proof job panel restored, all compliance RPCs completed. Final cleanup: auth transport unified (RPC argument only), server-side retry eligibility enforced, invalid webhook source rejected, cron severity model aligned to tile rules (stale/overdue distinction), batch disposition witness/disposal enforced for destroyed, ship_day_type normalized, all v1.1 drift removed | Anton + Claude |

Document Owner: Protocol Raw Operations
Technical Owner: Anton (Founder, Systems Architect)
Next Review: 2026-07-12


End of SOP-OPS-01 v1.2