SOP-OPS-01: Operations Self-Service Actions v1.2¶
Every operationally recoverable alert gets a safe portal action. Infrastructure and architecture alerts are explicitly classified as founder-only.
Document ID: SOP-OPS-01-v1.2
Version: 1.2
Status: ✅ Specification Ready (for current Ops Portal architecture)
Last Updated: 2026-04-12
Owner: Protocol Raw Operations
Replaces: SOP-OPS-01-v1.1
Review Date: 2026-07-12
Auth Note: Auth pattern follows established Ops Portal perimeter model (Cloudflare Access + browser-side service_role client). Actor email resolved and logged server-side for audit traceability. Fine-grained role gating deferred to Phase B when multiple operators are added.
Key Changes in v1.2¶
- ✅ Auth section reworded: honest about existing perimeter model, not claiming upgraded trust
- ✅ Dashboard contract completed: support tile added, SLO metric added to order pipeline
- ✅ Lab entry idempotency fixed: unique on batch_id regardless of entry method
- ✅ Lab entry hard-blocks PASS when pathogen detected (no longer a warning)
- ✅ Worker Controls allowlist aligned with spec table (webhook processor + lifecycle worker added)
- ✅ Proof job recovery panel restored (dropped in v1.1 by document drift)
- ✅ All 5 TBD compliance RPCs completed with insert functions
- ✅ "All RPC wrappers complete" claim is now true
- ✅ Status line reflects scope: "for current Ops Portal architecture"
Key Changes in v1.1¶
- ✅ Governing principle rewritten: honest about founder-only alerts
- ✅ Auth model: Cloudflare Access JWT → agent_id, no user-supplied actor fields
- ✅ Every action has an explicit execution path (what consumer picks up the state change)
- ✅ Manual lab entry redesigned: lab verdict first, idempotency guard, full micro inputs
- ✅ Bulk retry actions replaced with safe controls (preview, select, cap, confirm)
- ✅ Missing-event recovery added (replay order, run worker now)
- ✅ Dashboard function rewritten to match all tiles, cron health from registry
- ✅ Broken SQL fixed (webhook retry, outbox source/function mismatch)
- ✅ Rejected batch disposition form added
- ✅ UI language guidelines added (human-first, technical in expandable detail)
Purpose¶
The Operations Lead (joining Month 7-9) must be able to run day-to-day operations, resolve operationally recoverable alerts, and handle exceptions entirely from the Operations Portal.
Two classes of alert exist:
| Class | Examples | Resolution |
|---|---|---|
| Operator-recoverable | Failed webhook, stuck email, quarantined lab result, dead-lettered refund, inventory discrepancy, fraud-flagged referral | Portal action. No founder involvement. |
| Founder-only (infrastructure) | Canary system health failure, Edge Function crash/redeploy needed, database schema issue, Supabase Storage failure, repeated cron job failures after operator retry | Ops Lead escalates to founder with context from portal diagnostics. |
The Ops Lead should never need to access the Supabase dashboard, run SQL, use a terminal, or deploy code. The founder's involvement is reserved for architecture changes, code deployments, and infrastructure failures.
Scope¶
Applies to: Operations Portal (ops.protocolraw.co.uk), Portal v4.1+
New Modules:
| Module | Tab | Purpose |
|---|---|---|
| System Health Dashboard | System Health (new) | Morning check, all-system visibility |
| Exception Queue | System Health | Item-level actions for recoverable failures |
| Worker Controls | System Health | Run-now buttons for stalled background workers |
| Cron Job Status | System Health | Visibility into scheduled job health |
| Canned Diagnostics | System Health | Pre-built read-only views |
| Compliance Recording | Compliance (new) | Temperature audits, evidence packs, spot checks, sample logs, batch disposition |
Related SOPs: MON-01, LAB-01, INV-01, REF-01, ORD-01, ORD-02, ORD-03, SUB-00, CS-04, DLV-01, COLD-01, PACK-01, PROOF-00, PROOF-01
Part 1: Authentication Model¶
Existing Portal Pattern¶
The Ops Portal has used the same trust model since v2.0: Cloudflare Access (Google OAuth) as the perimeter gate, with browser-side Supabase service_role client for database calls. Every existing feature (batch creation, refund processing, support workstation, live chat) operates within this boundary.
SOP-OPS-01 inherits this pattern. It does not introduce a new auth architecture.
What this SOP improves over the existing portal pattern: Actor identity is now resolved server-side and logged on every mutation. Previously, portal actions either didn't log who performed them or used ad-hoc identifiers. The new modules ensure every action is attributable to a specific agent via the agents table.
What this does not do: This is not cryptographic caller verification. The browser sends the agent email extracted from the Cloudflare Access JWT, and the database resolves it. A caller who somehow bypasses Cloudflare Access could spoof the email. The compensating control is that Cloudflare Access is the sole entry point, and only the founder can add/remove authorised emails.
Phase B enhancement (multi-operator): When multiple operators are added, introduce role-based action gating (e.g., ops_lead can retry webhooks but not adjust inventory without ops_senior role). Not needed with a single operator.
Browser (ops.protocolraw.co.uk)
↓
Cloudflare Access (Google OAuth, validates email)
↓
Portal JavaScript (reads CF Access JWT, extracts email)
↓
Supabase RPC call (service_role key, agent email passed as function argument)
↓
Public RPC wrapper resolves email → agent_id via fn_resolve_ops_agent_v1
↓
Domain function receives trusted agent_id UUID
↓
All audit logging uses resolved agent_id
Agent Resolution Function¶
CREATE OR REPLACE FUNCTION raw_ops.fn_resolve_ops_agent_v1(p_email TEXT)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
v_agent RECORD;
BEGIN
SELECT id, display_name, role INTO v_agent
FROM agents
WHERE email = lower(trim(p_email)) AND active = true;
  IF NOT FOUND THEN
RAISE EXCEPTION 'Agent not found or inactive: %', p_email
USING ERRCODE = 'P0002';
END IF;
RETURN jsonb_build_object(
'agent_id', v_agent.id,
'display_name', v_agent.display_name,
'role', v_agent.role
);
END;
$$;
Portal-Side: Extracting Agent Email¶
The Ops Portal already loads behind Cloudflare Access. The CF Access JWT is available as a cookie (CF_Authorization). The portal JavaScript decodes the JWT to extract the authenticated email:
// In system-health.js module init
function getAgentEmail() {
  const cookie = document.cookie
    .split('; ')
    .find(c => c.startsWith('CF_Authorization='));
  if (!cookie) return null;
  const token = cookie.slice('CF_Authorization='.length);
  try {
    // JWT segments are base64url-encoded; normalise before atob(),
    // which only accepts standard base64
    const b64 = token.split('.')[1].replace(/-/g, '+').replace(/_/g, '/');
    const payload = JSON.parse(atob(b64));
    return payload.email ?? null;
  } catch { return null; }
}
RPC Wrapper Pattern (v1.2)¶
Every mutation RPC accepts the agent email, resolves it server-side, and uses the resolved agent_id for audit logging. The email comes from the Cloudflare Access JWT, not from user input.
-- Example: all v1.2 mutation RPCs follow this pattern
CREATE OR REPLACE FUNCTION public.ops_retry_lab_quarantine(p_failure_id UUID, p_agent_email TEXT)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
v_agent JSONB;
BEGIN
-- Resolve agent (throws if not found/inactive)
v_agent := raw_ops.fn_resolve_ops_agent_v1(p_agent_email);
-- Delegate to domain function with trusted agent_id
RETURN raw_ops.fn_retry_lab_quarantine_v1(p_failure_id, (v_agent->>'agent_id')::UUID);
END;
$$;
GRANT EXECUTE ON FUNCTION public.ops_retry_lab_quarantine TO service_role;
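On the portal side, the same pattern can be kept honest with a small helper that attaches the Cloudflare Access email to every mutation call. This is a sketch; `buildRpcArgs` is not named anywhere in this spec and is purely illustrative.

```javascript
// Hypothetical portal-side helper (name assumed, not part of the spec):
// ensures every mutation RPC call carries the CF Access email as p_agent_email.
function buildRpcArgs(agentEmail, args) {
  if (!agentEmail) {
    // No Cloudflare Access session — never fire a mutation RPC without an actor.
    throw new Error('No authenticated agent email');
  }
  return { ...args, p_agent_email: agentEmail };
}

// Usage with supabase-js (client assumed to exist as `supabase`):
// const { data, error } = await supabase.rpc(
//   'ops_retry_lab_quarantine',
//   buildRpcArgs(getAgentEmail(), { p_failure_id: failureId })
// );
```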
Key Constraints¶
- No function accepts p_entered_by TEXT or similar free-text actor fields. All actor identity is resolved from agent email via fn_resolve_ops_agent_v1.
- All domain functions accept p_agent_id UUID, not email or display name.
- The agents table is the source of truth. Daniel's record is created before he starts. Only the founder can add/remove agents.
- Read-only functions (dashboard, diagnostics, cron health) do not require agent resolution.
Part 2: System Health Dashboard¶
Purpose¶
Single screen the Ops Lead checks every morning. Shows the health of every automated system with green/amber/red indicators. Clicking any non-green tile opens the relevant Exception Queue panel.
Health Tiles¶
| # | Tile | Green | Amber | Red |
|---|---|---|---|---|
| 1 | Order Pipeline | 0 failed webhooks, SLO met | 1-5 failed in 24h | >5 failed OR SLO breach |
| 2 | Email Delivery | 0 stuck, <10 pending | 1-10 stuck OR 10-50 pending | >10 stuck OR >50 pending |
| 3 | Inventory | All SKUs above reorder | Any SKU in LOW band | Any SKU CRITICAL |
| 4 | Batch Testing | 0 batches >3 days in QA_HOLD | 1+ batch >3 days | 1+ batch >5 days |
| 5 | Lab Processing | 0 quarantined emails | 1-2 quarantined | 3+ quarantined |
| 6 | Refunds | 0 failed or stuck | 1-3 retrying | Any stuck >1h or dead-lettered |
| 7 | Support | 0 unassigned >1h | 1-3 waiting | >3 waiting >1h |
| 8 | Subscriptions | 0 failed Seal webhooks | 1-5 failed | >5 failed |
| 9 | Payments | No dunning stalls | Day 7+ without email | Day 10+ (imminent pause) |
| 10 | Deliveries | <2% exception rate (7d) | 2-5% exceptions | >5% exceptions |
| 11 | Referrals | 0 fraud-flagged | 1-3 flagged | >3 flagged |
| 12 | Proof Pages | Last check passed | Last check >1h ago | Last check failed |
| 13 | Background Jobs | All critical jobs healthy | 1+ stale (missed expected interval) | 1+ overdue (missed 2x expected interval), failed, missing, or never run |
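As an illustration, the Order Pipeline row reduces to a simple threshold function. The input field names here are assumed, not taken from the dashboard payload.

```javascript
// Tile 1 thresholds from the table above; input field names are assumed.
function orderPipelineStatus({ failed24h, sloBreach }) {
  if (failed24h > 5 || sloBreach) return 'red';  // >5 failed OR SLO breach
  if (failed24h >= 1) return 'amber';            // 1-5 failed in 24h
  return 'green';                                // 0 failed, SLO met
}
```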
Dashboard Data Function¶
CREATE OR REPLACE FUNCTION raw_ops.fn_get_system_health_dashboard_v1()
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
v_result JSONB;
BEGIN
SELECT jsonb_build_object(
'generated_at', now(),
'order_pipeline', (
SELECT jsonb_build_object(
'failed_24h', COUNT(*) FILTER (WHERE processing_status = 'failed'
AND received_at > now() - interval '24 hours'),
'pending', COUNT(*) FILTER (WHERE processing_status = 'pending'),
'slo_breach', EXISTS (
SELECT 1 FROM webhook_inbox
WHERE processing_status = 'pending'
AND received_at < now() - interval '5 minutes'
),
'oldest_pending_mins', COALESCE(
EXTRACT(EPOCH FROM (now() - MIN(received_at)
FILTER (WHERE processing_status = 'pending')))::int / 60, 0)
) FROM webhook_inbox
),
'email_delivery', (
SELECT jsonb_build_object(
'stuck', COUNT(*) FILTER (WHERE status = 'pending'
AND created_at < now() - interval '30 minutes'),
'pending_total', COUNT(*) FILTER (WHERE status = 'pending'),
'failed_today', COUNT(*) FILTER (WHERE status = 'failed'
AND created_at >= CURRENT_DATE)
) FROM email_outbox
),
'inventory', (
SELECT COALESCE(jsonb_agg(jsonb_build_object(
'sku', sku, 'total_units', total_units,
'level', CASE
WHEN total_units <= 20 THEN 'critical'
WHEN total_units <= 40 THEN 'low'
ELSE 'ok' END
)), '[]'::jsonb)
FROM (
SELECT p.sku, SUM(FLOOR(i.qty_available / p.kg))::int as total_units
FROM inventory i
JOIN batches b ON i.batch_id = b.id
JOIN products p ON b.product_id = p.id
WHERE b.status = 'RELEASED' AND i.qty_available > 0
GROUP BY p.sku
) inv
),
'batch_testing', (
SELECT jsonb_build_object(
'qa_hold_count', COUNT(*),
'oldest_qa_hold_days', COALESCE(EXTRACT(DAY FROM (now() - MIN(created_at)))::int, 0),
'over_3_days', COUNT(*) FILTER (WHERE created_at < now() - interval '3 days'),
'over_5_days', COUNT(*) FILTER (WHERE created_at < now() - interval '5 days')
) FROM batches WHERE status = 'QA_HOLD'
),
'lab_processing', (
SELECT jsonb_build_object(
'pending_review', COUNT(*) FILTER (WHERE status = 'pending_review'),
'oldest_hours', COALESCE(
EXTRACT(EPOCH FROM (now() - MIN(created_at)
FILTER (WHERE status = 'pending_review')))::int / 3600, 0)
) FROM lab_email_failures
),
'refunds', (
SELECT jsonb_build_object(
'dead_lettered', COUNT(*) FILTER (WHERE dead_lettered_at IS NOT NULL),
'failed_retrying', COUNT(*) FILTER (
WHERE status = 'failed' AND dead_lettered_at IS NULL),
'pending', COUNT(*) FILTER (WHERE status = 'pending'),
'oldest_pending_mins', COALESCE(
EXTRACT(EPOCH FROM (now() - MIN(created_at)
FILTER (WHERE status = 'pending')))::int / 60, 0)
) FROM refunds
),
'subscriptions', (
SELECT jsonb_build_object(
'failed_webhooks', COUNT(*) FILTER (WHERE processing_status = 'failed'),
'pending_webhooks', COUNT(*) FILTER (WHERE processing_status = 'pending')
) FROM seal_webhook_inbox
),
'payments', (
SELECT jsonb_build_object(
'dunning_active', COUNT(*),
'day7_plus', COUNT(*) FILTER (
WHERE dunning_started_at < now() - interval '7 days'),
'day10_plus', COUNT(*) FILTER (
WHERE dunning_started_at < now() - interval '10 days')
) FROM subscriptions
WHERE payment_status IN ('failing', 'failed') AND status = 'ACTIVE'
),
'deliveries', (
SELECT jsonb_build_object(
'total_7d', COUNT(*),
'exceptions_7d', COUNT(*) FILTER (
WHERE status IN ('EXCEPTION', 'FAILED', 'RETURNED')),
'exception_rate_pct', CASE WHEN COUNT(*) > 0
THEN ROUND(COUNT(*) FILTER (
WHERE status IN ('EXCEPTION', 'FAILED', 'RETURNED'))
* 100.0 / COUNT(*), 1)
ELSE 0 END
) FROM shipments
WHERE created_at > now() - interval '7 days'
),
'support', (
SELECT jsonb_build_object(
'unassigned_total', COUNT(*) FILTER (
WHERE assigned_to IS NULL AND status NOT IN ('resolved', 'closed')),
'unassigned_over_1h', COUNT(*) FILTER (
WHERE assigned_to IS NULL AND status NOT IN ('resolved', 'closed')
AND created_at < now() - interval '1 hour'),
'open_total', COUNT(*) FILTER (
WHERE status NOT IN ('resolved', 'closed'))
) FROM support_tickets
),
'referrals', (
SELECT jsonb_build_object(
'fraud_flagged', COUNT(*) FILTER (WHERE status = 'fraud_flagged'),
'oldest_flagged_days', COALESCE(
EXTRACT(DAY FROM (now() - MIN(created_at)
FILTER (WHERE status = 'fraud_flagged')))::int, 0)
) FROM referrals
),
'proof_pages', (
SELECT jsonb_build_object(
'last_check_at', MAX(run_at),
'last_status', (
SELECT status FROM monitoring_runs
WHERE check_name = 'proof_portal_health'
ORDER BY run_at DESC LIMIT 1),
'minutes_since_check', COALESCE(
EXTRACT(EPOCH FROM (now() - MAX(run_at)))::int / 60, 999)
) FROM monitoring_runs
WHERE check_name = 'proof_portal_health'
),
'background_jobs', raw_ops.fn_get_cron_health_summary_v1()
) INTO v_result;
RETURN v_result;
END;
$$;
Cron Health from Registry (not pattern matching)¶
v1.0 used LIKE '%* * * *%' on cron schedule strings. That is unreliable. v1.1 uses an explicit registry of critical jobs with expected maximum silence thresholds.
CREATE TABLE IF NOT EXISTS raw_ops.critical_cron_registry (
job_name TEXT PRIMARY KEY,
display_name TEXT NOT NULL,
max_silence_minutes INTEGER NOT NULL,
category TEXT NOT NULL DEFAULT 'operations',
created_at TIMESTAMPTZ DEFAULT now()
);
INSERT INTO raw_ops.critical_cron_registry (job_name, display_name, max_silence_minutes, category)
VALUES
('process-outbox-every-minute', 'Email sender', 5, 'email'),
('process-lifecycle-events', 'Lifecycle events', 5, 'email'),
('monitor-outbox-health', 'Outbox health check', 10, 'monitoring'),
('monitor-order-ingestion-slo', 'Order SLO check', 10, 'monitoring'),
('monitor-heartbeat', 'Monitoring heartbeat', 20, 'monitoring'),
('monitor-webhook-health', 'Webhook health check', 15, 'monitoring'),
('monitor-stock-levels', 'Stock level check', 20, 'inventory'),
('monitor-lab-ingestion', 'Lab ingestion check', 20, 'lab'),
('dunning-processor', 'Payment dunning', 30, 'subscriptions'),
('canary-system-health', 'Infrastructure canary', 15, 'infrastructure'),
('process-refund-retries', 'Refund retry processor', 5, 'refunds'),
('process-proof-jobs', 'Proof page generator', 5, 'proof')
ON CONFLICT (job_name) DO NOTHING;
CREATE OR REPLACE FUNCTION raw_ops.fn_get_cron_health_summary_v1()
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
BEGIN
RETURN (
SELECT
jsonb_build_object(
'total_critical', COUNT(*),
'healthy', COUNT(*) FILTER (WHERE health = 'healthy'),
'stale', COUNT(*) FILTER (WHERE health = 'stale'),
'critical', COUNT(*) FILTER (WHERE health IN ('failed', 'overdue', 'missing', 'never_run')),
'disabled', COUNT(*) FILTER (WHERE health = 'disabled'),
'jobs', jsonb_agg(job_detail ORDER BY health DESC, display_name)
)
FROM (
SELECT
r.job_name,
r.display_name,
r.max_silence_minutes,
r.category,
j.active,
lr.start_time as last_run,
lr.status as last_status,
CASE
WHEN j.jobid IS NULL THEN 'missing'
WHEN j.active = false THEN 'disabled'
WHEN lr.status = 'failed' THEN 'failed'
WHEN lr.start_time IS NULL THEN 'never_run'
WHEN lr.start_time < now() - (r.max_silence_minutes * 2 || ' minutes')::interval THEN 'overdue'
WHEN lr.start_time < now() - (r.max_silence_minutes || ' minutes')::interval THEN 'stale'
ELSE 'healthy'
END as health,
jsonb_build_object(
'job_name', r.job_name,
'display_name', r.display_name,
'category', r.category,
'active', COALESCE(j.active, false),
'last_run', lr.start_time,
'last_status', lr.status,
'last_error', CASE WHEN lr.status = 'failed'
THEN LEFT(lr.return_message, 200) ELSE NULL END,
'max_silence_minutes', r.max_silence_minutes,
'health', CASE
WHEN j.jobid IS NULL THEN 'missing'
WHEN j.active = false THEN 'disabled'
WHEN lr.status = 'failed' THEN 'failed'
WHEN lr.start_time IS NULL THEN 'never_run'
WHEN lr.start_time < now() - (r.max_silence_minutes * 2 || ' minutes')::interval THEN 'overdue'
WHEN lr.start_time < now() - (r.max_silence_minutes || ' minutes')::interval THEN 'stale'
ELSE 'healthy'
END
) as job_detail
FROM raw_ops.critical_cron_registry r
LEFT JOIN cron.job j ON j.jobname = r.job_name
LEFT JOIN LATERAL (
SELECT * FROM cron.job_run_details jrd
WHERE jrd.jobid = j.jobid
ORDER BY jrd.start_time DESC LIMIT 1
) lr ON true
) health_check
);
END;
$$;
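The health classification in the function above reduces to one rule per job. This JS restatement is for clarity only, not part of the portal code:

```javascript
// Registry staleness rule: stale past max_silence_minutes, overdue past 2x,
// with missing/disabled/failed/never_run checked first (mirrors the SQL CASE).
function cronHealth({ registered = true, active = true, lastStatus = null,
                      minutesSinceRun = null, maxSilence }) {
  if (!registered) return 'missing';
  if (!active) return 'disabled';
  if (lastStatus === 'failed') return 'failed';
  if (minutesSinceRun === null) return 'never_run';
  if (minutesSinceRun > maxSilence * 2) return 'overdue';
  if (minutesSinceRun > maxSilence) return 'stale';
  return 'healthy';
}
```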
UI Specification¶
Auto-refresh: Every 60 seconds. Manual refresh button.
Tile interaction: Clicking any amber/red tile scrolls to and expands the relevant Exception Queue panel.
Tile layout: 4 columns desktop, 2 tablet, 1 mobile. Sort: red first, amber second, green last.
Colour tokens: Green #2D5144 on #E8F5E9, Amber #F57F17 on #FFF8E1, Red #C62828 on #FFEBEE.
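The tile sort rule can be sketched as follows; the tile object shape is assumed, with name order as an arbitrary tie-break within a band:

```javascript
// Red first, amber second, green last (per the UI spec above).
const SEVERITY = { red: 0, amber: 1, green: 2 };
function sortTiles(tiles) {
  return [...tiles].sort(
    (a, b) => SEVERITY[a.status] - SEVERITY[b.status] || a.name.localeCompare(b.name)
  );
}
```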
Part 3: Exception Queue¶
Design Principles¶
- Every action has an explicit execution path. "What picks it up, when, and what happens if it doesn't" is documented for every state-flip.
- No bulk "retry all" buttons. Operator selects specific items or uses capped batches with confirmation.
- Human-first labels. UUIDs hidden by default. Error types shown as plain English. Technical detail in expandable panel.
- Confirmation dialogs on every mutation with downstream impact description.
3.1 Lab Quarantine Viewer¶
Source: raw_ops.lab_email_failures WHERE status = 'pending_review'
Columns (human labels):
| Display Label | Source |
|---|---|
| Received | created_at (formatted: "2 hours ago") |
| Email Subject | email_subject |
| Problem | Human label from error_type (see table below) |
| Batch (if known) | Extracted from email subject or batch_code field |
Error type → human label:
| error_type | Display | Guidance Text |
|---|---|---|
| batch_not_found | Batch not created yet | Create the batch in the Batches tab first, then retry. |
| batch_not_in_qa_hold | Batch already processed | This batch was already released or rejected. Check batch status. |
| ai_parsing_failed | Could not read the PDF | Open the PDF manually and enter results using the form below. |
| incomplete_parse | PDF partially readable | Some data was missing. Open the PDF and enter results manually. |
| no_pdf_attachment | No PDF in email | This email matched the lab filter but had no PDF attached. Dismiss if not a lab result. |
| storage_upload_failed | System error | Escalate to founder. This is an infrastructure issue. |
| db_write_failed | System error | Escalate to founder. This is an infrastructure issue. |
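The mapping above can live as a small lookup in the portal JS. The labels are copied from the table; the helper name and shape are assumptions, and RETRYABLE mirrors the "Retry" enablement rule described in this section.

```javascript
// Lookup mirroring the error-type table above (labels copied from the spec).
const ERROR_LABELS = {
  batch_not_found: 'Batch not created yet',
  batch_not_in_qa_hold: 'Batch already processed',
  ai_parsing_failed: 'Could not read the PDF',
  incomplete_parse: 'PDF partially readable',
  no_pdf_attachment: 'No PDF in email',
  storage_upload_failed: 'System error',
  db_write_failed: 'System error',
};
// Only these error types get an enabled "Retry" button.
const RETRYABLE = new Set(['batch_not_found', 'ai_parsing_failed', 'incomplete_parse']);
function describeFailure(errorType) {
  return {
    label: ERROR_LABELS[errorType] ?? errorType,
    canRetry: RETRYABLE.has(errorType),
  };
}
```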
Actions:
"View PDF" — Opens PDF from Supabase Storage. No mutation.
"Retry" — Re-queues for processing. Enabled only for batch_not_found, ai_parsing_failed, and incomplete_parse.
Execution path: Sets status to retry_queued. The lab-email-ingestion Edge Function's 30-minute fallback poll picks up retry_queued items and reprocesses them. If the Edge Function is not running, the lab_ingestion_health monitor will fire a warning within its 20-minute silence threshold (per the cron registry).
CREATE OR REPLACE FUNCTION raw_ops.fn_retry_lab_quarantine_v1(
p_failure_id UUID,
p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
v_failure RECORD;
BEGIN
SELECT * INTO v_failure
FROM lab_email_failures
WHERE id = p_failure_id AND status = 'pending_review';
IF v_failure IS NULL THEN
RETURN jsonb_build_object('success', false, 'error', 'Item not found or not pending review');
END IF;
IF v_failure.error_type IN ('storage_upload_failed', 'db_write_failed') THEN
RETURN jsonb_build_object('success', false,
'error', 'This is an infrastructure error. Please escalate to the founder.');
END IF;
-- Only allow retry for error types where retry has a reasonable chance of success
IF v_failure.error_type NOT IN ('batch_not_found', 'ai_parsing_failed', 'incomplete_parse') THEN
RETURN jsonb_build_object('success', false,
'error', 'This error type (' || v_failure.error_type || ') cannot be resolved by retry. '
|| 'Use "Enter Results Manually" or "Dismiss" instead.');
END IF;
UPDATE lab_email_failures
SET status = 'retry_queued', updated_at = now()
WHERE id = p_failure_id;
INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
VALUES ('LAB_QUARANTINE', p_failure_id, 'INFO', 'Quarantined lab email queued for retry',
jsonb_build_object('agent_id', p_agent_id, 'error_type', v_failure.error_type));
RETURN jsonb_build_object('success', true, 'failure_id', p_failure_id);
END;
$$;
"Enter Results Manually" — Opens manual lab result form. See Part 4.
"Dismiss" — For non-lab emails. Requires a reason.
CREATE OR REPLACE FUNCTION raw_ops.fn_dismiss_lab_quarantine_v1(
p_failure_id UUID,
p_reason TEXT,
p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
BEGIN
UPDATE lab_email_failures
SET status = 'dismissed', resolved_at = now(),
resolved_by = p_agent_id::text,
resolution_notes = p_reason
WHERE id = p_failure_id AND status = 'pending_review';
IF NOT FOUND THEN
RETURN jsonb_build_object('success', false, 'error', 'Item not found or not pending');
END IF;
INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
VALUES ('LAB_QUARANTINE', p_failure_id, 'INFO', 'Quarantined email dismissed',
jsonb_build_object('agent_id', p_agent_id, 'reason', p_reason));
RETURN jsonb_build_object('success', true);
END;
$$;
3.2 Dead-Lettered Refunds¶
Source: raw_ops.refunds WHERE dead_lettered_at IS NOT NULL
Columns: Customer (email), Order (order number), Amount, Type, Failure Reason, When
Expandable detail: Shopify error message, retry count, refund ID (for Shopify Admin lookup).
Actions:
"Send again" — Resets refund to pending for automatic retry.
Execution path: Sets status to pending, clears dead-letter fields. The on_refund_status_change database trigger fires, which calls process-refund Edge Function via pg_net within seconds. If the Edge Function is down, the refund stays in pending and the refund_health monitor fires within 15 minutes.
CREATE OR REPLACE FUNCTION raw_ops.fn_retry_dead_lettered_refund_v1(
p_refund_id UUID,
p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
v_refund RECORD;
BEGIN
SELECT * INTO v_refund FROM refunds WHERE id = p_refund_id;
IF v_refund IS NULL OR v_refund.dead_lettered_at IS NULL THEN
RETURN jsonb_build_object('success', false, 'error', 'Refund not found or not dead-lettered');
END IF;
UPDATE refunds
SET status = 'pending',
dead_lettered_at = NULL,
dead_letter_reason = NULL,
retry_count = 0,
next_retry_at = NULL,
shopify_error = NULL
WHERE id = p_refund_id;
INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
VALUES ('REFUND', p_refund_id, 'INFO', 'Dead-lettered refund retried via portal',
jsonb_build_object('agent_id', p_agent_id, 'amount_gbp', v_refund.amount_gbp));
RETURN jsonb_build_object('success', true, 'refund_id', p_refund_id);
END;
$$;
"Record manual refund" — For cases where the Ops Lead processed the refund directly in Shopify Admin.
Form: Shopify refund ID (from Shopify Admin), confirmation checkbox ("I have confirmed this refund was processed in Shopify").
CREATE OR REPLACE FUNCTION raw_ops.fn_resolve_refund_manually_v1(
p_refund_id UUID,
p_shopify_refund_id TEXT,
p_notes TEXT,
p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
BEGIN
UPDATE refunds
SET status = 'processed',
dead_lettered_at = NULL,
dead_letter_reason = NULL,
shopify_refund_id = p_shopify_refund_id,
internal_notes = COALESCE(internal_notes, '') ||
E'\n[Manual: ' || p_notes || ' | agent: ' || p_agent_id || ' | ' || now()::text || ']'
WHERE id = p_refund_id;
IF NOT FOUND THEN
RETURN jsonb_build_object('success', false, 'error', 'Refund not found');
END IF;
INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
VALUES ('REFUND', p_refund_id, 'INFO', 'Refund resolved manually via portal',
jsonb_build_object('agent_id', p_agent_id, 'shopify_refund_id', p_shopify_refund_id));
RETURN jsonb_build_object('success', true);
END;
$$;
3.3 Failed Webhooks (Safe Retry)¶
Replaces v1.0's "Retry All Failed" with safe operator controls.
Source: webhook_inbox WHERE processing_status = 'failed' and seal_webhook_inbox WHERE processing_status = 'failed'
Two sub-panels: Shopify Webhooks, Seal Webhooks. Same UI pattern for both.
Columns: Received (relative time), Type (webhook topic, human-labelled), Error (first 100 chars), Attempts
UI Controls:
- Summary bar: "12 failed webhooks. 8 are the same error: Connection timeout."
- Error grouping: Group by error message. Show count per group.
- Select: Checkbox per item. "Select all in this group" shortcut.
- "Retry selected" button: Max 25 per action. Confirmation: "Retry [N] failed webhooks? If the underlying issue is still present, these will fail again."
- "Retry next 10" button: Takes the 10 oldest failed items only.
CREATE OR REPLACE FUNCTION raw_ops.fn_retry_selected_webhooks_v1(
p_webhook_ids UUID[],
p_agent_id UUID,
p_source TEXT DEFAULT 'shopify' -- 'shopify' or 'seal'
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
v_count INTEGER;
v_capped_ids UUID[];
BEGIN
-- Validate source
IF p_source NOT IN ('shopify', 'seal') THEN
RETURN jsonb_build_object('success', false,
'error', 'Invalid source: ' || p_source || '. Must be shopify or seal.');
END IF;
-- Hard cap at 25
v_capped_ids := p_webhook_ids[1:25];
IF p_source = 'shopify' THEN
UPDATE webhook_inbox
SET processing_status = 'pending', retry_count = retry_count + 1
WHERE id = ANY(v_capped_ids) AND processing_status = 'failed' AND retry_count < 10;
ELSIF p_source = 'seal' THEN
UPDATE seal_webhook_inbox
SET processing_status = 'pending', retry_count = retry_count + 1
WHERE id = ANY(v_capped_ids) AND processing_status = 'failed' AND retry_count < 10;
END IF;
GET DIAGNOSTICS v_count = ROW_COUNT;
INSERT INTO ops_events (entity_type, kind, message, meta)
VALUES ('WEBHOOK', 'INFO', v_count || ' ' || p_source || ' webhooks queued for retry',
jsonb_build_object('agent_id', p_agent_id, 'count', v_count,
'source', p_source, 'requested', array_length(p_webhook_ids, 1)));
RETURN jsonb_build_object('success', true, 'retried', v_count,
'capped', array_length(p_webhook_ids, 1) > 25);
END;
$$;
Execution path: Setting processing_status = 'pending' causes the webhook handler (triggered by the existing webhook processing cron or real-time trigger) to reprocess them. The webhook_health monitor checks every 5 minutes.
3.4 Email Delivery Recovery¶
Source: email_outbox WHERE status = 'failed'
Note: v1.0 had a mismatch between source (stuck pending emails) and function (retried failed emails). v1.1 fixes this.
Panel shows two metrics:
- Failed emails (status = 'failed'): can be retried.
- Stuck emails (status = 'pending', age > 30 minutes): cannot be retried (they are already pending). Indicates the email worker itself is stalled. Guidance: "Check the Background Jobs tile. If the email sender job is stale or failed, use 'Run email sender now' in Worker Controls."
Action: "Retry failed emails" (max 25)
CREATE OR REPLACE FUNCTION raw_ops.fn_retry_failed_outbox_v1(
p_limit INTEGER DEFAULT 25,
p_agent_id UUID DEFAULT NULL
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
v_count INTEGER;
BEGIN
WITH retried AS (
UPDATE email_outbox
SET status = 'pending', updated_at = now(),
retry_count = COALESCE(retry_count, 0) + 1
WHERE id IN (
SELECT id FROM email_outbox
WHERE status = 'failed' AND COALESCE(retry_count, 0) < 5
ORDER BY created_at ASC
LIMIT p_limit
)
RETURNING id
)
SELECT COUNT(*) INTO v_count FROM retried;
INSERT INTO ops_events (entity_type, kind, message, meta)
VALUES ('EMAIL_OUTBOX', 'INFO', v_count || ' failed emails queued for retry',
jsonb_build_object('agent_id', p_agent_id, 'count', v_count));
RETURN jsonb_build_object('success', true, 'retried', v_count);
END;
$$;
Execution path: Emails set to pending are picked up by the process-outbox cron job (runs every minute). If that job is stalled, the outbox health monitor fires within 2 minutes.
3.5 Inventory Adjustment Form¶
Unchanged from v1.0 except: uses agent_id from auth, not free-text.
Calls existing raw_ops.execute_manual_inventory_adjustment() with agent_id resolved from JWT.
Confirmation dialog: "You are about to adjust batch [CODE] by [±X] kg. Current stock: [Y] kg. After adjustment: [Z] kg. Reason: [reason]. This cannot be undone. Continue?"
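The confirmation copy can be assembled from the form state. This is a sketch; the helper name and input shape are assumptions.

```javascript
// Builds the confirmation text above; values come from the adjustment form.
function adjustmentConfirmMessage({ code, delta, current, reason }) {
  const sign = delta >= 0 ? '+' : '';
  return `You are about to adjust batch ${code} by ${sign}${delta} kg. ` +
    `Current stock: ${current} kg. After adjustment: ${current + delta} kg. ` +
    `Reason: ${reason}. This cannot be undone. Continue?`;
}
```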
3.6 Referral Fraud Review Panel¶
Unchanged from v1.0 except: uses agent_id from auth.
Calls existing raw_ops.manual_override_referral() with agent_id.
3.7 Dunning Pipeline Viewer¶
Unchanged from v1.0 except: uses agent_id from auth, and Day 10+ rows show warning: "This subscription will auto-pause tomorrow. Consider calling the customer directly."
3.8 Stuck Proof Jobs¶
Source: raw_ops.proof_jobs WHERE state IN ('queued', 'failed') AND created_at < now() - interval '5 minutes'
When: Batch created but no QR code or proof page generated. The proof_system_health monitor fires.
Columns: Batch Code, Created, State, Attempts, Last Error
Action: "Retry" — Resets proof job for reprocessing.
Execution path: Sets state to queued and resets attempts. The process-proof-jobs Edge Function (invoked by pg_cron every minute) picks up queued jobs. If the proof worker itself is stalled, the Ops Lead should also use "Run proof generator now" in Worker Controls.
CREATE OR REPLACE FUNCTION raw_ops.fn_retry_stuck_proof_job_v1(
p_batch_id UUID,
p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
v_job_id UUID;
BEGIN
-- Reset existing stuck job
UPDATE proof_jobs
SET state = 'queued', attempts = 0, last_error = NULL
WHERE batch_id = p_batch_id AND state IN ('queued', 'failed')
RETURNING id INTO v_job_id;
IF v_job_id IS NULL THEN
-- Create new job if none exists
INSERT INTO proof_jobs (batch_id, job_key, state)
VALUES (p_batch_id, 'proof:batch:' || p_batch_id, 'queued')
RETURNING id INTO v_job_id;
END IF;
INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
VALUES ('PROOF_JOB', v_job_id, 'INFO', 'Proof job retried via portal',
jsonb_build_object('agent_id', p_agent_id, 'batch_id', p_batch_id));
RETURN jsonb_build_object('success', true, 'proof_job_id', v_job_id);
END;
$$;
Part 4: Manual Lab Result Entry (Redesigned)¶
Problem with v1.0¶
v1.0 captured a single Enterobacteriaceae value and declared PASS if < 5000. The actual lab certificate contains an official verdict and sub-sample data. A single-value check can produce a false PASS.
v1.2 Design: Lab Verdict First¶
The Ops Lead enters what the lab certificate says. The system records it. The lab's official verdict governs batch release, not a portal-side calculation.
Form Fields:
| Field | Type | Required | Notes |
|---|---|---|---|
| Batch | Dropdown (QA_HOLD batches only) | Yes | |
| Lab Name | Text (default: i2 Analytical) | Yes | |
| Certificate Date | Date picker | Yes | |
| Certificate Reference | Text | No | |
| Lab Verdict | Dropdown: PASS / FAIL | Yes | As stated on the certificate |
| Salmonella (25g) | Dropdown: Not Detected / Detected | Yes | |
| Listeria (25g) | Dropdown: Not Detected / Detected | Yes | |
| Enterobacteriaceae | Number (cfu/g, official value) | Yes | |
| Lab Certificate PDF | File upload | Yes | Stored in Supabase Storage. Required for both PASS and FAIL. |
| Linked Quarantine Item | Auto-populated if opened from quarantine | Auto | |
Safety Guards:
- Lab verdict is the governing field. However, a PASS verdict is hard-blocked if pathogens are detected. If the operator selects PASS but enters Salmonella = Detected or Listeria = Detected, the form refuses submission: "Cannot submit PASS with a detected pathogen. Check the certificate. If the lab certificate genuinely says PASS despite these values, escalate to the founder." This is a safety invariant, not a UX choice.
- One result per batch. Idempotency guard prevents double-submit and prevents manual entry when an automated result already exists:
-- One lab result per batch, period. Prevents both double-manual and manual-after-automated.
CREATE UNIQUE INDEX IF NOT EXISTS idx_lab_results_one_per_batch
ON raw_ops.lab_results (batch_id);
- Batch must be in QA_HOLD. Function rejects if batch is already released/rejected.
- PDF required for both PASS and FAIL. Failed-batch evidence is as important as pass evidence for compliance.
CREATE OR REPLACE FUNCTION raw_ops.fn_manual_lab_result_entry_v1(
p_batch_code TEXT,
p_lab_name TEXT,
p_reported_at DATE,
p_certificate_ref TEXT,
p_lab_verdict TEXT, -- 'PASS' or 'FAIL' (from certificate)
p_salmonella TEXT, -- 'absent' or 'detected'
p_listeria TEXT, -- 'absent' or 'detected'
p_enterobacteriaceae NUMERIC,
p_certificate_pdf_path TEXT,
p_agent_id UUID,
p_quarantine_id UUID DEFAULT NULL
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
v_batch RECORD;
v_lab_result_id UUID;
BEGIN
-- Validate batch exists and is in QA_HOLD
SELECT * INTO v_batch FROM batches WHERE batch_code = p_batch_code;
IF v_batch IS NULL THEN
RETURN jsonb_build_object('success', false, 'error', 'Batch not found');
END IF;
IF v_batch.status != 'QA_HOLD' THEN
RETURN jsonb_build_object('success', false,
'error', 'Batch is ' || v_batch.status || ', not QA_HOLD. Cannot enter results.');
END IF;
-- Hard-block: PASS with detected pathogen is not allowed
IF p_lab_verdict = 'PASS' AND (p_salmonella = 'detected' OR p_listeria = 'detected') THEN
RETURN jsonb_build_object('success', false,
'error', 'Cannot record PASS with a detected pathogen. '
|| 'Check the certificate. If the lab certificate genuinely says PASS despite these values, '
|| 'escalate to the founder for manual review.');
END IF;
-- Insert lab result. The on_lab_result_insert trigger handles batch release/rejection.
-- Lab verdict governs the outcome, not portal-side calculation.
INSERT INTO lab_results (
batch_id, lab_name, reported_at, certificate_reference,
salmonella_result, listeria_result, enterobacteriaceae_cfu_g,
outcome, entered_by, entry_method, certificate_pdf_path
-- source_email_id is left NULL for manual entries; idempotency is enforced
-- by the unique index on batch_id alone (see Safety Guards above).
) VALUES (
v_batch.id, p_lab_name, p_reported_at, p_certificate_ref,
p_salmonella, p_listeria, p_enterobacteriaceae,
p_lab_verdict, p_agent_id::text, 'manual_portal', p_certificate_pdf_path
)
RETURNING id INTO v_lab_result_id;
-- Mark quarantine item as resolved if linked
IF p_quarantine_id IS NOT NULL THEN
UPDATE lab_email_failures
SET status = 'resolved_manual', resolved_at = now(),
resolved_by = p_agent_id::text
WHERE id = p_quarantine_id;
END IF;
INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
VALUES ('LAB_RESULT', v_lab_result_id, 'INFO', 'Lab result entered manually via portal',
jsonb_build_object(
'agent_id', p_agent_id, 'batch_code', p_batch_code,
'lab_verdict', p_lab_verdict
));
RETURN jsonb_build_object(
'success', true,
'lab_result_id', v_lab_result_id,
'batch_code', p_batch_code,
'lab_verdict', p_lab_verdict,
'batch_new_status', CASE WHEN p_lab_verdict = 'PASS' THEN 'RELEASED' ELSE 'REJECTED' END
);
EXCEPTION
WHEN unique_violation THEN
RETURN jsonb_build_object('success', false,
'error', 'A lab result has already been entered for this batch. Check the Batches tab.');
END;
$$;
Execution path: The INSERT INTO lab_results fires the existing on_lab_result_insert trigger, which handles batch status change to RELEASED or REJECTED, order allocation (if PASS), proof portal update, and partner notification. This is the same path as automated lab processing. The manual form is just a different entry point into the same pipeline.
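For the Phase 1c functional test with mock data, a direct call against a seeded QA_HOLD batch might look like the following sketch; the batch code, certificate reference, storage path, and agent UUID are placeholders, not real values:

```sql
-- Illustrative smoke test: all literal values below are placeholders.
SELECT raw_ops.fn_manual_lab_result_entry_v1(
  'PR-2026-014',                 -- batch code (seeded QA_HOLD batch)
  'i2 Analytical',
  '2026-04-10',                  -- certificate date
  'CERT-8841',                   -- certificate reference
  'PASS',                        -- lab verdict as stated on the certificate
  'absent',                      -- Salmonella (25g)
  'absent',                      -- Listeria (25g)
  120,                           -- Enterobacteriaceae cfu/g
  'lab-certs/PR-2026-014.pdf',   -- Supabase Storage path
  '00000000-0000-0000-0000-000000000001'::uuid  -- agent_id
);
```

A second identical call should hit the unique index on batch_id and return the "already been entered" error from the exception handler, which is the cheapest way to verify the idempotency guard end to end.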
Part 5: Worker Controls¶
Purpose¶
When a background worker is stalled, the Ops Lead needs a "run it now" button, not just visibility that it's broken.
Available Controls¶
| Button Label | What It Does | When To Use |
|---|---|---|
| Run email sender now | Invokes process-outbox Edge Function | Emails stuck as pending, email worker cron stale |
| Run proof generator now | Invokes process-proof-jobs Edge Function | Proof jobs stuck as queued |
| Run webhook processor now | Invokes shopify-webhook reprocessor | Webhooks stuck as pending after retry |
| Run dunning processor now | Invokes process-dunning Edge Function | Dunning emails not sending on schedule |
| Run lifecycle processor now | Invokes send-lifecycle-events Edge Function | Lifecycle events not sending on schedule |
Implementation¶
Each button calls an Edge Function via the Ops Portal's authenticated Supabase client. The Edge Functions already exist; the portal just needs a way to invoke them manually.
CREATE OR REPLACE FUNCTION raw_ops.fn_trigger_worker_run_v1(
p_worker TEXT,
p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
DECLARE
v_allowed_workers TEXT[] := ARRAY[
'process-outbox', 'process-proof-jobs', 'process-dunning',
'shopify-webhook', 'send-lifecycle-events'
];
BEGIN
IF NOT (p_worker = ANY(v_allowed_workers)) THEN
RETURN jsonb_build_object('success', false,
'error', 'Worker not in allowed list: ' || p_worker);
END IF;
-- Use pg_net to invoke the Edge Function (same pattern as fn_invoke_monitor)
PERFORM raw_ops.fn_invoke_edge_function(p_worker, '{}'::jsonb);
INSERT INTO ops_events (entity_type, kind, message, meta)
VALUES ('WORKER', 'INFO', 'Worker manually triggered: ' || p_worker,
jsonb_build_object('agent_id', p_agent_id, 'worker', p_worker));
RETURN jsonb_build_object('success', true, 'worker', p_worker, 'invoked_at', now());
END;
$$;
Confirmation dialog: "This will run the [worker name] immediately. Use this when the background job appears stalled. Continue?"
Cooldown: Button disabled for 60 seconds after invocation to prevent double-firing.
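A quick way to exercise both branches of the allowlist check during testing (the agent UUID is a placeholder):

```sql
-- Allowed worker: fires the Edge Function via pg_net and logs to ops_events.
SELECT raw_ops.fn_trigger_worker_run_v1(
  'process-outbox',
  '00000000-0000-0000-0000-000000000001'::uuid  -- agent_id placeholder
);

-- A name outside the allowlist is refused before any pg_net call is made;
-- the function returns {"success": false, "error": "Worker not in allowed list: ..."}.
SELECT raw_ops.fn_trigger_worker_run_v1(
  'some-other-function',
  '00000000-0000-0000-0000-000000000001'::uuid
);
```

Note the refusal happens entirely inside Postgres, so a mistyped worker name costs nothing and leaves no ops_events entry.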
Part 6: Missing-Event Recovery¶
Problem¶
v1.0 handled failed rows. It did not handle events that never arrived.
6.1 Pull Missing Order from Shopify¶
When: A customer says their order went through but it's not in the system. Or the order_ingestion_slo monitor fires but webhook_inbox shows no failed webhooks (meaning the webhook never arrived).
Form: Single field: Shopify Order Number (e.g., #1042)
Implementation: Calls a new Edge Function that fetches the order from Shopify Admin API and replays it through the standard ingestion pipeline.
Portal → ops_replay_shopify_order RPC → pg_net → replay-shopify-order Edge Function
↓
Edge Function:
1. GET /admin/api/2024-01/orders/{order_id}.json (by order number)
2. Transform to webhook payload format
3. Call fn_process_shopify_order_v3() (same function used by live webhooks)
4. Return result
CREATE OR REPLACE FUNCTION raw_ops.fn_request_order_replay_v1(
p_order_number TEXT,
p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
BEGIN
-- Check if order already exists
IF EXISTS (SELECT 1 FROM orders WHERE order_number_label = p_order_number) THEN
RETURN jsonb_build_object('success', false,
'error', 'Order ' || p_order_number || ' already exists in the system.');
END IF;
-- Invoke the replay Edge Function via pg_net
PERFORM raw_ops.fn_invoke_edge_function('replay-shopify-order',
jsonb_build_object('order_number', p_order_number));
INSERT INTO ops_events (entity_type, kind, message, meta)
VALUES ('ORDER', 'INFO', 'Order replay requested: ' || p_order_number,
jsonb_build_object('agent_id', p_agent_id, 'order_number', p_order_number));
RETURN jsonb_build_object('success', true,
'message', 'Order replay requested. Check the order list in 30 seconds.');
END;
$$;
Note: The replay-shopify-order Edge Function is a new function that needs to be built. It is a thin wrapper: fetch from Shopify Admin API, transform, call the existing processing function.
Part 7: Compliance Recording¶
7.1-7.4: Unchanged from v1.0¶
Temperature Audit Form, Proof Portal Spot Check, HACCP Evidence Pack Logger, Lab Sample Submission Log. All unchanged except that each now uses agent_id from auth rather than free-text operator names.
7.5 Rejected Batch Disposition (New)¶
When: A batch fails lab testing (status = REJECTED). Someone must record what happened to the physical stock.
Form Fields:
| Field | Type | Required |
|---|---|---|
| Batch Code | Dropdown (REJECTED batches without disposition) | Yes |
| Disposition Action | Dropdown: Destroyed / Returned to Supplier / Held for Re-test | Yes |
| Quantity Disposed (kg) | Number | Yes |
| Disposal Method | Text (e.g., "Commercial waste contractor") | Yes if Destroyed |
| Witnessed By | Text | Yes if Destroyed |
| Date | Date | Yes |
| Notes | Text | No |
| Evidence (photos/docs) | File upload | No |
CREATE TABLE IF NOT EXISTS raw_ops.batch_dispositions (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
batch_id UUID NOT NULL REFERENCES raw_ops.batches(id),
batch_code TEXT NOT NULL,
disposition_action TEXT NOT NULL
CHECK (disposition_action IN ('destroyed', 'returned_to_supplier', 'held_for_retest')),
quantity_kg NUMERIC NOT NULL,
disposal_method TEXT,
witnessed_by TEXT,
disposition_date DATE NOT NULL,
evidence_paths TEXT[],
notes TEXT,
recorded_by UUID NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE UNIQUE INDEX IF NOT EXISTS idx_batch_disposition_one_per_batch
ON raw_ops.batch_dispositions(batch_id);
7.6 Compliance Insert Functions¶
All compliance forms use the same pattern: validate inputs, insert record, log to ops_events, return success.
-- Temperature Audit
CREATE OR REPLACE FUNCTION raw_ops.fn_record_temperature_audit_v1(
p_audit_date DATE, p_shipment_id UUID, p_logger_serial TEXT,
p_max_temp NUMERIC, p_min_temp NUMERIC, p_duration_above_minus5 INTEGER,
p_evidence_paths TEXT[], p_notes TEXT, p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql SECURITY DEFINER SET search_path = raw_ops, pg_temp AS $$
DECLARE v_pass BOOLEAN; v_id UUID;
BEGIN
v_pass := (p_max_temp <= -5.0) OR (p_max_temp <= 0.0 AND COALESCE(p_duration_above_minus5, 0) < 30);
INSERT INTO temperature_audit_records (audit_date, shipment_id, logger_serial,
max_temp_celsius, min_temp_celsius, duration_above_minus5_mins, pass,
evidence_paths, notes, recorded_by)
VALUES (p_audit_date, p_shipment_id, p_logger_serial, p_max_temp, p_min_temp,
p_duration_above_minus5, v_pass, p_evidence_paths, p_notes, p_agent_id::text)
RETURNING id INTO v_id;
INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
VALUES ('TEMP_AUDIT', v_id, 'INFO', 'Temperature audit recorded',
jsonb_build_object('agent_id', p_agent_id, 'pass', v_pass, 'shipment_id', p_shipment_id));
RETURN jsonb_build_object('success', true, 'id', v_id, 'pass', v_pass);
END; $$;
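The pass rule computed inside fn_record_temperature_audit_v1 (max temp at or below -5 °C, or at or below 0 °C with fewer than 30 minutes above -5 °C) can be exercised with a borderline case like the following; the shipment UUID, logger serial, and agent UUID are placeholders:

```sql
-- A brief warm excursion: max -3.2 °C with 12 minutes above -5 °C.
-- Under the rule above this still computes pass = true, because the
-- excursion stayed below 0 °C and lasted under 30 minutes.
SELECT raw_ops.fn_record_temperature_audit_v1(
  '2026-04-12',
  '00000000-0000-0000-0000-000000000002'::uuid,  -- shipment_id placeholder
  'TL-0421',                                      -- logger serial placeholder
  -3.2,                                           -- max temp (°C)
  -18.0,                                          -- min temp (°C)
  12,                                             -- minutes above -5 °C
  NULL,                                           -- evidence paths
  'Brief door-open spike during transfer',
  '00000000-0000-0000-0000-000000000001'::uuid    -- agent_id placeholder
);
```

The same call with 35 minutes above -5 °C should return pass = false, which is worth including in the Phase 2 mock-data test set.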
-- Proof Portal Spot Check
CREATE OR REPLACE FUNCTION raw_ops.fn_record_proof_spot_check_v1(
p_check_date DATE, p_batch_code TEXT, p_qr_ok BOOLEAN, p_page_ok BOOLEAN,
p_results_ok BOOLEAN, p_pdf_ok BOOLEAN, p_notes TEXT, p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql SECURITY DEFINER SET search_path = raw_ops, pg_temp AS $$
DECLARE v_id UUID;
BEGIN
INSERT INTO proof_portal_spot_checks (check_date, batch_code, qr_scan_ok,
page_loads_ok, results_display_ok, pdf_download_ok, notes, checked_by)
VALUES (p_check_date, p_batch_code, p_qr_ok, p_page_ok, p_results_ok, p_pdf_ok,
p_notes, p_agent_id::text)
RETURNING id INTO v_id;
INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
VALUES ('PROOF_SPOT_CHECK', v_id, 'INFO', 'Proof portal spot check recorded',
jsonb_build_object('agent_id', p_agent_id, 'batch_code', p_batch_code,
'all_pass', p_qr_ok AND p_page_ok AND p_results_ok AND p_pdf_ok));
RETURN jsonb_build_object('success', true, 'id', v_id);
END; $$;
-- HACCP Evidence Pack
CREATE OR REPLACE FUNCTION raw_ops.fn_record_evidence_pack_v1(
p_ship_date DATE, p_ship_day_type TEXT,
p_pick_list BOOLEAN, p_qc_record BOOLEAN, p_packout_sheet BOOLEAN,
p_photo_proofs BOOLEAN, p_logger_serials BOOLEAN, p_courier_manifest BOOLEAN,
p_file_paths TEXT[], p_notes TEXT, p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql SECURITY DEFINER SET search_path = raw_ops, pg_temp AS $$
DECLARE v_id UUID;
BEGIN
-- Normalize to lowercase to match CHECK constraint
INSERT INTO evidence_packs (ship_date, ship_day_type, pick_list_filed, qc_record_filed,
packout_sheet_filed, photo_proofs_filed, logger_serials_recorded, courier_manifest_filed,
file_paths, notes, recorded_by)
VALUES (p_ship_date, lower(trim(p_ship_day_type)), p_pick_list, p_qc_record, p_packout_sheet,
p_photo_proofs, p_logger_serials, p_courier_manifest, p_file_paths, p_notes, p_agent_id::text)
RETURNING id INTO v_id;
INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
VALUES ('EVIDENCE_PACK', v_id, 'INFO', 'Evidence pack recorded',
jsonb_build_object('agent_id', p_agent_id, 'ship_date', p_ship_date));
RETURN jsonb_build_object('success', true, 'id', v_id);
END; $$;
-- Lab Sample Submission
CREATE OR REPLACE FUNCTION raw_ops.fn_record_sample_submission_v1(
p_batch_code TEXT, p_sample_sent_date DATE, p_lab_name TEXT,
p_tracking_ref TEXT, p_trays_sampled INTEGER, p_notes TEXT, p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql SECURITY DEFINER SET search_path = raw_ops, pg_temp AS $$
DECLARE v_batch_id UUID; v_id UUID; v_batch_status TEXT;
BEGIN
SELECT id, status INTO v_batch_id, v_batch_status FROM batches WHERE batch_code = p_batch_code;
IF v_batch_id IS NULL THEN
RETURN jsonb_build_object('success', false, 'error', 'Batch not found');
END IF;
IF v_batch_status != 'QA_HOLD' THEN
RETURN jsonb_build_object('success', false,
'error', 'Batch is ' || v_batch_status || ', not in testing. Sample submission is only for batches awaiting lab results.');
END IF;
INSERT INTO lab_sample_submissions (batch_id, batch_code, sample_sent_date,
lab_name, tracking_reference, trays_sampled, notes, recorded_by)
VALUES (v_batch_id, p_batch_code, p_sample_sent_date, p_lab_name,
p_tracking_ref, p_trays_sampled, p_notes, p_agent_id::text)
RETURNING id INTO v_id;
INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
VALUES ('LAB_SAMPLE', v_id, 'INFO', 'Lab sample submission recorded',
jsonb_build_object('agent_id', p_agent_id, 'batch_code', p_batch_code));
RETURN jsonb_build_object('success', true, 'id', v_id);
END; $$;
-- Batch Disposition
CREATE OR REPLACE FUNCTION raw_ops.fn_record_batch_disposition_v1(
p_batch_code TEXT, p_disposition_action TEXT, p_quantity_kg NUMERIC,
p_disposal_method TEXT, p_witnessed_by TEXT, p_disposition_date DATE,
p_evidence_paths TEXT[], p_notes TEXT, p_agent_id UUID
)
RETURNS JSONB
LANGUAGE plpgsql SECURITY DEFINER SET search_path = raw_ops, pg_temp AS $$
DECLARE v_batch RECORD; v_id UUID;
BEGIN
SELECT * INTO v_batch FROM batches WHERE batch_code = p_batch_code;
IF v_batch IS NULL THEN
RETURN jsonb_build_object('success', false, 'error', 'Batch not found');
END IF;
IF v_batch.status != 'REJECTED' THEN
RETURN jsonb_build_object('success', false,
'error', 'Only rejected batches can have a disposition record');
END IF;
-- Enforce witness and disposal method for destroyed batches
IF p_disposition_action = 'destroyed' THEN
IF p_disposal_method IS NULL OR trim(p_disposal_method) = '' THEN
RETURN jsonb_build_object('success', false,
'error', 'Disposal method is required when batch is destroyed');
END IF;
IF p_witnessed_by IS NULL OR trim(p_witnessed_by) = '' THEN
RETURN jsonb_build_object('success', false,
'error', 'Witness name is required when batch is destroyed');
END IF;
END IF;
INSERT INTO batch_dispositions (batch_id, batch_code, disposition_action, quantity_kg,
disposal_method, witnessed_by, disposition_date, evidence_paths, notes, recorded_by)
VALUES (v_batch.id, p_batch_code, p_disposition_action, p_quantity_kg,
p_disposal_method, p_witnessed_by, p_disposition_date, p_evidence_paths, p_notes, p_agent_id)
RETURNING id INTO v_id;
INSERT INTO ops_events (entity_type, entity_id, kind, message, meta)
VALUES ('BATCH_DISPOSITION', v_id, 'INFO', 'Batch disposition recorded',
jsonb_build_object('agent_id', p_agent_id, 'batch_code', p_batch_code,
'action', p_disposition_action, 'quantity_kg', p_quantity_kg));
RETURN jsonb_build_object('success', true, 'id', v_id);
END; $$;
Part 8: Canned Diagnostics¶
Unchanged from v1.0 structurally. 10 pre-built read-only views. All wrapped in fn_diagnostic_*_v1() functions returning JSONB arrays, rendered as sortable/filterable tables.
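The spec only fixes the shared shape (read-only, returns a JSONB array for table rendering), so as a sketch, one diagnostic function might look like the following; the function name, view body, and "stalled order" filter are illustrative, not part of the spec:

```sql
-- Hypothetical member of the fn_diagnostic_*_v1 family. Read-only,
-- returns a JSONB array ('[]' when empty) for the portal's table renderer.
CREATE OR REPLACE FUNCTION raw_ops.fn_diagnostic_stalled_orders_v1()
RETURNS JSONB
LANGUAGE sql
STABLE
SECURITY DEFINER
SET search_path = raw_ops, pg_temp
AS $$
  SELECT COALESCE(jsonb_agg(to_jsonb(d) ORDER BY d.created_at), '[]'::jsonb)
  FROM (
    SELECT order_number_label, status, created_at
    FROM orders
    WHERE status = 'pending'                       -- illustrative filter
      AND created_at < now() - interval '1 hour'
  ) d;
$$;
```

Returning '[]' rather than NULL for the empty case keeps the portal-side rendering branchless: every diagnostic is always an array.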
Part 9: UI Language Guidelines¶
Principle¶
The Ops Lead is a compliance-trained operator, not an engineer. Every label, message, and error should be written for someone who has never seen a database.
Rules¶
| Instead of | Write |
|---|---|
| UUID | (hide entirely, show order number or customer email) |
| dead-lettered | stuck (needs attention) |
| retry_count | attempts |
| webhook_topic | event type (with human label: "New Order", "Subscription Updated") |
| processing_status = 'failed' | failed |
| QA_HOLD | Testing in progress |
| RELEASED | Cleared for sale |
| REJECTED | Failed testing |
| DEPLETED | Out of stock |
| fn_retry_* | (never shown, behind button label) |
Error Messages to Operator¶
| System Error | Show to Operator |
|---|---|
| P0001: Batch not found | That batch code wasn't found. Check for typos. |
| unique_violation | This has already been done. Check the relevant tab. |
| P0002: Agent not found | Your account isn't set up yet. Ask the founder to add you. |
| Network error | Connection lost. Check your internet and try again. |
| 500 from Edge Function | Something went wrong on our end. Try once more. If it fails again, escalate to the founder. |
Expandable Technical Detail¶
Every exception item has a "Show details" toggle that reveals: IDs, raw error messages, timestamps, retry counts. This is for when the Ops Lead is on a call with the founder describing an issue. It is collapsed by default.
Part 10: Alert-to-Resolution Mapping¶
Complete mapping of every MON-01 alert to portal action.
| Monitor | Severity | Operator Action | Founder-Only? |
|---|---|---|---|
| order_ingestion_slo | warning/critical | Check Order Pipeline tile → Retry failed webhooks → Pull missing order | If persistent after retry |
| outbox_health | warning/critical | Check Email Delivery tile → Retry failed emails → Run email sender now | If worker repeatedly crashes |
| stock_levels | critical | Inventory Adjustment Form. Call co-packer (Jamie) for reorder | No |
| lab_sla | warning | Check Lab Processing tile → Lab Quarantine Viewer. Check Lab Sample Submission Log | No |
| working_capital | warning | View Batch Testing tile. Review with founder if structural | Yes (if systemic) |
| daily_snapshot | info | Review System Health Dashboard. No action unless red | No |
| allocation_health | warning | Check Inventory tile. Inventory Adjustment if needed | No |
| webhook_health | warning | Retry selected failed webhooks | No |
| webhook_daily_digest | info | Review only | No |
| monitoring_heartbeat | critical | Check Background Jobs. Escalate if heartbeat itself is down | Yes |
| lab_ingestion_health | warning/critical | Lab Quarantine Viewer | If storage_upload_failed |
| courier_watchdog_health | warning | Review in Support Workstation. Call DPD if systemic | No |
| 3pl_export_health | warning | Check Background Jobs. Manual export in Fulfillment tab | No |
| support_health | warning | Assign and resolve tickets in Support Workstation | No |
| dispatch_processing_health | warning | Check Background Jobs + Fulfillment tab | No |
| ad_spend_freshness | warning | Check Make.com dashboard (view-only). Informational | No |
| cost_tracking_health | warning | Check Make.com dashboard (view-only). Informational | No |
| proof_system_health | warning | Stuck Proof Jobs panel (3.8). Run proof generator now (Worker Controls) | If repeated failures |
| proof_portal_health | warning | Run Proof Portal Spot Check. Report to founder if page down | Yes (if site down) |
| lifecycle_events_health | warning | Check Background Jobs. Run lifecycle processor now (Worker Controls) | If repeatedly crashing |
| abandonment_health | warning | Check Background Jobs | No |
| health_data_completeness | info | Informational only | No |
| inventory_planning_health | warning | Review inventory diagnostics. Contact co-packer | No |
| referral_system_health | warning | Referral Fraud Review Panel | No |
| monitor_coverage | info | Informational. Escalate if new monitor needed | Yes |
| canary_system_health | critical | Escalate to founder immediately | Always |
| refund_health | warning | Dead-Lettered Refunds panel | No |
Part 11: Implementation Plan¶
Build Order¶
| Phase | What | Days | Gate |
|---|---|---|---|
| 0 | Auth model | 1 | Agent resolution function, RPC wrapper pattern, portal-side JWT extraction. Test with one sample action. ChatGPT review. |
| 1a | Dashboard data function + cron registry | 1 | All 13 tiles returning correct data. ChatGPT review of tile logic. |
| 1b | Dashboard UI | 1.5 | Tiles rendering, auto-refresh, click-to-expand working. |
| 1c | Lab Quarantine (most complex exception) | 2 | List, retry, dismiss, manual entry with idempotency guard. Full test with mock data. |
| 1d | Refund + Webhook + Outbox exception panels | 2 | Safe retry with caps, confirmation dialogs, execution path verified. |
| 1e | Remaining exceptions (inventory, referral, dunning) | 1.5 | Forms + actions working. |
| 1f | Worker Controls + Order Replay | 1.5 | Run-now buttons, replay Edge Function, cooldown. |
| 1g | Canned Diagnostics | 1 | 10 views rendering as sortable tables. |
| 2 | Compliance Recording Forms | 2 | All 5 forms (temp audit, spot check, evidence pack, sample log, batch disposition). |
| Total | | ~13.5 | |
Review Gates¶
Each phase ends with:
1. Claude Code implementation complete
2. Functional test with mock data
3. ChatGPT review of the implemented module (persona: Daniel)
4. Fix any gaps found
5. Move to next phase
Part 12: RPC Registry (Complete)¶
| Public RPC | Delegates To | Auth | Purpose |
|---|---|---|---|
| ops_get_system_health | fn_get_system_health_dashboard_v1 | Read-only | Dashboard data |
| ops_get_cron_health | fn_get_cron_health_summary_v1 | Read-only | Cron job status |
| ops_diagnostic_* (10 functions) | fn_diagnostic_*_v1 | Read-only | Canned diagnostics |
| ops_retry_lab_quarantine | fn_retry_lab_quarantine_v1 | Agent resolved | Retry quarantined lab email |
| ops_dismiss_lab_quarantine | fn_dismiss_lab_quarantine_v1 | Agent resolved | Dismiss non-lab email |
| ops_manual_lab_result | fn_manual_lab_result_entry_v1 | Agent resolved | Enter lab results manually |
| ops_retry_dead_lettered_refund | fn_retry_dead_lettered_refund_v1 | Agent resolved | Retry stuck refund |
| ops_resolve_refund_manually | fn_resolve_refund_manually_v1 | Agent resolved | Mark refund processed in Shopify |
| ops_retry_selected_webhooks | fn_retry_selected_webhooks_v1 | Agent resolved | Retry selected failed webhooks |
| ops_retry_failed_outbox | fn_retry_failed_outbox_v1 | Agent resolved | Retry failed emails |
| ops_mark_dunning_email_sent | fn_ops_mark_dunning_email_sent_v1 | Agent resolved | Mark dunning email sent |
| ops_retry_proof_job | fn_retry_stuck_proof_job_v1 | Agent resolved | Retry stuck proof job |
| ops_trigger_worker | fn_trigger_worker_run_v1 | Agent resolved | Run background worker now |
| ops_request_order_replay | fn_request_order_replay_v1 | Agent resolved | Pull missing order from Shopify |
| ops_execute_inventory_adjustment | execute_manual_inventory_adjustment | Agent resolved | Manual stock correction |
| ops_override_referral | manual_override_referral | Agent resolved | Approve/reject flagged referral |
| ops_record_temp_audit | fn_record_temperature_audit_v1 | Agent resolved | Temperature audit entry |
| ops_record_spot_check | fn_record_proof_spot_check_v1 | Agent resolved | Proof portal spot check |
| ops_record_evidence_pack | fn_record_evidence_pack_v1 | Agent resolved | HACCP evidence pack |
| ops_record_sample_submission | fn_record_sample_submission_v1 | Agent resolved | Lab sample log |
| ops_record_batch_disposition | fn_record_batch_disposition_v1 | Agent resolved | Rejected batch disposal |
Part 13: ChatGPT Review Prompt (Updated)¶
You are Daniel, joining Protocol Raw as Operations Lead at Month 7.
Background: Agent Onboarding Supervisor at RIA Money Transfers.
Strong process execution and compliance skills. Zero coding ability.
Never used a database, terminal, or IDE.
Your tools:
- Operations Portal (ops.protocolraw.co.uk)
- Slack (#ops-alerts, #ops-urgent, #daily-ops)
- Metabase dashboards (read-only)
- Customer.io dashboard (email campaign monitoring)
- Make.com dashboard (view-only)
- Google Drive (HACCP documentation)
- Phone/email (supplier, lab, 3PL coordination)
Physical operations responsibilities:
- 3PL/DPD coordination
- Lab sample logistics (composite samples to UKAS lab)
- HACCP documentation and FSA audit readiness
- Temperature logger audit programme (weekly, 10 shipments)
- Packaging material supply chain coordination
- Co-packer schedule alignment (with Jamie)
- Proof portal QC spot checks (monthly, 5 batches)
- Evidence pack filing per ship day
- Rejected batch disposal coordination and recording
This is v1.2, with targeted fixes after your review of v1.1 found:
- Auth transport was internally inconsistent (header vs argument)
- Server-side retry eligibility not enforced for lab quarantine
- Invalid webhook source fell through silently
- Cron health severity model did not match tile rule
- Batch disposition did not enforce witness/disposal for destroyed batches
- ship_day_type case mismatch between UI and schema
- Document drift (v1.1 labels remaining in a v1.2 document)
Review v1.2 and confirm whether v1.0 and v1.1 issues are resolved.
Then identify any remaining gaps:
1. Any scenario where you're stuck without calling the founder
(that isn't explicitly classified as founder-only)
2. Any action that assumes technical knowledge
3. Any alert without a portal resolution path
4. Any physical workflow without a recording form
5. Any information shown without agency to act on it
6. Any confusing terminology
7. Any compliance obligation without evidence capture
8. Any execution path that is still implicit or hopeful
9. Any auth bypass or governance hole
Be thorough. Be adversarial. Find every gap.
Version History¶
| Version | Date | Changes | Author |
|---|---|---|---|
| 1.0 | 2026-04-12 | Initial specification | Anton + Claude |
| 1.1 | 2026-04-12 | Complete rewrite: auth model, execution paths, lab entry safety, missing-event recovery, bulk retry safety, honest governing principle, batch disposition, UI language guidelines | Anton + Claude |
| 1.2 | 2026-04-12 | Targeted fixes from ChatGPT review: auth section reworded honestly, dashboard contract completed (support tile, SLO metric), lab idempotency scope fixed, PASS+pathogen hard-blocked, worker allowlist aligned, proof job panel restored, all compliance RPCs completed. Final cleanup: auth transport unified (RPC argument only), server-side retry eligibility enforced, invalid webhook source rejected, cron severity model aligned to tile rules (stale/overdue distinction), batch disposition witness/disposal enforced for destroyed, ship_day_type normalized, all v1.1 drift removed | Anton + Claude |
Document Owner: Protocol Raw Operations Technical Owner: Anton (Founder, Systems Architect) Next Review: 2026-07-12
End of SOP-OPS-01 v1.2