SOP-MON-01: Monitoring & Alerting Architecture v1.0¶
Centralised monitoring system for Protocol Raw operations
Document ID: SOP-MON-01-v1.0
Version: 1.0
Status: ✅ Production Ready
Last Updated: 2026-01-17
Owner: Protocol Raw Operations
Review Date: 2026-04-17
Purpose¶
This SOP defines the centralised monitoring and alerting architecture for Protocol Raw operations. All operational monitors run natively within the Supabase stack, eliminating external dependencies on Make.com for critical alerting.
Key Principles:
- Native Stack: All monitoring runs within Supabase (pg_cron, Edge Functions, PostgreSQL)
- Single Alerter: One Edge Function (ops-alerter) handles all Slack notifications
- Unified Logging: All monitor runs logged to raw_ops.monitoring_runs
- Scale First: Architecture tested for 100,000+ customers
Scope¶
Applies to: All automated operational monitoring and alerting
Components:
- 6 PostgreSQL monitoring functions
- 2 Edge Functions (run-monitor, ops-alerter)
- 6 pg_cron scheduled jobs
- 3 Slack channels (#ops-alerts, #ops-urgent, #daily-ops)
Related SOPs: - SOP-01: Batch Creation & Lab-to-Release (Lab SLA, Working Capital) - SOP-03: Daily Operations (Daily Snapshot) - SOP-05: Email Processing (Outbox Health) - SOP-06: Order Ingestion SLO - SOP-INV-01: Inventory Control (Stock Levels)
System Architecture¶
Process Flow¶
pg_cron (scheduled jobs)
↓
pg_net (async HTTP)
↓
run-monitor (Edge Function)
↓
fn_check_*_v2() (PostgreSQL functions)
↓
monitoring_runs (logging table)
↓
ops-alerter (Edge Function) [if threshold breached]
↓
Slack (#ops-alerts / #ops-urgent / #daily-ops)
Slack Channels¶
| Channel | Purpose | Severity Levels |
|---|---|---|
| #ops-alerts | Standard operational alerts | warning |
| #ops-urgent | Critical issues requiring immediate action | critical |
| #daily-ops | Daily digest and info notifications | info |
Monitors¶
1. Order Ingestion SLO¶
Purpose: Detects orders received via webhook but not ingested within SLO threshold
Function: raw_ops.fn_check_order_ingestion_slo_v2()
Schedule: Every 2 minutes (*/2 * * * *)
Channel: #ops-alerts (WARN) / #ops-urgent (PAGE)
Thresholds: | Age | Severity | Action | |-----|----------|--------| | 5-15 minutes | WARN | Alert to #ops-alerts | | >15 minutes | PAGE | Alert to #ops-urgent |
Alert Fields: - Orders Affected (count) - Oldest (minutes) - Severity (WARN/PAGE) - Order IDs (Shopify IDs)
Source Logic: Compares ops_events (webhook received) against orders table (ingested) within 20-minute lookback window.
2. Outbox Health¶
Purpose: Monitors email outbox queue for stuck or backed-up emails
Function: raw_ops.fn_check_outbox_health_v2()
Schedule: Every 2 minutes (*/2 * * * *)
Channel: #ops-alerts (warning) / #ops-urgent (critical)
Thresholds: | Condition | Severity | |-----------|----------| | stuck_emails > 10 OR pending_emails > 50 | warning | | stuck_emails > 50 OR pending_emails > 200 | critical |
Alert Fields: - Stuck Emails (>30 min old) - Pending Emails (total) - Failed Today (count) - Oldest Pending (minutes)
Source: raw_ops.v_outbox_health view
3. Stock Levels¶
Purpose: Alerts when product inventory falls below reorder threshold
Function: raw_ops.fn_check_stock_levels_v2()
Schedule: Every 15 minutes (*/15 * * * *)
Channel: #ops-urgent (always critical)
Thresholds: - Any product below configured threshold triggers alert
Alert Fields: - SKU - Units Available - kg Available - Threshold Units
Source: public.get_low_stock_alerts() function
Action Required: Contact co-packer to schedule production run
4. Lab SLA¶
Purpose: Detects batches stuck in QA_HOLD beyond 4-day SLA
Function: raw_ops.fn_check_lab_sla_v2()
Schedule: Every 15 minutes (*/15 * * * *)
Channel: #ops-alerts (warning)
Thresholds: - Any batch in QA_HOLD > 4 days (96 hours) triggers alert
Alert Fields: - Lab Name - SLA Breaches (count) - Avg Turnaround (hours) - P95 Turnaround (hours) - Compliance %
Source: raw_ops.v_lab_sla_tracking view
Action Required: Contact laboratory immediately
5. Daily Snapshot¶
Purpose: Provides daily operational digest at 09:00 UTC
Function: raw_ops.fn_daily_snapshot_v2()
Schedule: Daily at 09:00 UTC (0 9 * * *)
Channel: #daily-ops (always sends)
Health Status Logic: | Condition | Indicator | |-----------|-----------| | exceptions_logged_24h > 0 | 🔴 | | batches_qa_hold_current > 5 | 🟡 | | Otherwise | 🟢 |
Alert Fields: - 📦 Orders Processed - 💰 Revenue (£) - 🚚 Shipments Dispatched - ⚠ï¸ Exceptions - 🧪 QA Hold (batches) - ⚡ Query Time (ms)
Source: public.get_daily_snapshot() function
6. Working Capital¶
Purpose: Monitors cash tied up in QA_HOLD inventory
Function: raw_ops.fn_check_working_capital_v2()
Schedule: Every 6 hours (0 */6 * * *)
Channel: #ops-alerts (warning) / #ops-urgent (critical)
Thresholds: | Condition | Severity | |-----------|----------| | cash_tied_up > £400k OR oldest_batch > 5 days | warning | | cash_tied_up > £600k OR oldest_batch > 7 days | critical |
Alert Fields: - Cash Tied Up (£) - Batches in Hold (count) - Total kg Held - Oldest Batch (days) - Avg Hold Time (days)
Source: raw_ops.v_working_capital_metrics view
Action Required: Review lab backlog, expedite testing
Database Schema¶
monitoring_runs¶
Central logging table for all monitor executions.
CREATE TABLE raw_ops.monitoring_runs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
check_name TEXT NOT NULL,
run_at TIMESTAMPTZ DEFAULT now(),
duration_ms INTEGER,
status TEXT DEFAULT 'running', -- running, success, error
result_summary TEXT,
result_json JSONB,
alert_triggered BOOLEAN DEFAULT false,
error_message TEXT
);
CREATE INDEX idx_monitoring_runs_check_name ON raw_ops.monitoring_runs(check_name);
CREATE INDEX idx_monitoring_runs_run_at ON raw_ops.monitoring_runs(run_at DESC);
Useful Queries¶
Last 24 hours summary:
SELECT
check_name,
COUNT(*) as runs_24h,
COUNT(*) FILTER (WHERE status = 'error') as errors,
COUNT(*) FILTER (WHERE alert_triggered) as alerts_sent,
ROUND(AVG(duration_ms), 1) as avg_ms
FROM raw_ops.monitoring_runs
WHERE run_at > now() - interval '24 hours'
GROUP BY check_name
ORDER BY check_name;
Recent failures:
SELECT check_name, run_at, error_message
FROM raw_ops.monitoring_runs
WHERE status = 'error'
AND run_at > now() - interval '24 hours'
ORDER BY run_at DESC;
Alert history:
SELECT check_name, run_at, result_summary
FROM raw_ops.monitoring_runs
WHERE alert_triggered = true
AND run_at > now() - interval '7 days'
ORDER BY run_at DESC;
Edge Functions¶
run-monitor¶
Location: supabase/functions/run-monitor/index.ts
Purpose: Generic monitor runner that routes to specific check functions
Endpoint: POST /functions/v1/run-monitor
Request Body:
Available Checks:
- order_ingestion_slo
- outbox_health
- stock_levels
- lab_sla
- daily_snapshot
- working_capital
Response:
{
"success": true,
"run_id": "uuid",
"check_name": "order_ingestion_slo",
"violation_count": 0,
"duration_ms": 5,
"alert_sent": false,
"alert_config": null
}
ops-alerter¶
Location: supabase/functions/ops-alerter/index.ts
Purpose: Sends formatted alerts to Slack via Kai bot
Endpoint: POST /functions/v1/ops-alerter
JWT Verification: Disabled (internal calls only)
Request Body:
{
"channel": "ops-alerts",
"severity": "warning",
"title": "Alert Title",
"message": "Alert description",
"fields": [
{"label": "Field 1", "value": "Value 1", "inline": true}
],
"source_check": "order_ingestion_slo"
}
Channels:
- ops-alerts → #ops-alerts
- ops-urgent → #ops-urgent
- daily-ops → #daily-ops
pg_cron Jobs¶
View All Monitor Jobs¶
SELECT jobid, jobname, schedule, active
FROM cron.job
WHERE jobname LIKE 'monitor-%'
ORDER BY jobname;
Job Management¶
Disable a job:
Enable a job:
Delete a job:
Reschedule a job:
-- First delete, then recreate with new schedule
SELECT cron.unschedule('monitor-stock-levels');
SELECT cron.schedule(
'monitor-stock-levels',
'*/30 * * * *', -- New schedule: every 30 minutes
$$ ... $$
);
Adding a New Monitor¶
Step 1: Create PostgreSQL Function¶
CREATE OR REPLACE FUNCTION raw_ops.fn_check_NEW_MONITOR_v2()
RETURNS jsonb
LANGUAGE plpgsql
SECURITY DEFINER
AS $$
DECLARE
v_run_id uuid;
v_start_time timestamptz;
v_duration_ms integer;
v_alert_needed boolean := false;
BEGIN
v_start_time := clock_timestamp();
-- Create monitoring run record
INSERT INTO raw_ops.monitoring_runs (check_name, status)
VALUES ('new_monitor', 'running')
RETURNING id INTO v_run_id;
-- Your check logic here
-- Set v_alert_needed = true if threshold breached
v_duration_ms := EXTRACT(EPOCH FROM (clock_timestamp() - v_start_time)) * 1000;
-- Update monitoring run
UPDATE raw_ops.monitoring_runs
SET
status = 'success',
duration_ms = v_duration_ms,
result_summary = 'Your summary',
result_json = '{"your": "data"}'::jsonb
WHERE id = v_run_id;
-- Return result
RETURN jsonb_build_object(
'success', true,
'run_id', v_run_id,
'check_name', 'new_monitor',
'duration_ms', v_duration_ms,
'alert_config', CASE
WHEN NOT v_alert_needed THEN NULL
ELSE jsonb_build_object(
'channel', 'ops-alerts',
'severity', 'warning',
'title', 'Your Alert Title',
'message', 'Your alert message',
'fields', jsonb_build_array(
jsonb_build_object('label', 'Field', 'value', 'Value', 'inline', true)
)
)
END
);
EXCEPTION WHEN OTHERS THEN
v_duration_ms := EXTRACT(EPOCH FROM (clock_timestamp() - v_start_time)) * 1000;
UPDATE raw_ops.monitoring_runs
SET status = 'error', duration_ms = v_duration_ms, error_message = SQLERRM
WHERE id = v_run_id;
RETURN jsonb_build_object(
'success', false,
'run_id', v_run_id,
'check_name', 'new_monitor',
'error', SQLERRM,
'duration_ms', v_duration_ms
);
END;
$$;
Step 2: Update run-monitor Edge Function¶
Add to MONITOR_FUNCTIONS map in index.ts:
const MONITOR_FUNCTIONS: Record<string, string> = {
// ... existing monitors
new_monitor: "fn_check_NEW_MONITOR_v2",
};
Deploy:
Step 3: Test Manually¶
Step 4: Schedule with pg_cron¶
SELECT cron.schedule(
'monitor-new-monitor',
'*/15 * * * *',
$$
SELECT net.http_post(
url := 'https://znfjpllsiuyezqlneqzr.supabase.co/functions/v1/run-monitor',
headers := '{"Content-Type": "application/json", "apikey": "YOUR_ANON_KEY"}'::jsonb,
body := '{"check": "new_monitor"}'::jsonb
);
$$
);
Alert Response Runbook¶
Order Ingestion SLO Breach¶
Severity: WARN (5-15 min) / PAGE (>15 min)
Immediate Actions:
1. Check Make.com scenario "Order Ingestion" execution history
2. Check Shopify webhook delivery status
3. Check Supabase Edge Function logs for shopify-order-webhook
4. Verify database connectivity
Root Causes: - Make.com scenario disabled or failed - Shopify webhook endpoint changed - Edge Function deployment issue - Database connection pool exhausted
Outbox Health Alert¶
Severity: warning (>10 stuck) / critical (>50 stuck)
Immediate Actions:
1. Check pg_cron job process-outbox status
2. Check Customer.io API status
3. Review failed emails in outbox table
4. Check for rate limiting
Root Causes: - pg_cron job disabled - Customer.io API key expired - Email content triggering spam filters - Rate limit exceeded
Low Stock Alert¶
Severity: critical (always)
Immediate Actions: 1. Verify current stock levels in ops portal 2. Check pending orders against available stock 3. Contact co-packer immediately 4. Consider pausing new subscriptions if critical
Root Causes: - Higher than expected demand - Production delays at co-packer - Batch failures reducing available stock
Lab SLA Breach¶
Severity: warning
Immediate Actions: 1. Identify affected batches in ops portal 2. Contact laboratory for status update 3. Request expedited processing if possible 4. Update expected release dates
Root Causes: - Laboratory backlog - Sample issues requiring retest - Communication delays
Working Capital Alert¶
Severity: warning (>£400k) / critical (>£600k)
Immediate Actions: 1. Review batches in QA_HOLD 2. Contact laboratory for bulk status update 3. Prioritise oldest batches for release 4. Assess cash flow impact
Root Causes: - Multiple batches pending lab results - Lab SLA breaches accumulating - Production run timing issues
Troubleshooting¶
Monitor Not Running¶
-
Check pg_cron job exists and is active:
-
Check recent pg_cron executions:
-
Verify Edge Function is deployed:
Alert Not Sending¶
-
Check monitoring_runs for alert_triggered status:
-
Check ops-alerter Edge Function logs in Supabase Dashboard
-
Verify Slack webhook URL in environment variables
-
Test ops-alerter directly with manual payload
High Duration Times¶
-
Check query performance:
-
Review underlying view/function performance
-
Add missing indexes if needed
Version History¶
| Version | Date | Changes | Author |
|---|---|---|---|
| 1.0 | 2026-01-17 | Initial release - migrated 6 monitors from Make.com | Anton |
Appendix: Full Monitor Function List¶
| Check Name | Function | View/Source |
|---|---|---|
| order_ingestion_slo | fn_check_order_ingestion_slo_v2 | check_order_ingestion_slo() |
| outbox_health | fn_check_outbox_health_v2 | v_outbox_health |
| stock_levels | fn_check_stock_levels_v2 | get_low_stock_alerts() |
| lab_sla | fn_check_lab_sla_v2 | v_lab_sla_tracking |
| daily_snapshot | fn_daily_snapshot_v2 | get_daily_snapshot() |
| working_capital | fn_check_working_capital_v2 | v_working_capital_metrics |