
SOP-MON-01: Monitoring & Alerting Architecture v1.0

Centralised monitoring system for Protocol Raw operations

Document ID: SOP-MON-01-v1.0
Version: 1.0
Status: ✅ Production Ready
Last Updated: 2026-01-17
Owner: Protocol Raw Operations
Review Date: 2026-04-17


Purpose

This SOP defines the centralised monitoring and alerting architecture for Protocol Raw operations. All operational monitors run natively within the Supabase stack, eliminating external dependencies on Make.com for critical alerting.

Key Principles:

- Native Stack: All monitoring runs within Supabase (pg_cron, Edge Functions, PostgreSQL)
- Single Alerter: One Edge Function (ops-alerter) handles all Slack notifications
- Unified Logging: All monitor runs logged to raw_ops.monitoring_runs
- Scale First: Architecture tested for 100,000+ customers


Scope

Applies to: All automated operational monitoring and alerting

Components:

- 6 PostgreSQL monitoring functions
- 2 Edge Functions (run-monitor, ops-alerter)
- 6 pg_cron scheduled jobs
- 3 Slack channels (#ops-alerts, #ops-urgent, #daily-ops)

Related SOPs:

- SOP-01: Batch Creation & Lab-to-Release (Lab SLA, Working Capital)
- SOP-03: Daily Operations (Daily Snapshot)
- SOP-05: Email Processing (Outbox Health)
- SOP-06: Order Ingestion SLO
- SOP-INV-01: Inventory Control (Stock Levels)


System Architecture

Process Flow

```
pg_cron (scheduled jobs)
  → pg_net (async HTTP)
  → run-monitor (Edge Function)
  → fn_check_*_v2() (PostgreSQL functions)
  → monitoring_runs (logging table)
  → ops-alerter (Edge Function) [if threshold breached]
  → Slack (#ops-alerts / #ops-urgent / #daily-ops)
```
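The hand-off at the end of this pipeline can be sketched in TypeScript: run-monitor only calls ops-alerter when a check succeeded and returned a non-null alert config. The type names below are illustrative, not the actual Edge Function source.

```typescript
// Illustrative shapes for the pipeline's final step; field names mirror the
// run-monitor response documented later in this SOP, but the types themselves
// are assumptions.
type AlertConfig = { channel: string; severity: string; title: string; message: string };
type CheckResult = { success: boolean; check_name: string; alert_config: AlertConfig | null };

// ops-alerter is invoked only when the check ran and a threshold was breached;
// returns the target channel, or null when no alert is needed.
function alertTarget(result: CheckResult): string | null {
  if (!result.success || result.alert_config === null) return null;
  return result.alert_config.channel;
}
```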

Slack Channels

| Channel | Purpose | Severity Levels |
|---------|---------|-----------------|
| #ops-alerts | Standard operational alerts | warning |
| #ops-urgent | Critical issues requiring immediate action | critical |
| #daily-ops | Daily digest and info notifications | info |
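The routing in this table can be expressed as a small severity-to-channel lookup. This is a sketch; in practice the mapping is carried in each monitor's alert config rather than a standalone function.

```typescript
// Severity-to-channel routing as documented in the table above. The function
// itself is illustrative, not part of the deployed code.
type Severity = "info" | "warning" | "critical";

function channelFor(severity: Severity): string {
  switch (severity) {
    case "critical": return "#ops-urgent";
    case "warning": return "#ops-alerts";
    case "info": return "#daily-ops";
  }
}
```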

Monitors

1. Order Ingestion SLO

Purpose: Detects orders received via webhook but not ingested within SLO threshold

Function: raw_ops.fn_check_order_ingestion_slo_v2()
Schedule: Every 2 minutes (*/2 * * * *)
Channel: #ops-alerts (WARN) / #ops-urgent (PAGE)

Thresholds:

| Age | Severity | Action |
|-----|----------|--------|
| 5-15 minutes | WARN | Alert to #ops-alerts |
| >15 minutes | PAGE | Alert to #ops-urgent |

Alert Fields:

- Orders Affected (count)
- Oldest (minutes)
- Severity (WARN/PAGE)
- Order IDs (Shopify IDs)

Source Logic: Compares ops_events (webhook received) against the orders table (ingested) within a 20-minute lookback window.
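As a minimal sketch of the threshold table, age in minutes maps to severity like this. Exact boundary handling at 5 and 15 minutes is an assumption, since the table gives ranges rather than inclusive/exclusive bounds.

```typescript
// Classify an un-ingested order by age, per the thresholds documented above.
// Boundary treatment (>= 5, > 15) is assumed, not confirmed by the SOP.
function ingestionSeverity(ageMinutes: number): "OK" | "WARN" | "PAGE" {
  if (ageMinutes > 15) return "PAGE"; // page #ops-urgent
  if (ageMinutes >= 5) return "WARN"; // warn #ops-alerts
  return "OK";
}
```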


2. Outbox Health

Purpose: Monitors email outbox queue for stuck or backed-up emails

Function: raw_ops.fn_check_outbox_health_v2()
Schedule: Every 2 minutes (*/2 * * * *)
Channel: #ops-alerts (warning) / #ops-urgent (critical)

Thresholds:

| Condition | Severity |
|-----------|----------|
| stuck_emails > 10 OR pending_emails > 50 | warning |
| stuck_emails > 50 OR pending_emails > 200 | critical |
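The two threshold rows can be condensed into one classifier; checking the critical condition first ensures the stricter severity wins when both rows match. A sketch, not the view's actual logic:

```typescript
// Outbox severity per the threshold table above. Critical is evaluated first
// so that e.g. 60 stuck emails yields "critical", not "warning".
function outboxSeverity(stuckEmails: number, pendingEmails: number): "ok" | "warning" | "critical" {
  if (stuckEmails > 50 || pendingEmails > 200) return "critical";
  if (stuckEmails > 10 || pendingEmails > 50) return "warning";
  return "ok";
}
```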

Alert Fields:

- Stuck Emails (>30 min old)
- Pending Emails (total)
- Failed Today (count)
- Oldest Pending (minutes)

Source: raw_ops.v_outbox_health view


3. Stock Levels

Purpose: Alerts when product inventory falls below reorder threshold

Function: raw_ops.fn_check_stock_levels_v2()
Schedule: Every 15 minutes (*/15 * * * *)
Channel: #ops-urgent (always critical)

Thresholds:

- Any product below its configured threshold triggers an alert

Alert Fields:

- SKU
- Units Available
- kg Available
- Threshold Units

Source: public.get_low_stock_alerts() function

Action Required: Contact co-packer to schedule production run


4. Lab SLA

Purpose: Detects batches stuck in QA_HOLD beyond 4-day SLA

Function: raw_ops.fn_check_lab_sla_v2()
Schedule: Every 15 minutes (*/15 * * * *)
Channel: #ops-alerts (warning)

Thresholds:

- Any batch in QA_HOLD > 4 days (96 hours) triggers an alert

Alert Fields:

- Lab Name
- SLA Breaches (count)
- Avg Turnaround (hours)
- P95 Turnaround (hours)
- Compliance %

Source: raw_ops.v_lab_sla_tracking view

Action Required: Contact laboratory immediately


5. Daily Snapshot

Purpose: Provides a daily operational digest at 09:00 UTC

Function: raw_ops.fn_daily_snapshot_v2()
Schedule: Daily at 09:00 UTC (0 9 * * *)
Channel: #daily-ops (always sends)

Health Status Logic:

| Condition | Indicator |
|-----------|-----------|
| exceptions_logged_24h > 0 | 🔴 |
| batches_qa_hold_current > 5 | 🟡 |
| Otherwise | 🟢 |
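A sketch of the indicator logic, following the table's row order so that exceptions take precedence over the QA-hold count (parameter names mirror the conditions above; the function itself is illustrative):

```typescript
// Daily snapshot health indicator, per the table above. Row order matters:
// any exception in the last 24h forces red even if QA hold is also elevated.
function healthIndicator(exceptionsLogged24h: number, batchesQaHoldCurrent: number): "🔴" | "🟡" | "🟢" {
  if (exceptionsLogged24h > 0) return "🔴";
  if (batchesQaHoldCurrent > 5) return "🟡";
  return "🟢";
}
```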

Alert Fields:

- 📦 Orders Processed
- 💰 Revenue (£)
- 🚚 Shipments Dispatched
- ⚠️ Exceptions
- 🧪 QA Hold (batches)
- ⚡ Query Time (ms)

Source: public.get_daily_snapshot() function


6. Working Capital

Purpose: Monitors cash tied up in QA_HOLD inventory

Function: raw_ops.fn_check_working_capital_v2()
Schedule: Every 6 hours (0 */6 * * *)
Channel: #ops-alerts (warning) / #ops-urgent (critical)

Thresholds:

| Condition | Severity |
|-----------|----------|
| cash_tied_up > £400k OR oldest_batch > 5 days | warning |
| cash_tied_up > £600k OR oldest_batch > 7 days | critical |
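Condensed as a classifier (amounts in pounds; a sketch of the threshold table, not the monitor's actual implementation):

```typescript
// Working-capital severity per the table above; the critical row is checked
// first so the stricter condition wins when both thresholds are breached.
function workingCapitalSeverity(cashTiedUpGbp: number, oldestBatchDays: number): "ok" | "warning" | "critical" {
  if (cashTiedUpGbp > 600_000 || oldestBatchDays > 7) return "critical";
  if (cashTiedUpGbp > 400_000 || oldestBatchDays > 5) return "warning";
  return "ok";
}
```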

Alert Fields:

- Cash Tied Up (£)
- Batches in Hold (count)
- Total kg Held
- Oldest Batch (days)
- Avg Hold Time (days)

Source: raw_ops.v_working_capital_metrics view

Action Required: Review lab backlog, expedite testing


Database Schema

monitoring_runs

Central logging table for all monitor executions.

```sql
CREATE TABLE raw_ops.monitoring_runs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  check_name TEXT NOT NULL,
  run_at TIMESTAMPTZ DEFAULT now(),
  duration_ms INTEGER,
  status TEXT DEFAULT 'running',  -- running, success, error
  result_summary TEXT,
  result_json JSONB,
  alert_triggered BOOLEAN DEFAULT false,
  error_message TEXT
);

CREATE INDEX idx_monitoring_runs_check_name ON raw_ops.monitoring_runs(check_name);
CREATE INDEX idx_monitoring_runs_run_at ON raw_ops.monitoring_runs(run_at DESC);
```
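For Edge Function code that reads this table, a row can be typed roughly as follows. The column names mirror the DDL above, while the TypeScript types are assumptions about how a Supabase client surfaces them (UUIDs and timestamps as strings):

```typescript
// Illustrative TypeScript shape for a raw_ops.monitoring_runs row; not part
// of the deployed code. JSONB is left as unknown since each check stores a
// different payload shape.
interface MonitoringRun {
  id: string;                      // UUID
  check_name: string;
  run_at: string;                  // TIMESTAMPTZ, ISO-8601 string
  duration_ms: number | null;
  status: "running" | "success" | "error";
  result_summary: string | null;
  result_json: unknown;
  alert_triggered: boolean;
  error_message: string | null;
}
```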

Useful Queries

Last 24 hours summary:

```sql
SELECT
  check_name,
  COUNT(*) as runs_24h,
  COUNT(*) FILTER (WHERE status = 'error') as errors,
  COUNT(*) FILTER (WHERE alert_triggered) as alerts_sent,
  ROUND(AVG(duration_ms), 1) as avg_ms
FROM raw_ops.monitoring_runs
WHERE run_at > now() - interval '24 hours'
GROUP BY check_name
ORDER BY check_name;
```

Recent failures:

```sql
SELECT check_name, run_at, error_message
FROM raw_ops.monitoring_runs
WHERE status = 'error'
  AND run_at > now() - interval '24 hours'
ORDER BY run_at DESC;
```

Alert history:

```sql
SELECT check_name, run_at, result_summary
FROM raw_ops.monitoring_runs
WHERE alert_triggered = true
  AND run_at > now() - interval '7 days'
ORDER BY run_at DESC;
```


Edge Functions

run-monitor

Location: supabase/functions/run-monitor/index.ts
Purpose: Generic monitor runner that routes to specific check functions

Endpoint: POST /functions/v1/run-monitor

Request Body:

```json
{"check": "order_ingestion_slo"}
```

Available Checks:

- order_ingestion_slo
- outbox_health
- stock_levels
- lab_sla
- daily_snapshot
- working_capital
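A sketch of how run-monitor might guard against unknown check names before routing; the list is copied from above, the helper itself is hypothetical:

```typescript
// The six documented check names. Rejecting unknown values up front lets the
// Edge Function return a 400 instead of failing inside the database call.
const AVAILABLE_CHECKS = new Set([
  "order_ingestion_slo",
  "outbox_health",
  "stock_levels",
  "lab_sla",
  "daily_snapshot",
  "working_capital",
]);

function isKnownCheck(check: string): boolean {
  return AVAILABLE_CHECKS.has(check);
}
```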

Response:

```json
{
  "success": true,
  "run_id": "uuid",
  "check_name": "order_ingestion_slo",
  "violation_count": 0,
  "duration_ms": 5,
  "alert_sent": false,
  "alert_config": null
}
```

ops-alerter

Location: supabase/functions/ops-alerter/index.ts
Purpose: Sends formatted alerts to Slack via Kai bot

Endpoint: POST /functions/v1/ops-alerter
JWT Verification: Disabled (internal calls only)

Request Body:

```json
{
  "channel": "ops-alerts",
  "severity": "warning",
  "title": "Alert Title",
  "message": "Alert description",
  "fields": [
    {"label": "Field 1", "value": "Value 1", "inline": true}
  ],
  "source_check": "order_ingestion_slo"
}
```

Channels:

- ops-alerts → #ops-alerts
- ops-urgent → #ops-urgent
- daily-ops → #daily-ops
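A hypothetical helper that assembles a request body in the shape shown above; it only builds the payload and does not perform the HTTP call:

```typescript
// Builds an ops-alerter request body matching the documented JSON shape.
// The helper and its parameter names are illustrative, not deployed code.
type AlertField = { label: string; value: string; inline: boolean };

function buildAlertPayload(
  channel: "ops-alerts" | "ops-urgent" | "daily-ops",
  severity: "info" | "warning" | "critical",
  title: string,
  message: string,
  fields: AlertField[],
  sourceCheck: string,
) {
  // source_check is snake_case on the wire, matching the example payload.
  return { channel, severity, title, message, fields, source_check: sourceCheck };
}
```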


pg_cron Jobs

View All Monitor Jobs

```sql
SELECT jobid, jobname, schedule, active
FROM cron.job
WHERE jobname LIKE 'monitor-%'
ORDER BY jobname;
```

Job Management

Disable a job:

```sql
UPDATE cron.job SET active = false WHERE jobname = 'monitor-working-capital';
```

Enable a job:

```sql
UPDATE cron.job SET active = true WHERE jobname = 'monitor-working-capital';
```

Delete a job:

```sql
SELECT cron.unschedule('monitor-working-capital');
```

Reschedule a job:

```sql
-- First delete, then recreate with new schedule
SELECT cron.unschedule('monitor-stock-levels');
SELECT cron.schedule(
  'monitor-stock-levels',
  '*/30 * * * *',  -- New schedule: every 30 minutes
  $$ ... $$
);
```


Adding a New Monitor

Step 1: Create PostgreSQL Function

```sql
CREATE OR REPLACE FUNCTION raw_ops.fn_check_NEW_MONITOR_v2()
RETURNS jsonb
LANGUAGE plpgsql
SECURITY DEFINER
AS $$
DECLARE
  v_run_id uuid;
  v_start_time timestamptz;
  v_duration_ms integer;
  v_alert_needed boolean := false;
BEGIN
  v_start_time := clock_timestamp();

  -- Create monitoring run record
  INSERT INTO raw_ops.monitoring_runs (check_name, status)
  VALUES ('new_monitor', 'running')
  RETURNING id INTO v_run_id;

  -- Your check logic here
  -- Set v_alert_needed = true if threshold breached

  v_duration_ms := EXTRACT(EPOCH FROM (clock_timestamp() - v_start_time)) * 1000;

  -- Update monitoring run
  UPDATE raw_ops.monitoring_runs
  SET
    status = 'success',
    duration_ms = v_duration_ms,
    result_summary = 'Your summary',
    result_json = '{"your": "data"}'::jsonb
  WHERE id = v_run_id;

  -- Return result
  RETURN jsonb_build_object(
    'success', true,
    'run_id', v_run_id,
    'check_name', 'new_monitor',
    'duration_ms', v_duration_ms,
    'alert_config', CASE
      WHEN NOT v_alert_needed THEN NULL
      ELSE jsonb_build_object(
        'channel', 'ops-alerts',
        'severity', 'warning',
        'title', 'Your Alert Title',
        'message', 'Your alert message',
        'fields', jsonb_build_array(
          jsonb_build_object('label', 'Field', 'value', 'Value', 'inline', true)
        )
      )
    END
  );

EXCEPTION WHEN OTHERS THEN
  v_duration_ms := EXTRACT(EPOCH FROM (clock_timestamp() - v_start_time)) * 1000;

  UPDATE raw_ops.monitoring_runs
  SET status = 'error', duration_ms = v_duration_ms, error_message = SQLERRM
  WHERE id = v_run_id;

  RETURN jsonb_build_object(
    'success', false,
    'run_id', v_run_id,
    'check_name', 'new_monitor',
    'error', SQLERRM,
    'duration_ms', v_duration_ms
  );
END;
$$;
```

Step 2: Update run-monitor Edge Function

Add to MONITOR_FUNCTIONS map in index.ts:

```typescript
const MONITOR_FUNCTIONS: Record<string, string> = {
  // ... existing monitors
  new_monitor: "fn_check_NEW_MONITOR_v2",
};
```

Deploy:

```bash
supabase functions deploy run-monitor
```

Step 3: Test Manually

Invoke POST /functions/v1/run-monitor with the new check name:

```json
{"check": "new_monitor"}
```
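For scripted testing, the request pieces can be assembled like this. buildRunMonitorRequest is a hypothetical helper: it only constructs the method, headers, and body documented for the endpoint, and you pass the result to fetch() (or mirror it in curl) yourself, substituting your project's anon key.

```typescript
// Hypothetical helper: builds the POST /functions/v1/run-monitor request
// pieces for a manual test. It does not perform any HTTP call.
function buildRunMonitorRequest(check: string, anonKey: string) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json", apikey: anonKey },
    body: JSON.stringify({ check }),
  };
}
```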

Step 4: Schedule with pg_cron

```sql
SELECT cron.schedule(
  'monitor-new-monitor',
  '*/15 * * * *',
  $$
  SELECT net.http_post(
    url := 'https://znfjpllsiuyezqlneqzr.supabase.co/functions/v1/run-monitor',
    headers := '{"Content-Type": "application/json", "apikey": "YOUR_ANON_KEY"}'::jsonb,
    body := '{"check": "new_monitor"}'::jsonb
  );
  $$
);
```

Alert Response Runbook

Order Ingestion SLO Breach

Severity: WARN (5-15 min) / PAGE (>15 min)

Immediate Actions:

1. Check Make.com scenario "Order Ingestion" execution history
2. Check Shopify webhook delivery status
3. Check Supabase Edge Function logs for shopify-order-webhook
4. Verify database connectivity

Root Causes:

- Make.com scenario disabled or failed
- Shopify webhook endpoint changed
- Edge Function deployment issue
- Database connection pool exhausted


Outbox Health Alert

Severity: warning (>10 stuck) / critical (>50 stuck)

Immediate Actions:

1. Check pg_cron job process-outbox status
2. Check Customer.io API status
3. Review failed emails in outbox table
4. Check for rate limiting

Root Causes:

- pg_cron job disabled
- Customer.io API key expired
- Email content triggering spam filters
- Rate limit exceeded


Low Stock Alert

Severity: critical (always)

Immediate Actions:

1. Verify current stock levels in ops portal
2. Check pending orders against available stock
3. Contact co-packer immediately
4. Consider pausing new subscriptions if critical

Root Causes:

- Higher than expected demand
- Production delays at co-packer
- Batch failures reducing available stock


Lab SLA Breach

Severity: warning

Immediate Actions:

1. Identify affected batches in ops portal
2. Contact laboratory for status update
3. Request expedited processing if possible
4. Update expected release dates

Root Causes:

- Laboratory backlog
- Sample issues requiring retest
- Communication delays


Working Capital Alert

Severity: warning (>£400k) / critical (>£600k)

Immediate Actions:

1. Review batches in QA_HOLD
2. Contact laboratory for bulk status update
3. Prioritise oldest batches for release
4. Assess cash flow impact

Root Causes:

- Multiple batches pending lab results
- Lab SLA breaches accumulating
- Production run timing issues


Troubleshooting

Monitor Not Running

1. Check the pg_cron job exists and is active:

   ```sql
   SELECT jobname, schedule, active FROM cron.job WHERE jobname = 'monitor-xxx';
   ```

2. Check recent pg_cron executions:

   ```sql
   SELECT * FROM cron.job_run_details
   WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'monitor-xxx')
   ORDER BY start_time DESC LIMIT 5;
   ```

3. Verify the Edge Function is deployed:

   ```bash
   supabase functions list
   ```

Alert Not Sending

1. Check monitoring_runs for alert_triggered status:

   ```sql
   SELECT * FROM raw_ops.monitoring_runs
   WHERE check_name = 'xxx'
   ORDER BY run_at DESC LIMIT 5;
   ```

2. Check ops-alerter Edge Function logs in the Supabase Dashboard

3. Verify the Slack webhook URL in environment variables

4. Test ops-alerter directly with a manual payload

High Duration Times

1. Check query performance:

   ```sql
   SELECT check_name, AVG(duration_ms), MAX(duration_ms)
   FROM raw_ops.monitoring_runs
   WHERE run_at > now() - interval '24 hours'
   GROUP BY check_name;
   ```

2. Review underlying view/function performance

3. Add missing indexes if needed


Version History

| Version | Date | Changes | Author |
|---------|------|---------|--------|
| 1.0 | 2026-01-17 | Initial release - migrated 6 monitors from Make.com | Anton |

Appendix: Full Monitor Function List

| Check Name | Function | View/Source |
|------------|----------|-------------|
| order_ingestion_slo | fn_check_order_ingestion_slo_v2 | check_order_ingestion_slo() |
| outbox_health | fn_check_outbox_health_v2 | v_outbox_health |
| stock_levels | fn_check_stock_levels_v2 | get_low_stock_alerts() |
| lab_sla | fn_check_lab_sla_v2 | v_lab_sla_tracking |
| daily_snapshot | fn_daily_snapshot_v2 | get_daily_snapshot() |
| working_capital | fn_check_working_capital_v2 | v_working_capital_metrics |