
SOP-MON-01: Monitoring & Alerting Architecture v1.0

Centralised monitoring system for Protocol Raw operations

Document ID: SOP-MON-01-v1.0
Version: 1.0
Status: ✅ Production Ready
Last Updated: 2026-01-17
Owner: Protocol Raw Operations
Review Date: 2026-04-17


Purpose

This SOP defines the centralised monitoring and alerting architecture for Protocol Raw operations. All operational monitors run natively within the Supabase stack, eliminating external dependencies on Make.com for critical alerting.

Key Principles:

- Native Stack: All monitoring runs within Supabase (pg_cron, Edge Functions, PostgreSQL)
- Single Alerter: One Edge Function (ops-alerter) handles all Slack notifications
- Unified Logging: All monitor runs logged to raw_ops.monitoring_runs
- Scale First: Architecture tested for 100,000+ customers


Scope

Applies to: All automated operational monitoring and alerting

Components:

- 6 PostgreSQL monitoring functions
- 2 Edge Functions (run-monitor, ops-alerter)
- 6 pg_cron scheduled jobs
- 3 Slack channels (#ops-alerts, #ops-urgent, #daily-ops)

Related SOPs:

- SOP-01: Batch Creation & Lab-to-Release (Lab SLA, Working Capital)
- SOP-03: Daily Operations (Daily Snapshot)
- SOP-05: Email Processing (Outbox Health)
- SOP-06: Order Ingestion SLO
- SOP-INV-01: Inventory Control (Stock Levels)


System Architecture

Process Flow

```
pg_cron (scheduled jobs)
  → pg_net (async HTTP)
  → run-monitor (Edge Function)
  → fn_check_*_v2() (PostgreSQL functions)
  → monitoring_runs (logging table)
  → ops-alerter (Edge Function) [if threshold breached]
  → Slack (#ops-alerts / #ops-urgent / #daily-ops)
```
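The hand-off at the end of this pipeline can be sketched in TypeScript: run-monitor only calls ops-alerter when a check succeeded and returned a non-null alert config. The type names below are illustrative, not the actual Edge Function source.

```typescript
// Illustrative shapes for the pipeline's final step; field names mirror the
// run-monitor response documented later in this SOP, but the types themselves
// are assumptions.
type AlertConfig = { channel: string; severity: string; title: string; message: string };
type CheckResult = { success: boolean; check_name: string; alert_config: AlertConfig | null };

// ops-alerter is invoked only when the check ran and a threshold was breached;
// returns the target channel, or null when no alert is needed.
function alertTarget(result: CheckResult): string | null {
  if (!result.success || result.alert_config === null) return null;
  return result.alert_config.channel;
}
```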

Slack Channels

| Channel | Purpose | Severity Levels |
|---------|---------|-----------------|
| #ops-alerts | Standard operational alerts | warning |
| #ops-urgent | Critical issues requiring immediate action | critical |
| #daily-ops | Daily digest and info notifications | info |
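The routing in this table can be expressed as a small severity-to-channel lookup. This is a sketch; in practice the mapping is carried in each monitor's alert config rather than a standalone function.

```typescript
// Severity-to-channel routing as documented in the table above. The function
// itself is illustrative, not part of the deployed code.
type Severity = "info" | "warning" | "critical";

function channelFor(severity: Severity): string {
  switch (severity) {
    case "critical": return "#ops-urgent";
    case "warning": return "#ops-alerts";
    case "info": return "#daily-ops";
  }
}
```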

Monitors

1. Order Ingestion SLO

Purpose: Detects orders received via webhook but not ingested within SLO threshold

Function: raw_ops.fn_check_order_ingestion_slo_v2()
Schedule: Every 2 minutes (*/2 * * * *)
Channel: #ops-alerts (WARN) / #ops-urgent (PAGE)

Thresholds:

| Age | Severity | Action |
|-----|----------|--------|
| 5-15 minutes | WARN | Alert to #ops-alerts |
| >15 minutes | PAGE | Alert to #ops-urgent |

Alert Fields:

- Orders Affected (count)
- Oldest (minutes)
- Severity (WARN/PAGE)
- Order IDs (Shopify IDs)

Source Logic: Compares ops_events (webhook received) against the orders table (ingested) within a 20-minute lookback window.
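As a minimal sketch of the threshold table, age in minutes maps to severity like this. Exact boundary handling at 5 and 15 minutes is an assumption, since the table gives ranges rather than inclusive/exclusive bounds.

```typescript
// Classify an un-ingested order by age, per the thresholds documented above.
// Boundary treatment (>= 5, > 15) is assumed, not confirmed by the SOP.
function ingestionSeverity(ageMinutes: number): "OK" | "WARN" | "PAGE" {
  if (ageMinutes > 15) return "PAGE"; // page #ops-urgent
  if (ageMinutes >= 5) return "WARN"; // warn #ops-alerts
  return "OK";
}
```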


2. Outbox Health

Purpose: Monitors email outbox queue for stuck or backed-up emails

Function: raw_ops.fn_check_outbox_health_v2()
Schedule: Every 2 minutes (*/2 * * * *)
Channel: #ops-alerts (warning) / #ops-urgent (critical)

Thresholds:

| Condition | Severity |
|-----------|----------|
| stuck_emails > 10 OR pending_emails > 50 | warning |
| stuck_emails > 50 OR pending_emails > 200 | critical |
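The two threshold rows can be condensed into one classifier; checking the critical condition first ensures the stricter severity wins when both rows match. A sketch, not the view's actual logic:

```typescript
// Outbox severity per the threshold table above. Critical is evaluated first
// so that e.g. 60 stuck emails yields "critical", not "warning".
function outboxSeverity(stuckEmails: number, pendingEmails: number): "ok" | "warning" | "critical" {
  if (stuckEmails > 50 || pendingEmails > 200) return "critical";
  if (stuckEmails > 10 || pendingEmails > 50) return "warning";
  return "ok";
}
```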

Alert Fields:

- Stuck Emails (>30 min old)
- Pending Emails (total)
- Failed Today (count)
- Oldest Pending (minutes)

Source: raw_ops.v_outbox_health view


3. Stock Levels

Purpose: Alerts when product inventory falls below reorder threshold

Function: raw_ops.fn_check_stock_levels_v2()
Schedule: Every 15 minutes (*/15 * * * *)
Channel: #ops-urgent (always critical)

Thresholds:

- Any product below its configured threshold triggers an alert

Alert Fields:

- SKU
- Units Available
- kg Available
- Threshold Units

Source: public.get_low_stock_alerts() function

Action Required: Contact co-packer to schedule production run


4. Lab SLA

Purpose: Detects batches stuck in QA_HOLD beyond 4-day SLA

Function: raw_ops.fn_check_lab_sla_v2()
Schedule: Every 15 minutes (*/15 * * * *)
Channel: #ops-alerts (warning)

Thresholds:

- Any batch in QA_HOLD > 4 days (96 hours) triggers an alert

Alert Fields:

- Lab Name
- SLA Breaches (count)
- Avg Turnaround (hours)
- P95 Turnaround (hours)
- Compliance %

Source: raw_ops.v_lab_sla_tracking view

Action Required: Contact laboratory immediately


5. Daily Snapshot

Purpose: Provides a daily operational digest at 09:00 UTC

Function: raw_ops.fn_daily_snapshot_v2()
Schedule: Daily at 09:00 UTC (0 9 * * *)
Channel: #daily-ops (always sends)

Health Status Logic:

| Condition | Indicator |
|-----------|-----------|
| exceptions_logged_24h > 0 | 🔴 |
| batches_qa_hold_current > 5 | 🟡 |
| Otherwise | 🟢 |
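A sketch of the indicator logic, following the table's row order so that exceptions take precedence over the QA-hold count (parameter names mirror the conditions above; the function itself is illustrative):

```typescript
// Daily snapshot health indicator, per the table above. Row order matters:
// any exception in the last 24h forces red even if QA hold is also elevated.
function healthIndicator(exceptionsLogged24h: number, batchesQaHoldCurrent: number): "🔴" | "🟡" | "🟢" {
  if (exceptionsLogged24h > 0) return "🔴";
  if (batchesQaHoldCurrent > 5) return "🟡";
  return "🟢";
}
```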

Alert Fields:

- 📦 Orders Processed
- 💰 Revenue (£)
- 🚚 Shipments Dispatched
- ⚠️ Exceptions
- 🧪 QA Hold (batches)
- ⚡ Query Time (ms)

Source: public.get_daily_snapshot() function


6. Working Capital

Purpose: Monitors cash tied up in QA_HOLD inventory

Function: raw_ops.fn_check_working_capital_v2()
Schedule: Every 6 hours (0 */6 * * *)
Channel: #ops-alerts (warning) / #ops-urgent (critical)

Thresholds:

| Condition | Severity |
|-----------|----------|
| cash_tied_up > £400k OR oldest_batch > 5 days | warning |
| cash_tied_up > £600k OR oldest_batch > 7 days | critical |
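Condensed as a classifier (amounts in pounds; a sketch of the threshold table, not the monitor's actual implementation):

```typescript
// Working-capital severity per the table above; the critical row is checked
// first so the stricter condition wins when both thresholds are breached.
function workingCapitalSeverity(cashTiedUpGbp: number, oldestBatchDays: number): "ok" | "warning" | "critical" {
  if (cashTiedUpGbp > 600_000 || oldestBatchDays > 7) return "critical";
  if (cashTiedUpGbp > 400_000 || oldestBatchDays > 5) return "warning";
  return "ok";
}
```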

Alert Fields:

- Cash Tied Up (£)
- Batches in Hold (count)
- Total kg Held
- Oldest Batch (days)
- Avg Hold Time (days)

Source: raw_ops.v_working_capital_metrics view

Action Required: Review lab backlog, expedite testing


Database Schema

monitoring_runs

Central logging table for all monitor executions.

```sql
CREATE TABLE raw_ops.monitoring_runs (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  check_name TEXT NOT NULL,
  run_at TIMESTAMPTZ DEFAULT now(),
  duration_ms INTEGER,
  status TEXT DEFAULT 'running',  -- running, success, error
  result_summary TEXT,
  result_json JSONB,
  alert_triggered BOOLEAN DEFAULT false,
  error_message TEXT
);

CREATE INDEX idx_monitoring_runs_check_name ON raw_ops.monitoring_runs(check_name);
CREATE INDEX idx_monitoring_runs_run_at ON raw_ops.monitoring_runs(run_at DESC);
```
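For Edge Function code that reads this table, a row can be typed roughly as follows. The column names mirror the DDL above, while the TypeScript types are assumptions about how a Supabase client surfaces them (UUIDs and timestamps as strings):

```typescript
// Illustrative TypeScript shape for a raw_ops.monitoring_runs row; not part
// of the deployed code. JSONB is left as unknown since each check stores a
// different payload shape.
interface MonitoringRun {
  id: string;                      // UUID
  check_name: string;
  run_at: string;                  // TIMESTAMPTZ, ISO-8601 string
  duration_ms: number | null;
  status: "running" | "success" | "error";
  result_summary: string | null;
  result_json: unknown;
  alert_triggered: boolean;
  error_message: string | null;
}
```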

Useful Queries

Last 24 hours summary:

```sql
SELECT
  check_name,
  COUNT(*) as runs_24h,
  COUNT(*) FILTER (WHERE status = 'error') as errors,
  COUNT(*) FILTER (WHERE alert_triggered) as alerts_sent,
  ROUND(AVG(duration_ms), 1) as avg_ms
FROM raw_ops.monitoring_runs
WHERE run_at > now() - interval '24 hours'
GROUP BY check_name
ORDER BY check_name;
```

Recent failures:

```sql
SELECT check_name, run_at, error_message
FROM raw_ops.monitoring_runs
WHERE status = 'error'
  AND run_at > now() - interval '24 hours'
ORDER BY run_at DESC;
```

Alert history:

```sql
SELECT check_name, run_at, result_summary
FROM raw_ops.monitoring_runs
WHERE alert_triggered = true
  AND run_at > now() - interval '7 days'
ORDER BY run_at DESC;
```


Edge Functions

run-monitor

Location: supabase/functions/run-monitor/index.ts
Purpose: Generic monitor runner that routes to specific check functions

Endpoint: POST /functions/v1/run-monitor

Request Body:

```json
{"check": "order_ingestion_slo"}
```

Available Checks:

- order_ingestion_slo
- outbox_health
- stock_levels
- lab_sla
- daily_snapshot
- working_capital
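A sketch of how run-monitor might guard against unknown check names before routing; the list is copied from above, the helper itself is hypothetical:

```typescript
// The six documented check names. Rejecting unknown values up front lets the
// Edge Function return a 400 instead of failing inside the database call.
const AVAILABLE_CHECKS = new Set([
  "order_ingestion_slo",
  "outbox_health",
  "stock_levels",
  "lab_sla",
  "daily_snapshot",
  "working_capital",
]);

function isKnownCheck(check: string): boolean {
  return AVAILABLE_CHECKS.has(check);
}
```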

Response:

```json
{
  "success": true,
  "run_id": "uuid",
  "check_name": "order_ingestion_slo",
  "violation_count": 0,
  "duration_ms": 5,
  "alert_sent": false,
  "alert_config": null
}
```

ops-alerter

Location: supabase/functions/ops-alerter/index.ts
Purpose: Sends formatted alerts to Slack via Kai bot

Endpoint: POST /functions/v1/ops-alerter
JWT Verification: Disabled (internal calls only)

Request Body:

```json
{
  "channel": "ops-alerts",
  "severity": "warning",
  "title": "Alert Title",
  "message": "Alert description",
  "fields": [
    {"label": "Field 1", "value": "Value 1", "inline": true}
  ],
  "source_check": "order_ingestion_slo"
}
```

Channels:

- ops-alerts → #ops-alerts
- ops-urgent → #ops-urgent
- daily-ops → #daily-ops
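A hypothetical helper that assembles a request body in the shape shown above; it only builds the payload and does not perform the HTTP call:

```typescript
// Builds an ops-alerter request body matching the documented JSON shape.
// The helper and its parameter names are illustrative, not deployed code.
type AlertField = { label: string; value: string; inline: boolean };

function buildAlertPayload(
  channel: "ops-alerts" | "ops-urgent" | "daily-ops",
  severity: "info" | "warning" | "critical",
  title: string,
  message: string,
  fields: AlertField[],
  sourceCheck: string,
) {
  // source_check is snake_case on the wire, matching the example payload.
  return { channel, severity, title, message, fields, source_check: sourceCheck };
}
```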


pg_cron Jobs

View All Monitor Jobs

```sql
SELECT jobid, jobname, schedule, active
FROM cron.job
WHERE jobname LIKE 'monitor-%'
ORDER BY jobname;
```

Job Management

Disable a job:

```sql
UPDATE cron.job SET active = false WHERE jobname = 'monitor-working-capital';
```

Enable a job:

```sql
UPDATE cron.job SET active = true WHERE jobname = 'monitor-working-capital';
```

Delete a job:

```sql
SELECT cron.unschedule('monitor-working-capital');
```

Reschedule a job:

```sql
-- First delete, then recreate with new schedule
SELECT cron.unschedule('monitor-stock-levels');
SELECT cron.schedule(
  'monitor-stock-levels',
  '*/30 * * * *',  -- New schedule: every 30 minutes
  $$ ... $$
);
```


Adding a New Monitor

Step 1: Create PostgreSQL Function

```sql
CREATE OR REPLACE FUNCTION raw_ops.fn_check_NEW_MONITOR_v2()
RETURNS jsonb
LANGUAGE plpgsql
SECURITY DEFINER
AS $$
DECLARE
  v_run_id uuid;
  v_start_time timestamptz;
  v_duration_ms integer;
  v_alert_needed boolean := false;
BEGIN
  v_start_time := clock_timestamp();

  -- Create monitoring run record
  INSERT INTO raw_ops.monitoring_runs (check_name, status)
  VALUES ('new_monitor', 'running')
  RETURNING id INTO v_run_id;

  -- Your check logic here
  -- Set v_alert_needed = true if threshold breached

  v_duration_ms := EXTRACT(EPOCH FROM (clock_timestamp() - v_start_time)) * 1000;

  -- Update monitoring run
  UPDATE raw_ops.monitoring_runs
  SET
    status = 'success',
    duration_ms = v_duration_ms,
    result_summary = 'Your summary',
    result_json = '{"your": "data"}'::jsonb
  WHERE id = v_run_id;

  -- Return result
  RETURN jsonb_build_object(
    'success', true,
    'run_id', v_run_id,
    'check_name', 'new_monitor',
    'duration_ms', v_duration_ms,
    'alert_config', CASE
      WHEN NOT v_alert_needed THEN NULL
      ELSE jsonb_build_object(
        'channel', 'ops-alerts',
        'severity', 'warning',
        'title', 'Your Alert Title',
        'message', 'Your alert message',
        'fields', jsonb_build_array(
          jsonb_build_object('label', 'Field', 'value', 'Value', 'inline', true)
        )
      )
    END
  );

EXCEPTION WHEN OTHERS THEN
  v_duration_ms := EXTRACT(EPOCH FROM (clock_timestamp() - v_start_time)) * 1000;

  UPDATE raw_ops.monitoring_runs
  SET status = 'error', duration_ms = v_duration_ms, error_message = SQLERRM
  WHERE id = v_run_id;

  RETURN jsonb_build_object(
    'success', false,
    'run_id', v_run_id,
    'check_name', 'new_monitor',
    'error', SQLERRM,
    'duration_ms', v_duration_ms
  );
END;
$$;
```

Step 2: Update run-monitor Edge Function

Add to MONITOR_FUNCTIONS map in index.ts:

```typescript
const MONITOR_FUNCTIONS: Record<string, string> = {
  // ... existing monitors
  new_monitor: "fn_check_NEW_MONITOR_v2",
};
```

Deploy:

```bash
supabase functions deploy run-monitor
```

Step 3: Test Manually

Invoke POST /functions/v1/run-monitor with the new check name:

```json
{"check": "new_monitor"}
```
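For scripted testing, the request pieces can be assembled like this. buildRunMonitorRequest is a hypothetical helper: it only constructs the method, headers, and body documented for the endpoint, and you pass the result to fetch() (or mirror it in curl) yourself, substituting your project's anon key.

```typescript
// Hypothetical helper: builds the POST /functions/v1/run-monitor request
// pieces for a manual test. It does not perform any HTTP call.
function buildRunMonitorRequest(check: string, anonKey: string) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json", apikey: anonKey },
    body: JSON.stringify({ check }),
  };
}
```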

Step 4: Schedule with pg_cron

```sql
SELECT cron.schedule(
  'monitor-new-monitor',
  '*/15 * * * *',
  $$
  SELECT net.http_post(
    url := 'https://znfjpllsiuyezqlneqzr.supabase.co/functions/v1/run-monitor',
    headers := '{"Content-Type": "application/json", "apikey": "YOUR_ANON_KEY"}'::jsonb,
    body := '{"check": "new_monitor"}'::jsonb
  );
  $$
);
```

Alert Response Runbook

Order Ingestion SLO Breach

Severity: WARN (5-15 min) / PAGE (>15 min)

Immediate Actions:

1. Check Make.com scenario "Order Ingestion" execution history
2. Check Shopify webhook delivery status
3. Check Supabase Edge Function logs for shopify-order-webhook
4. Verify database connectivity

Root Causes:

- Make.com scenario disabled or failed
- Shopify webhook endpoint changed
- Edge Function deployment issue
- Database connection pool exhausted


Outbox Health Alert

Severity: warning (>10 stuck) / critical (>50 stuck)

Immediate Actions:

1. Check pg_cron job process-outbox status
2. Check Customer.io API status
3. Review failed emails in outbox table
4. Check for rate limiting

Root Causes:

- pg_cron job disabled
- Customer.io API key expired
- Email content triggering spam filters
- Rate limit exceeded


Low Stock Alert

Severity: critical (always)

Immediate Actions:

1. Verify current stock levels in ops portal
2. Check pending orders against available stock
3. Contact co-packer immediately
4. Consider pausing new subscriptions if critical

Root Causes:

- Higher than expected demand
- Production delays at co-packer
- Batch failures reducing available stock


Lab SLA Breach

Severity: warning

Immediate Actions:

1. Identify affected batches in ops portal
2. Contact laboratory for status update
3. Request expedited processing if possible
4. Update expected release dates

Root Causes:

- Laboratory backlog
- Sample issues requiring retest
- Communication delays


Working Capital Alert

Severity: warning (>£400k) / critical (>£600k)

Immediate Actions:

1. Review batches in QA_HOLD
2. Contact laboratory for bulk status update
3. Prioritise oldest batches for release
4. Assess cash flow impact

Root Causes:

- Multiple batches pending lab results
- Lab SLA breaches accumulating
- Production run timing issues


Troubleshooting

Monitor Not Running

1. Check the pg_cron job exists and is active:

   ```sql
   SELECT jobname, schedule, active FROM cron.job WHERE jobname = 'monitor-xxx';
   ```

2. Check recent pg_cron executions:

   ```sql
   SELECT * FROM cron.job_run_details
   WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'monitor-xxx')
   ORDER BY start_time DESC LIMIT 5;
   ```

3. Verify the Edge Function is deployed:

   ```bash
   supabase functions list
   ```

Alert Not Sending

1. Check monitoring_runs for alert_triggered status:

   ```sql
   SELECT * FROM raw_ops.monitoring_runs
   WHERE check_name = 'xxx'
   ORDER BY run_at DESC LIMIT 5;
   ```

2. Check ops-alerter Edge Function logs in the Supabase Dashboard

3. Verify the Slack webhook URL in environment variables

4. Test ops-alerter directly with a manual payload

High Duration Times

1. Check query performance:

   ```sql
   SELECT check_name, AVG(duration_ms), MAX(duration_ms)
   FROM raw_ops.monitoring_runs
   WHERE run_at > now() - interval '24 hours'
   GROUP BY check_name;
   ```

2. Review underlying view/function performance

3. Add missing indexes if needed


Version History

| Version | Date | Changes | Author |
|---------|------|---------|--------|
| 1.0 | 2026-01-17 | Initial release - migrated 6 monitors from Make.com | Anton |

Appendix: Full Monitor Function List

| Check Name | Function | View/Source |
|------------|----------|-------------|
| order_ingestion_slo | fn_check_order_ingestion_slo_v2 | check_order_ingestion_slo() |
| outbox_health | fn_check_outbox_health_v2 | v_outbox_health |
| stock_levels | fn_check_stock_levels_v2 | get_low_stock_alerts() |
| lab_sla | fn_check_lab_sla_v2 | v_lab_sla_tracking |
| daily_snapshot | fn_daily_snapshot_v2 | get_daily_snapshot() |
| working_capital | fn_check_working_capital_v2 | v_working_capital_metrics |