SOP-DLV-01: Courier Watchdog v4.0¶
Proactive delivery exception detection, customer notification, and courier event traceability
Document ID: SOP-DLV-01-v4.0 Version: 4.0 Status: ✅ Production Ready Last Updated: 2026-03-19 Owner: Protocol Raw Operations Replaces: SOP-02-v3.0 Review Date: 2026-06-19
Key Changes in v4.0¶
- ✅ Removed Make.com from the notification delivery path (Customer.io called directly via Edge Function)
- ✅ Added explicit retry logic with exponential backoff (5 attempts: immediate, 5m, 15m, 1h, 6h)
- ✅ Added notification state tracking columns on
ops_events(notification_status,notification_attempts,notification_last_error,notification_next_retry_at) - ✅ Rebuilt monitoring to SOP-MON-01 pattern (
fn_check_courier_watchdog_health_v2viarun-monitor) - ✅ Added idempotency index on
courier_events(shipment_id,event_code,event_time) - ✅ Enabled RLS on
courier_events,courier_event_mapping, andshipments - ✅ Converted SOP from PDF to markdown
- ✅ Renamed from SOP-02 to SOP-DLV-01 (delivery operations namespace)
- ❌ DPD inbound webhook still via Make.com (pending DPD UK API confirmation)
Purpose¶
The Courier Watchdog system proactively detects delivery exceptions reported by DPD, notifies customers with reassuring branded emails, and maintains full traceability of all courier events. It operates automatically with minimal manual intervention using real-time database triggers.
The system exists to protect the customer experience during delivery. When something goes wrong (missed delivery, weather delay, address issue, package damage), the customer should hear from Protocol Raw before they need to chase it themselves.
Scope¶
Applies to: All DPD courier exception events for Protocol Raw shipments
Components:
- 4 database tables (courier_events, courier_event_mapping, shipments, ops_events)
- 1 PostgreSQL processing function (process_courier_event)
- 1 PostgreSQL trigger function (notify_courier_exception)
- 1 Edge Function for notification delivery (send-courier-notification)
- 1 PostgreSQL retry function (fn_retry_courier_notifications_v1)
- 1 PostgreSQL monitoring function (fn_check_courier_watchdog_health_v2)
- 1 Customer.io campaign with 5 email templates
- 2 pg_cron jobs (retry every 5 min, health check every 15 min)
Related SOPs: - SOP-MON-01: Monitoring and Alerting Architecture (monitoring pattern) - SOP-ORD-01: Order Lifecycle (upstream, creates shipments) - SOP-PACK-01: Pick, Pack and Dispatch (upstream, dispatches shipments) - SOP-CS-00: Customer Operations System (delivery tracking in Customer Portal) - SOP-LC-01: Lifecycle Communications (dispatch/delivery confirmation emails)
System Architecture¶
Data Flow¶
DPD Exception Event
↓
Make.com Webhook (inbound only)
↓
raw_ops.process_courier_event()
↓
courier_events (logged) + ops_events (CUSTOMER_EVENT created)
↓
Database Trigger: trigger_notify_courier_exception
↓
notify_courier_exception() → pg_net
↓
Edge Function: send-courier-notification
↓
Customer.io Track API (courier_exception event)
↓
Campaign: "Courier Exception Handler" (multi-split by notification_template_id)
↓
Branded email to customer
Processing time: Less than 10 seconds end-to-end from DPD event to customer inbox.
Component Responsibilities¶
Make.com (inbound only): Receives DPD webhook events and calls process_courier_event() via Supabase REST API. This is the only Make.com involvement. Make.com does not handle notification delivery or monitoring.
PostgreSQL (process_courier_event): Core processing function. Looks up the event code in courier_event_mapping, logs to courier_events, retrieves customer and shipment data, and creates a CUSTOMER_EVENT in ops_events with the full notification payload.
Database trigger (notify_courier_exception): Fires on INSERT to ops_events where kind = 'CUSTOMER_EVENT' and customer_event_type = 'courier_exception'. Calls the Edge Function via pg_net (async, non-blocking).
Edge Function (send-courier-notification): Delivers the notification to Customer.io Track API directly. Handles idempotency (safe to call multiple times for the same event), success/failure tracking, and exponential backoff scheduling for retries.
pg_cron retry processor: Runs every 5 minutes. Picks up events where notification_status = 'pending' and notification_next_retry_at <= now(), re-invokes the Edge Function. Maximum 5 attempts per event.
pg_cron health monitor: Runs every 15 minutes via run-monitor. Checks for stuck and permanently failed notifications, alerts via Slack following SOP-MON-01 pattern.
Event Code Mapping¶
| Code | Event Type | Template ID | Customer Message | Priority |
|---|---|---|---|---|
| 01 | customer_not_home | missed_delivery_retry | We missed you today. Will retry tomorrow | medium |
| 02 | premises_closed | missed_delivery_retry | Premises closed. Will retry tomorrow | medium |
| 03 | address_issue | address_problem | Address issue. Please confirm details | high |
| 08 | access_issue | access_problem | Driver unable to access. Will retry | medium |
| 09 | weather_delay | weather_delay | Weather delay. Box is safe | low |
| 12 | damaged_in_transit | package_damage_review | Package damage. Under review | high |
Source table: raw_ops.courier_event_mapping
SELECT event_code, event_type, notification_template_id, requires_customer_notification, priority
FROM raw_ops.courier_event_mapping
ORDER BY event_code;
Email Templates (Customer.io)¶
All templates live in the Customer.io campaign "Courier Exception Handler". The campaign uses a multi-split branch on the notification_template_id event attribute.
Template 1: Missed Delivery (Will Retry)
- Template ID: missed_delivery_retry
- Subject: Delivery update: We're monitoring your box
- Trigger codes: 01, 02
- Tone: Reassuring, proactive
- Key message: Packaging is over-engineered for 48-72 hour delays
Template 2: Access Issue
- Template ID: access_problem
- Subject: Delivery update: Access issue detected
- Trigger code: 08
- Tone: Helpful, solution-focused
- Key message: Will retry with updated instructions
Template 3: Weather Delay
- Template ID: weather_delay
- Subject: Delivery update: Weather delay in your area
- Trigger code: 09
- Tone: Calm, reassuring
- Key message: Safety first, packaging handles delays
Template 4: Address Issue (High Priority)
- Template ID: address_problem
- Subject: Action needed: Address issue detected
- Trigger code: 03
- Tone: Urgent but helpful
- Key message: Customer action required to resolve
- SLA: 2-hour manual follow-up
Template 5: Package Damage (Automated)
- Template ID: package_damage_review
- Subject: Important: Issue with your Protocol Raw delivery
- Trigger code: 12
- Tone: Clear, systematic
- Key message: Team reviewing, will contact within 4 hours
- SLA: 4-hour resolution
Notification Delivery¶
Edge Function: send-courier-notification¶
Location: supabase/functions/send-courier-notification/index.ts
JWT Verification: Disabled (internal calls from pg_net)
Secrets Required: CUSTOMERIO_SITE_ID, CUSTOMERIO_API_KEY
Behaviour:
- Receives
{ event_id }from the database trigger or retry processor - Fetches the event from
ops_events - Idempotency check: If
processed_atis already set ornotification_status = 'sent', returns success without re-sending - Skip check: If no email in the payload, marks as
notification_status = 'skipped'and returns - Sends
courier_exceptionevent to Customer.io Track API with the full payload - On success: sets
processed_at,notification_status = 'sent', incrementsnotification_attempts - On failure: increments
notification_attempts, records error innotification_last_error, calculates next retry time with exponential backoff, setsnotification_status = 'pending'(or'failed'if max attempts exhausted) - On max retries exhausted (5 attempts): sends alert to
#ops-alertsviaops-alerter
Customer.io Payload¶
The Edge Function sends to the Customer.io Track API v1:
POST https://track.customer.io/api/v1/customers/{email}/events
Authorization: Basic {base64(site_id:api_key)}
{
"name": "courier_exception",
"data": {
"notification_template_id": "missed_delivery_retry",
"event_type": "customer_not_home",
"event_description": "Customer not home - will retry tomorrow",
"first_name": "Alice",
"tracking_number": "DPD123456789GB",
"tracking_url": "https://track.dpd.co.uk/..."
}
}
Note: Tracking info is sourced from nested payload fields at payload.shipment.tracking_no and payload.shipment.tracking_url.
Retry Logic¶
| Attempt | Backoff | Cumulative Wait |
|---|---|---|
| 1 | Immediate (trigger fires) | 0 |
| 2 | 5 minutes | 5 min |
| 3 | 15 minutes | 20 min |
| 4 | 1 hour | 1 hr 20 min |
| 5 | 6 hours | 7 hr 20 min |
After 5 failed attempts, the event is marked notification_status = 'failed' and an alert is sent to Slack.
Notification Status Values¶
| Status | Meaning |
|---|---|
| NULL | Not a notification event, or not yet attempted |
| pending | Awaiting delivery or retry |
| sent | Successfully delivered to Customer.io |
| failed | Exhausted all retry attempts |
| skipped | No notification needed (e.g. no email in payload) |
Database Schema¶
Tables¶
raw_ops.courier_events — Complete log of all courier events
-- Key columns (inspect with \d raw_ops.courier_events for full schema)
-- shipment_id UUID REFERENCES raw_ops.shipments(id)
-- event_code TEXT
-- event_type TEXT
-- event_text TEXT
-- event_time TIMESTAMPTZ
-- requires_customer_notification BOOLEAN
-- raw_payload JSONB
-- created_at TIMESTAMPTZ DEFAULT now()
raw_ops.courier_event_mapping — Maps DPD event codes to notification templates
-- courier_name TEXT (e.g. 'DPD')
-- event_code TEXT
-- event_type TEXT
-- notification_template_id TEXT
-- requires_customer_notification BOOLEAN
-- priority TEXT
raw_ops.ops_events — Notification queue (courier events create CUSTOMER_EVENT rows)
Notification tracking columns (added in v4.0):
-- notification_attempts INTEGER DEFAULT 0
-- notification_last_error TEXT
-- notification_status TEXT DEFAULT NULL (NULL, pending, sent, failed, skipped)
-- notification_next_retry_at TIMESTAMPTZ
Indexes¶
-- Shipment lookups during event processing
CREATE INDEX idx_courier_events_shipment_id
ON raw_ops.courier_events(shipment_id);
-- Time-based queries for recent exceptions
CREATE INDEX idx_courier_events_created_at
ON raw_ops.courier_events(created_at DESC);
-- Monitoring unprocessed events (partial index)
CREATE INDEX idx_ops_events_courier_unprocessed
ON raw_ops.ops_events(created_at DESC)
WHERE processed_at IS NULL
AND kind = 'CUSTOMER_EVENT'
AND customer_event_type = 'courier_exception';
-- Event code mapping lookups
CREATE INDEX idx_courier_event_mapping_code
ON raw_ops.courier_event_mapping(courier_name, event_code);
-- Idempotency guard (v4.0)
CREATE UNIQUE INDEX idx_courier_events_idempotency
ON raw_ops.courier_events(shipment_id, event_code, event_time);
Row Level Security¶
All tables have RLS enabled. No policies defined. Service role bypasses RLS. Anon role blocked.
| Table | RLS Enabled |
|---|---|
| courier_events | ✅ |
| courier_event_mapping | ✅ |
| shipments | ✅ |
| ops_events | ✅ |
Core Processing Function¶
-- raw_ops.process_courier_event(
-- p_shipment_id UUID,
-- p_courier_name TEXT,
-- p_event_code TEXT,
-- p_event_text TEXT,
-- p_event_time TIMESTAMPTZ,
-- p_raw_payload JSONB
-- )
--
-- Called by Make.com inbound webhook. Performs:
-- 1. Looks up event code in courier_event_mapping
-- 2. Logs event to courier_events
-- 3. Retrieves customer and shipment data
-- 4. Creates CUSTOMER_EVENT in ops_events with full notification payload
-- 5. Returns notification instructions
Trigger Function¶
-- raw_ops.notify_courier_exception()
--
-- Trigger: AFTER INSERT ON raw_ops.ops_events
-- Condition: NEW.kind = 'CUSTOMER_EVENT' AND NEW.customer_event_type = 'courier_exception'
--
-- Action: Calls send-courier-notification Edge Function via pg_net
-- with { "event_id": NEW.id }
Retry Processor¶
-- raw_ops.fn_retry_courier_notifications_v1()
--
-- Schedule: Every 5 minutes via pg_cron
-- Picks up: ops_events where
-- kind = 'CUSTOMER_EVENT'
-- customer_event_type = 'courier_exception'
-- processed_at IS NULL
-- notification_status IS NULL OR 'pending'
-- notification_attempts < 5
-- notification_next_retry_at IS NULL OR <= now()
-- created_at > now() - interval '24 hours'
--
-- Processes up to 20 events per run (SKIP LOCKED to prevent double-firing)
-- Capacity: 240 retries/hour (sufficient for Phase A volumes)
Monitoring¶
Health Check: courier_watchdog_health¶
Function: raw_ops.fn_check_courier_watchdog_health_v2()
Schedule: Every 15 minutes via run-monitor
Pattern: SOP-MON-01 (pg_cron → pg_net → run-monitor → PostgreSQL function → monitoring_runs → ops-alerter)
Checks:
- Stuck events: Unprocessed
CUSTOMER_EVENTcourier exceptions older than 1 hour (within 24-hour lookback) - Permanently failed notifications: Events with
notification_status = 'failed'in the last 24 hours
Alert Thresholds:
| Condition | Severity | Channel |
|---|---|---|
| 3+ failed notifications in 24h | critical | #ops-urgent |
| 1-2 failed notifications in 24h | warning | #ops-alerts |
| Events stuck for more than 60 minutes | warning | #ops-alerts |
| No issues | No alert | — |
Verify monitoring is running:
SELECT check_name, status, duration_ms, result_summary, run_at
FROM raw_ops.monitoring_runs
WHERE check_name = 'courier_watchdog_health'
ORDER BY run_at DESC
LIMIT 5;
pg_cron Jobs¶
| Job Name | Schedule | Purpose |
|---|---|---|
dlv-01-retry-notifications |
*/5 * * * * |
Retry pending notifications with exponential backoff |
monitor-dlv-01-watchdog-health |
*/15 * * * * |
Health check via run-monitor |
View jobs:
SELECT jobid, jobname, schedule, active
FROM cron.job
WHERE jobname LIKE 'dlv-01%' OR jobname LIKE 'monitor-dlv-01%'
ORDER BY jobname;
Operational Queries¶
Check Unprocessed Events¶
SELECT
id,
customer_event_type,
message,
notification_status,
notification_attempts,
notification_last_error,
notification_next_retry_at,
created_at,
event_payload->>'email' AS email
FROM raw_ops.ops_events
WHERE processed_at IS NULL
AND kind = 'CUSTOMER_EVENT'
AND customer_event_type = 'courier_exception'
ORDER BY created_at DESC;
View Recent Courier Exceptions¶
SELECT
ce.event_code,
ce.event_type,
ce.event_text,
ce.event_time,
s.tracking_no,
c.email
FROM raw_ops.courier_events ce
JOIN raw_ops.shipments s ON ce.shipment_id = s.id
JOIN raw_ops.orders o ON s.order_id = o.id
JOIN raw_ops.customers c ON o.customer_id = c.id
WHERE ce.created_at > NOW() - INTERVAL '7 days'
ORDER BY ce.event_time DESC;
Exception Summary by Type¶
SELECT
event_type,
COUNT(*) AS total_events,
COUNT(*) FILTER (WHERE requires_customer_notification) AS notified
FROM raw_ops.courier_events
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY event_type
ORDER BY total_events DESC;
Notification Delivery Success Rate¶
SELECT
notification_status,
COUNT(*) AS count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM raw_ops.ops_events
WHERE kind = 'CUSTOMER_EVENT'
AND customer_event_type = 'courier_exception'
AND created_at > NOW() - INTERVAL '30 days'
GROUP BY notification_status
ORDER BY count DESC;
Average Time-to-Notify¶
SELECT
ROUND(AVG(EXTRACT(EPOCH FROM (processed_at - created_at))), 1) AS avg_seconds,
ROUND(MAX(EXTRACT(EPOCH FROM (processed_at - created_at))), 1) AS max_seconds,
COUNT(*) AS sample_size
FROM raw_ops.ops_events
WHERE kind = 'CUSTOMER_EVENT'
AND customer_event_type = 'courier_exception'
AND notification_status = 'sent'
AND created_at > NOW() - INTERVAL '30 days';
Troubleshooting¶
Event Not Sending to Customer.io¶
Check 1: Is the event in ops_events?
SELECT id, notification_status, notification_attempts, notification_last_error, processed_at
FROM raw_ops.ops_events
WHERE customer_event_type = 'courier_exception'
ORDER BY created_at DESC LIMIT 5;
Check 2: Did the Edge Function fire?
Go to Supabase Dashboard → Edge Functions → send-courier-notification → Logs. Check for recent invocations and any errors.
Check 3: Is the retry processor running?
SELECT *
FROM cron.job_run_details
WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'dlv-01-retry-notifications')
ORDER BY start_time DESC LIMIT 5;
Check 4: Is the notification permanently failed?
If notification_status = 'failed' and notification_attempts = 5, the event has exhausted retries. Check notification_last_error for the root cause. Common causes:
- Customer.io API credentials invalid or expired
- Customer.io API rate limited
- Invalid email address in payload
- Customer.io campaign paused or deleted
Manual re-trigger (after fixing root cause):
UPDATE raw_ops.ops_events
SET
notification_status = 'pending',
notification_attempts = 0,
notification_last_error = NULL,
notification_next_retry_at = NULL
WHERE id = '[EVENT_ID]';
The retry processor will pick it up within 5 minutes.
Email Not Sending from Customer.io¶
Check 1: Campaign "Courier Exception Handler" is active in Customer.io
Check 2: Event appears in the customer's activity log in Customer.io
Check 3: Multi-split branch condition matches: Event attribute notification_template_id equals the expected value
Wrong Email Template Sent¶
SELECT * FROM raw_ops.courier_event_mapping
WHERE event_code = '[CODE]';
-- Fix:
UPDATE raw_ops.courier_event_mapping
SET notification_template_id = '[CORRECT_TEMPLATE]'
WHERE event_code = '[CODE]';
Testing¶
Create Test Event¶
-- Get a real shipment ID
SELECT id FROM raw_ops.shipments LIMIT 1;
-- Insert test event
INSERT INTO raw_ops.ops_events (
kind, entity_type, entity_id, customer_event_type,
message, event_payload, created_at, processed_at
) VALUES (
'CUSTOMER_EVENT', 'shipment',
'[SHIPMENT_ID]'::uuid,
'courier_exception',
'TEST: Direct notification path',
jsonb_build_object(
'email', 'test@example.com',
'first_name', 'Test',
'notification_template_id', 'missed_delivery_retry',
'event_type', 'customer_not_home',
'event_description', 'Testing direct Customer.io path',
'shipment', jsonb_build_object(
'tracking_no', 'TEST123456789GB',
'tracking_url', 'https://track.dpd.co.uk/test'
)
),
NOW(), NULL
);
Verify¶
processed_atshould be set within 10 secondsnotification_statusshould be'sent'notification_attemptsshould be1- Customer.io shows the event in the test customer's activity (if using a real email)
Cleanup¶
Manual Interventions¶
High Priority Events (approximately 2% of exceptions)¶
Address Issue (Code 03): - Customer receives automated email asking to confirm address - Action: Monitor replies, manually update DPD if needed - SLA: Respond within 2 hours - Volume at 100k customers: approximately 600/year (50/month)
Damaged Package (Code 12): - Customer receives automated "under review" email - Action: Manual assessment of damage, approve replacement/refund - SLA: Contact customer within 4 hours - Volume at 100k customers: approximately 300-500/year (25-42/month)
Customer Refused (Code 05): - No automatic email (rare edge case) - Action: Contact customer to understand reason - SLA: Reach out within 24 hours - Volume at 100k customers: fewer than 50/year
Success Metrics¶
Operational KPIs¶
| Metric | Target |
|---|---|
| Event processing time | Less than 10 seconds |
| Notification success rate | Greater than 99% |
| Manual intervention rate | Less than 2% |
| Query performance | Less than 10ms |
| Automated monitoring coverage | 100% |
Customer Experience KPIs¶
| Metric | Target | Tracking |
|---|---|---|
| Email open rate | Greater than 60% | Customer.io dashboard |
| Customer reply rate | Track for feedback | Manual review |
| Delivery success after exception | Greater than 95% | DPD final delivery data |
System Health KPIs¶
| Metric | Target | Alert Threshold |
|---|---|---|
| Stuck events | 0 | Any event stuck more than 1 hour |
| Notification failures | 0 | 1+ permanently failed in 24h |
| Edge Function errors | Less than 0.1% | Greater than 1% error rate |
| Customer.io delivery failures | Less than 0.1% | Greater than 1% failure rate |
Cost and Capacity¶
Assumptions: - Exception rate: 2-3% of shipments (industry standard) - Average customer: 12 deliveries/year (1 box/month) - Manual intervention rate: 2% of exceptions (high-priority only)
| Customers | Shipments/Year | Exceptions/Year | Exceptions/Day | Edge Function Calls | Customer.io Emails |
|---|---|---|---|---|---|
| 1,000 | 12,000 | 300-360 | 1 | ~360 | ~300 |
| 5,000 | 60,000 | 1,500-1,800 | 4-5 | ~1,800 | ~1,500 |
| 10,000 | 120,000 | 3,000-3,600 | 8-10 | ~3,600 | ~3,000 |
| 25,000 | 300,000 | 7,500-9,000 | 21-25 | ~9,000 | ~7,500 |
| 100,000 | 1,200,000 | 30,000-36,000 | 82-99 | ~36,000 | ~30,000 |
All volumes comfortably within Supabase Edge Function limits and Customer.io plan capacity. No architectural changes required at any scale.
Outstanding Items¶
DPD UK Direct API (Tier 2)¶
The inbound DPD event path currently runs through a Make.com webhook. This is the last Make.com dependency in the Courier Watchdog system.
Action: Confirm with DPD account manager (via Jamie) whether their UK API supports webhook callbacks (e.g. /status/events/subscribetoparcel). If yes, build a dpd-webhook-handler Edge Function and subscribe on shipment creation, removing Make.com from the inbound path entirely. If no, Make.com stays as the legitimate inbound bridge.
Metabase Dashboard¶
Build a Courier Watchdog dashboard with: - Exception patterns by type (30-day rolling) - Notification success rate - Average time-to-notify - Exception volume trends - Failed notification log
Endpoint Security (Phase B)¶
send-courier-notification currently runs with --no-verify-jwt. Acceptable for Phase A (internal calls only via pg_net). Harden with JWT verification or shared secret in Phase B.
Review Schedule¶
Weekly: Analyse exception patterns by event code, review customer responses to emails, check for new DPD event codes.
Monthly: Review email open/click rates (target greater than 60%), update templates based on customer feedback, optimise event code mapping if needed.
Version History¶
| Version | Date | Changes | Author |
|---|---|---|---|
| 4.0 | 2026-03-19 | Removed Make.com from notification path, added retry with exponential backoff, rebuilt monitoring to SOP-MON-01 pattern, added notification state tracking, idempotency index, RLS, renamed to SOP-DLV-01, converted from PDF to markdown | Anton |
| 3.0 | 2025-11-02 | Switched from polling to real-time webhooks, added database indexes, added automated Slack monitoring, added damaged package template, 98% automation coverage | Anton |
| 2.0 | 2025-10-24 | Added automatic event processing function, integrated Customer.io Event Bridge, created 4 branded email templates, 95% automation coverage | Anton |
| 1.0 | 2025-10-21 | Initial pilot version, manual event logging, 50-60% automation coverage, basic notification system | Anton |