
Core Concepts

Key concepts and design patterns used in SimpleNS for reliable notification delivery.


Notification Lifecycle

States

A notification progresses through the following states:

Status        Description
pending       Queued for delivery (includes scheduled notifications waiting for their delivery time)
processing    Currently being delivered by the unified processor
delivered     Successfully sent to the provider and recipient
failed        All retry attempts exhausted or non-retryable error

Scheduled notifications don't have a separate state. They remain in pending status with a scheduled_at field. The Delayed Processor monitors these and publishes them to the appropriate channel topic when the scheduled time arrives.

State Transitions

pending → processing → delivered
pending → processing → failed (after all retries exhausted or non-retryable error)

Viewing Notification State

  1. Navigate to Events page
  2. Search by notification ID, email, or other criteria
  3. View current status and full history

Or query the logs in Grafana:

{notification_id="<id>"}

Transactional Outbox Pattern

The transactional outbox pattern ensures reliable message publishing even during system failures.

Why Outbox?

Problem: Writing to the database and publishing to Kafka as two separate operations can fail partially:

  • Database write succeeds, Kafka publish fails → notification lost
  • Kafka publish succeeds, database write fails → duplicate notification

Solution: The transactional outbox pattern guarantees atomicity and at-least-once delivery.

How It Works

The API writes both the notification and an outbox entry in a single MongoDB transaction:

session.withTransaction(async () => {
  // Pass the session so both writes join the transaction
  await notifications.insertOne(notification, { session });
  await outbox.insertOne({
    notification_id,
    status: 'pending',
    topic: `${channel}_notification`
  }, { session });
});

A background worker polls the outbox collection

Polls every 5 seconds (configurable via OUTBOX_POLL_INTERVAL_MS):

// CLAIM_TIMEOUT_MS (name assumed) is the window after which a claim is considered stale
const staleThreshold = new Date(Date.now() - CLAIM_TIMEOUT_MS);

const entries = await outbox.find({
  status: 'pending',
  $or: [
    { claimed_by: null },
    { claimed_at: { $lt: staleThreshold } }
  ]
}).limit(OUTBOX_BATCH_SIZE).toArray();

Worker claims entries (prevents duplicate processing)

Uses an atomic findOneAndUpdate keyed on the worker ID (the filter mirrors the poll query so stale claims can be taken over):

const claimed = await outbox.findOneAndUpdate(
  {
    _id: entry._id,
    $or: [{ claimed_by: null }, { claimed_at: { $lt: staleThreshold } }]
  },
  { $set: { claimed_by: WORKER_ID, claimed_at: new Date() } }
);
// A null result means another worker claimed the entry first; skip it.

Worker publishes to Kafka

Groups by topic and batch publishes:

await kafka.send({ 
  topic: entry.topic,  // e.g., 'email_notification'
  messages: [{ key: notification_id, value: payload }] 
});

Worker marks as published

Updates both outbox and notification status in a transaction:

await session.withTransaction(async () => {
  await outbox.updateOne({ _id: entry._id }, { $set: { status: 'published' } }, { session });
  await notifications.updateOne({ _id: entry.notification_id }, { $set: { status: 'processing' } }, { session });
});

A cleanup job removes old entries (configurable retention, default: 5 minutes)
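
A minimal sketch of the cleanup step, assuming published entries carry a published_at timestamp and the retention is exposed as OUTBOX_RETENTION_MS (both names assumed):

const cutoff = new Date(Date.now() - OUTBOX_RETENTION_MS);  // default: 5 minutes
await outbox.deleteMany({
  status: 'published',           // only remove entries already published
  published_at: { $lt: cutoff }  // field name assumed
});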

Benefits

  • Atomicity: Database and message queue stay consistent
  • Reliability: Messages never lost even if worker crashes
  • Idempotency: Duplicate processing prevented via claiming with timeout
  • Observability: The outbox collection shows publishing status

Retry Strategy

Exponential Backoff

When a notification delivery fails with a retryable error, SimpleNS automatically retries with increasing delays:

Delay (seconds) = base_delay * 2^retry_count

with base_delay = 1 second and retry_count starting at 0:

Retry 1: ~1 second
Retry 2: ~2 seconds
Retry 3: ~4 seconds
Retry 4: ~8 seconds
Retry 5: ~16 seconds
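
A one-line JavaScript rendering of the formula above (function name assumed):

function backoffDelayMs(retryCount, baseDelayMs = 1000) {
  return baseDelayMs * 2 ** retryCount;  // retryCount 0..4 -> 1s, 2s, 4s, 8s, 16s
}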

Configuration

MAX_RETRY_COUNT=5  # Maximum retry attempts (default: 5)

Retry Scenarios

Retries are triggered for:

  • Network errors (ECONNREFUSED, ETIMEDOUT)
  • Provider API errors (5xx responses)
  • Temporary failures (quota exceeded)
  • Rate limit exceeded (pushed to delayed queue)

No retries for:

  • Invalid credentials (permanent failure)
  • Validation errors (4xx responses)
  • Recipient not found
  • Schema validation failures

Non-retryable errors trigger the fallback provider logic if one is configured.
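
Illustratively, the classification might look like this inside a plugin (function name and error shape assumed; the status-code buckets follow the lists above):

function isRetryable(error) {
  if (error.code === 'ECONNREFUSED' || error.code === 'ETIMEDOUT') return true;  // network errors
  if (error.statusCode === 429) return true;  // rate limit exceeded
  if (error.statusCode >= 500) return true;   // provider 5xx
  return false;  // 4xx responses, invalid credentials, validation failures
}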

Retry Flow

Unified processor attempts delivery via plugin

Plugin returns { success: false, error: { message, retryable } }

Processor increments retry_count in the notification payload

If retry_count < MAX_RETRY_COUNT and error is retryable:

  • Calculate backoff delay
  • Publish to delayed_notification topic
  • Delayed processor re-queues for retry

Else:

  • Mark notification as failed
  • Update idempotency cache in Redis
  • Admin can manually retry from dashboard
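
Put together, the decision step looks roughly like this (a sketch assuming the names above; producer is an assumed Kafka producer handle, and setFailed is the idempotency helper described later):

if (result.error.retryable && notification.retry_count < MAX_RETRY_COUNT) {
  const delayMs = backoffDelayMs(notification.retry_count);
  await producer.send({
    topic: 'delayed_notification',
    messages: [{
      key: notification.notification_id,
      // deliver_at is an assumed field carrying the computed backoff target
      value: JSON.stringify({ ...notification, deliver_at: Date.now() + delayMs })
    }]
  });
} else {
  await notifications.updateOne(
    { _id: notification._id },
    { $set: { status: 'failed' } }
  );
  await setFailed(notification.notification_id, notification.retry_count);
}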

Manual Retries

From the Dashboard:

  1. Go to Failed page
  2. Select notifications
  3. Click Retry (individual or bulk up to 100)
  4. Notifications re-enter the delivery pipeline

Rate Limiting

Token Bucket Algorithm

Each provider has a token bucket configured in simplens.config.yaml:

options:
  rateLimit:
    maxTokens: 100      # Bucket capacity
    refillRate: 10      # Tokens added per second

How It Works

Initialization: Bucket starts full (e.g., 100 tokens)

Consumption: Each notification consumes 1 token

Refill: Every second, add refillRate tokens (max = maxTokens)

Empty bucket: If no tokens remain, the processor increments retry_count and pushes the notification to the delayed queue

Implementation (Redis Lua Script)

The rate limiter uses atomic Lua scripts for thread-safety:

-- Get current state
local current_tokens = tonumber(redis.call('GET', tokens_key)) or max_tokens
local last_refill = tonumber(redis.call('GET', last_refill_key)) or now

-- Calculate tokens to add based on elapsed time
local elapsed_seconds = (now - last_refill) / 1000
local tokens_to_add = elapsed_seconds * refill_rate
local new_tokens = math.min(current_tokens + tokens_to_add, max_tokens)

-- Try to consume a token
if new_tokens >= 1 then
  new_tokens = new_tokens - 1
  redis.call('SET', tokens_key, new_tokens)
  redis.call('SET', last_refill_key, now)
  return { 1, new_tokens }  -- allowed
else
  -- Milliseconds until the next token becomes available
  local wait_time = math.ceil(((1 - new_tokens) / refill_rate) * 1000)
  return { 0, new_tokens, wait_time }  -- denied
end
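
On the Node side, the script can be invoked with an eval call along these lines (an ioredis-style sketch; it assumes the script reads its keys from KEYS and max_tokens, refill_rate, and now from ARGV):

const [allowed, tokensLeft, waitMs] = await redis.eval(
  rateLimitScript,           // the Lua script above
  2,                         // number of KEYS
  tokensKey, lastRefillKey,  // KEYS[1], KEYS[2]
  maxTokens, refillRate, Date.now()  // ARGV[1..3]
);
if (allowed === 0) {
  // Bucket empty: the processor re-queues via the delayed queue (see Retry Strategy)
}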

Default Configuration

If no rate limit is configured in the plugin, SimpleNS uses:

defaultRateLimit = {
  maxTokens: 100,
  refillRate: 10  // tokens per second
}

Benefits

  • Prevents provider rate limit errors
  • Maximizes throughput while staying within limits

Idempotency

Purpose

Prevent duplicate notifications when:

  • Client retries request (network error)
  • System crashes and retries
  • Kafka redelivers same message

Implementation

Idempotency is enforced using the auto-generated notification_id with atomic Lua scripts:

Redis Key Pattern:

idempotency:<notification_id>

Stored Record:

{
  status: 'processing' | 'delivered' | 'failed',
  retry_count: number,
  updated_at: ISO8601
}

Flow

Processor receives notification from Kafka

Atomic lock acquisition (Lua script):

const lockResult = await tryAcquireProcessingLock(notificationId, retry_count);
// Returns: { canProcess: true/false, isRetry: true/false }

The script:

  • Returns 0 if already delivered (skip)
  • Returns 0 if already processing (skip)
  • Returns 1 if first time (lock acquired)
  • Returns 2 if previously failed (retry allowed, lock acquired)
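
To make these return codes concrete, here is a non-atomic JavaScript paraphrase of the decision (illustration only; the real script performs this check-and-set atomically in Lua, precisely to avoid the race this version would have):

async function tryAcquireProcessingLock(notificationId, retryCount) {
  const key = `idempotency:${notificationId}`;
  const raw = await redis.get(key);
  const record = raw ? JSON.parse(raw) : null;
  if (record && record.status === 'delivered') {
    return { canProcess: false, isRetry: false };  // script returns 0: skip
  }
  if (record && record.status === 'processing') {
    return { canProcess: false, isRetry: false };  // script returns 0: skip
  }
  await redis.set(key, JSON.stringify({
    status: 'processing',
    retry_count: retryCount,
    updated_at: new Date().toISOString()
  }), 'EX', PROCESSING_TTL_SECONDS);
  return { canProcess: true, isRetry: record !== null };  // 1 first time, 2 after a failure
}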

Deliver notification

Update cache on success:

await setDelivered(notificationId, retry_count);
// Sets key with TTL = IDEMPOTENCY_TTL_SECONDS

Update cache on failure:

await setFailed(notificationId, retry_count);
// Allows future retry attempts

TTL Configuration

IDEMPOTENCY_TTL_SECONDS=86400  # 24 hours (default)
PROCESSING_TTL_SECONDS=120    # 2 minutes lock timeout (default)

Increase IDEMPOTENCY_TTL_SECONDS for longer idempotency guarantees, decrease to save Redis memory.


Webhooks

Send real-time delivery status updates to your application.

Configuration

Include webhook_url in notification request:

{
  "webhook_url": "https://your-app.com/webhooks/notifications",
  "channel": ["email"],
  ...
}

Webhook Payload

SimpleNS sends a POST request to your webhook URL:

{
  "request_id": "req-456",
  "client_id": "client-123",
  "notification_id": "abc123",
  "status": "DELIVERED",
  "channel": "email",
  "message": "Notification sent successfully",
  "occurred_at": "2024-01-01T12:00:00.000Z"
}

Webhook Retries

If webhook delivery fails:

  • Retry with exponential backoff (3 attempts max)
  • Timeout: 30 seconds per attempt
  • 5xx errors: Retry with backoff (1s, 2s, 4s)
  • 4xx errors: No retry (client error)
  • Final failure logged but doesn't affect notification status

[!IMPORTANT] The notification status is updated in MongoDB regardless of webhook success. Webhook failures are logged but don't block the delivery pipeline.

Security

Verify webhook source:

  1. Use HTTPS endpoints only
  2. Whitelist SimpleNS IP addresses
  3. Validate payload structure

Handle idempotency:

  • Process duplicate webhooks gracefully
  • Use notification_id for deduplication
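
A hypothetical Express receiver that ties these recommendations together (the endpoint path matches the configuration example above; the in-memory Set stands in for a durable store):

import express from 'express';

const app = express();
app.use(express.json());

const seen = new Set();  // use Redis or your database in production

app.post('/webhooks/notifications', (req, res) => {
  const { notification_id, status } = req.body;
  if (!notification_id || !status) return res.sendStatus(400);  // validate payload structure
  if (seen.has(notification_id)) return res.sendStatus(200);    // duplicate: ack without reprocessing
  seen.add(notification_id);
  // ...update application state from `status`...
  res.sendStatus(200);
});

app.listen(3000);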

Debugging

Check webhook logs:

docker-compose logs worker | grep webhook

Grafana query:

{service="worker"} |= "webhook"

Crash Recovery

The Recovery Service detects notifications stuck due to crashes or timeouts and creates alerts for manual inspection.

Detection

Every 60 seconds (configurable via RECOVERY_POLL_INTERVAL_MS), the service scans for:

1. Stuck Processing Notifications

status === 'processing' AND updated_at < now - PROCESSING_STUCK_THRESHOLD_MS

Cross-references with Redis idempotency status to detect ghost deliveries.
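
For rule 1, the scan might look like this (a sketch assuming the field names used above):

const threshold = new Date(Date.now() - PROCESSING_STUCK_THRESHOLD_MS);
const stuck = await notifications
  .find({ status: 'processing', updated_at: { $lt: threshold } })
  .limit(RECOVERY_BATCH_SIZE)
  .toArray();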

2. Orphaned Pending Notifications

status === 'pending' AND (
  // Non-scheduled: updated_at older than threshold
  (scheduled_at == null AND updated_at < now - PENDING_STUCK_THRESHOLD_MS) OR
  // Scheduled: delivery time passed more than the threshold ago
  (scheduled_at != null AND scheduled_at < now - PENDING_STUCK_THRESHOLD_MS)
)

3. Ghost Delivery Notifications

redis_status == 'delivered' AND notification_status == 'processing'

Configuration

RECOVERY_POLL_INTERVAL_MS=60000            # Scan every 60 seconds
PROCESSING_STUCK_THRESHOLD_MS=300000       # 5 minutes
PENDING_STUCK_THRESHOLD_MS=300000          # 5 minutes
RECOVERY_BATCH_SIZE=50                     # Process 50 per scan
CLEANUP_RESOLVED_ALERTS_RETENTION_MS=86400000  # 24 hours

Recovery Logic

When a stuck notification is detected:

Claim the notification atomically (prevents races between recovery workers):
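
A minimal sketch of the claim, using the recovery_claimed_by field mentioned at the end of this section (recovery_claimed_at is an assumed companion field):

const claimed = await notifications.findOneAndUpdate(
  { _id: doc._id, recovery_claimed_by: null },
  { $set: { recovery_claimed_by: WORKER_ID, recovery_claimed_at: new Date() } }
);
// null result: another recovery worker already holds the claim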

Check Redis idempotency status:

  • delivered → Ghost delivery! Auto-update MongoDB, create status outbox entry
  • failed with retries remaining → Create alert for manual retry
  • failed with max retries → Mark as failed, create status outbox entry
  • processing or no record → Create alert for manual inspection

Create alert (if not auto-resolved):

{
  alert_type: 'stuck_processing' | 'orphaned_pending',
  notification_id: '...',
  reason: 'Notification stuck in processing...',
  redis_status: 'processing',
  db_status: 'processing',
  retry_count: 2,
  resolved: false
}

Dashboard displays alerts for manual intervention

Resolution

  1. Navigate to Alerts page
  2. View alert details
  3. Actions:
    • Retry: Re-publish notification to Kafka
    • Resolve: Mark alert as resolved (keeps notification failed)

Automatic handling:

  • Ghost deliveries: Auto-resolved when Redis status is delivered
  • Auto-resolve: Alerts for delivered/failed notifications are cleaned up
  • Cleanup: Resolved alerts deleted after 24 hours (configurable)

Why Crashes Happen

  • Worker process killed (OOM, SIGKILL)
  • Network partition
  • Kafka connection lost
  • Processing timeout exceeded

The Recovery Service ensures no notification is permanently stuck. It uses distributed locking (the recovery_claimed_by field) to support multiple recovery worker instances.


Scheduled Delivery

Queue notifications for future delivery using the scheduled_at field.

Usage

{
  "scheduled_at": "2024-12-31T23:59:59Z",  // ISO 8601 UTC
  "channel": ["email"],
  ...
}

How It Works

API receives notification with scheduled_at

Background worker publishes to delayed_notification topic

Delayed processor consumes and adds to Redis ZSET

await redis.zadd('delayed_queue', scheduled_timestamp, JSON.stringify(event));

Poller runs every 1 second, using a three-phase claim/process/confirm protocol:

Claim Phase: Atomically find and lock due events

-- Get due events
local events = redis.call('ZRANGEBYSCORE', queue_key, '-inf', now, 'LIMIT', 0, limit)
-- Claim each with worker ID
redis.call('SET', claim_key, worker_id, 'NX', 'EX', 60)

Process Phase: Publish to channel topic (e.g., email_notification)

Confirm Phase: Remove from queue ONLY after successful publish

-- Verify we still hold the claim
if redis.call('GET', claim_key) == worker_id then
    redis.call('ZREM', queue_key, event)
    redis.call('DEL', claim_key)
end

Unified processor delivers (same as immediate notification)

Poller Failure Handling

If publishing to Kafka fails:

  • Increment poller retry count
  • Re-add to queue with exponential backoff (doubling from 5s up to a 60s cap; see the sketch after this list)
  • If max retries exceeded (MAX_POLLER_RETRIES=3), mark as failed
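
A sketch of the re-queue step under that backoff (poller_retry_count is an assumed field name on the queued event):

const delayMs = Math.min(5000 * 2 ** event.poller_retry_count, 60000);
await redis.zadd('delayed_queue', Date.now() + delayMs, JSON.stringify({
  ...event,
  poller_retry_count: event.poller_retry_count + 1
}));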

Configuration

DELAYED_POLL_INTERVAL_MS=1000   # Check every 1 second
DELAYED_BATCH_SIZE=10           # Process 10 at a time
MAX_POLLER_RETRIES=3            # Max retries before failure

Use Cases

  • Reminders: "Meeting in 1 hour"
  • Follow-ups: "Haven't heard from you in 3 days"
  • Campaigns: "Black Friday sale starts at midnight"
  • Drip campaigns: Series of emails over days/weeks
