Architecture
Understanding SimpleNS's system architecture, components, and design decisions.
High-Level Architecture

SimpleNS follows an event-driven, microservices-inspired architecture where each component has a specific responsibility. The system is built around the outbox pattern for reliability and uses Kafka for event streaming.
Deployment Architecture
SimpleNS uses split Docker Compose files for flexible deployment:
- `docker-compose.yaml` - Application services (uses pre-built images from GHCR)
- `docker-compose.infra.yaml` - Infrastructure services (MongoDB, Kafka, Redis, Loki)
- `docker-compose.dev.yaml` - All-in-one development environment
See the Self-Hosting Guide for deployment options.
Core Principles
Separation of Concerns
Orchestration (SimpleNS Core) handles retries, rate limiting, scheduling, and recovery. Delivery (Plugins) handles the actual notification sending via provider APIs.
Plugin-Based Extensibility
Swap providers without changing application code. Community-driven ecosystem with easy custom integrations.
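To make the plugin contract concrete, a provider plugin might look roughly like the following. This is a hypothetical sketch; the actual Plugin SDK interface may differ, and all type and field names here are assumptions.

```typescript
// Hypothetical sketch of a provider plugin contract -- the real
// SimpleNS Plugin SDK may define a different interface.
interface NotificationPayload {
  notificationId: string;
  recipient: string;
  subject?: string;
  body: string;
}

interface DeliveryResult {
  success: boolean;
  providerMessageId?: string;
  retryable?: boolean; // drives the retry-vs-fallback decision
}

interface NotificationPlugin {
  channel: string; // e.g. "email", "sms"
  // Validate the payload against this provider's schema before sending.
  validate(payload: NotificationPayload): boolean;
  // Deliver the notification via the provider's API.
  send(payload: NotificationPayload): Promise<DeliveryResult>;
}
```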
Horizontal Scalability
Scale each component independently with partition-based parallelism and load distribution across workers.
Fault Tolerance
Outbox pattern prevents message loss. Crash recovery for stuck notifications. Automatic retries with exponential backoff.
System Components
API Server
Responsibility: REST API for notification ingestion
Key Features:
- `/api/notification` - Send a single notification to a single recipient through multiple channels (see the example after this list)
- `/api/notification/batch` - Send the same notification to multiple recipients through multiple channels (max limit per request configurable in `.env`)
- Request validation using Zod schemas
- Bearer token authentication
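As an illustration, a client call might look like this. The endpoint and Bearer auth come from this page; the host, port, and payload field names are assumptions.

```typescript
// Sketch of a request to the ingestion API. The endpoint and Bearer
// auth scheme come from the docs; host, port, and payload fields are
// assumptions.
const res = await fetch("http://localhost:3000/api/notification", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.SIMPLENS_API_KEY}`,
  },
  body: JSON.stringify({
    recipient: "user@example.com",
    channels: ["email", "sms"],
    payload: { subject: "Welcome!", body: "Thanks for signing up." },
  }),
});
console.log(res.status, await res.json());
```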
Technology: Express.js, TypeScript
Scaling: Multiple instances behind load balancer
Data Flow:
Background Worker
Responsibility: Polls the `outbox` and `status_outbox` collections, publishes to Kafka, and updates notification status
Key Features:
- Polls the MongoDB `outbox` collection every 5 seconds (configurable)
- Publishes notifications from the `outbox` collection to the appropriate Kafka topics
- Consumes status updates from the `notification_status` Kafka topic
- Updates notification status in MongoDB
- Sends webhook callbacks
- Handles worker crashes via claim timeouts
- Polls the MongoDB `status_outbox` collection every 5 seconds (configurable) and publishes status updates to the `notification_status` Kafka topic (auto-resolution of ghost deliveries)
Technology: Node.js, MongoDB, Kafka Producer/Consumer
Scaling: Multiple worker instances (distributed claiming prevents duplicates)
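Distributed claiming can be pictured as an atomic claim-and-fetch against MongoDB. This is a minimal sketch assuming the Node.js `mongodb` driver; the `worker_id` and `claimed_at` fields come from this page, while the query shape and timeout value are assumptions.

```typescript
import { MongoClient } from "mongodb";

// Sketch: atomically claim one unclaimed (or stale-claimed) outbox entry.
// worker_id / claimed_at come from the docs; the query shape and the
// claim timeout value are assumptions.
const CLAIM_TIMEOUT_MS = 60_000;

async function claimOutboxEntry(client: MongoClient, workerId: string) {
  const outbox = client.db("simplens").collection("outbox");
  const staleBefore = new Date(Date.now() - CLAIM_TIMEOUT_MS);
  // findOneAndUpdate is atomic, so two workers can never claim the same entry.
  return outbox.findOneAndUpdate(
    {
      status: "pending",
      $or: [{ claimed_at: null }, { claimed_at: { $lt: staleBefore } }],
    },
    { $set: { worker_id: workerId, claimed_at: new Date() } },
    { returnDocument: "after" }
  );
}
```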
Data Flow:
Ghost Delivery and Permanently Failed Status Update Flow
Unified Notification Processor
Responsibility: Plugin-based notification delivery with rate limiting and automatic fallback
Key Features:
- Loads plugins from the `.plugins/` directory based on `simplens.config.yaml`
- Consumes from channel-specific Kafka topics
- Per-provider rate limiting (Token Bucket Algorithm; see the sketch after this list)
- Exponential backoff retries (configurable, default: 5 attempts)
- Automatic fallback to a secondary provider when the default provider fails with a non-retryable error
- Schema validation against the fallback provider before attempting delivery
- Idempotency using a Redis cache
- Processing locks via Redis (TTL: 2 minutes)
- Publishes delivery status to the `notification_status` topic
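The per-provider token bucket can be sketched as follows. This is an illustrative in-memory version, not the actual SimpleNS code; in production the bucket state lives in Redis, as shown under Infrastructure Components below.

```typescript
// Illustrative in-memory token bucket; SimpleNS keeps this state in Redis
// (HSET rate_limit:<provider_id> ...). maxTokens / refillRate mirror the
// per-provider settings mentioned under Performance Considerations.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private maxTokens: number,
    private refillRate: number // tokens per second
  ) {
    this.tokens = maxTokens;
  }

  tryConsume(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    // Refill proportionally to elapsed time, capped at maxTokens.
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSec * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // allowed to call the provider
    }
    return false; // rate limited: wait or requeue
  }
}
```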
Technology: Node.js, Kafka Consumer, Redis, Plugin SDK
Scaling:
- Multiple processor instances per channel
- Kafka partition-based parallelism
- Can run channel-specific processors (e.g., only email)
Configuration:
```bash
PROCESSOR_CHANNEL=all   # or 'email', 'sms', etc.
MAX_RETRY_COUNT=5
PROCESSING_TTL_SECONDS=120
```

Fallback Provider Logic:
1. Try the default provider for the channel
2. If it fails with a non-retryable error, validate the payload against the fallback provider's schema
3. If validation passes, try the fallback provider
4. If the fallback fails (or both fail), mark the notification as `failed`
If the error is retryable (e.g., rate limit, timeout), SimpleNS retries with the same provider using exponential backoff instead of falling back.
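Putting the retry and fallback rules together, the decision flow might look like this. This sketch reuses the hypothetical plugin types from Core Principles; the backoff base of 1 second is an assumption.

```typescript
// Sketch of the retry-then-fallback flow described above. NotificationPayload,
// NotificationPlugin, and DeliveryResult are the hypothetical types from the
// plugin sketch earlier; the 1s backoff base is an assumption.
const MAX_RETRY_COUNT = 5;

async function deliver(
  payload: NotificationPayload,
  primary: NotificationPlugin,
  fallback?: NotificationPlugin
): Promise<"delivered" | "failed"> {
  for (let attempt = 0; attempt < MAX_RETRY_COUNT; attempt++) {
    const result = await primary.send(payload);
    if (result.success) return "delivered";
    if (!result.retryable) break; // non-retryable: stop retrying, consider fallback
    // Exponential backoff: 1s, 2s, 4s, ...
    await new Promise((r) => setTimeout(r, 1000 * 2 ** attempt));
  }
  // Validate against the fallback provider's schema before attempting it.
  if (fallback && fallback.validate(payload)) {
    const result = await fallback.send(payload);
    if (result.success) return "delivered";
  }
  return "failed";
}
```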
Data Flow:
Delayed Processor
Responsibility: Handles scheduled notifications using two-phase commit
Key Features:
- Redis ZSET-based delay queue (score = Unix timestamp)
- Two-phase commit prevents message loss during crashes
- Polls every 1 second for due notifications (configurable)
- Fetches batch of due notifications (default: 10)
- Publishes to appropriate channel topics
- Handles poller failures with retries and exponential backoff
Two-Phase Commit Implementation:
- Claim Phase: Atomically lock events for this worker using `SET NX` (prevents duplicate processing)
- Process Phase: Publish to the target Kafka topic
- Confirm Phase: Remove from the queue ONLY after a successful publish
If a worker crashes between claim and confirm, the claim expires after 60 seconds and another worker can pick up the event.
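A minimal sketch of the claim/process/confirm cycle, assuming ioredis and kafkajs: the 60-second claim TTL and 1-second poll match this page, the key names are illustrative, and where the real implementation uses Lua scripts for atomicity this sketch uses plain commands for readability.

```typescript
import Redis from "ioredis";
import { Kafka } from "kafkajs";

// Key names (delay_queue, claim:*) are illustrative; the 60s claim TTL
// matches the docs. The real implementation uses Lua scripts for
// atomicity -- plain commands are used here for readability.
const redis = new Redis();
const producer = new Kafka({ clientId: "delayed-processor", brokers: ["localhost:9092"] }).producer();

async function pollDueNotifications(workerId: string) {
  // Fetch a batch of due notifications (default batch size: 10).
  const due = await redis.zrangebyscore("delay_queue", 0, Date.now(), "LIMIT", 0, 10);
  for (const raw of due) {
    const event = JSON.parse(raw);
    // 1. Claim: SET NX with a 60s TTL, so a crashed worker's claim
    //    expires and another worker can pick the event up.
    const claimed = await redis.set(`claim:${event.notification_id}`, workerId, "EX", 60, "NX");
    if (!claimed) continue; // another worker owns this event
    // 2. Process: publish to the target channel topic.
    await producer.send({
      topic: `${event.channel}_notification`,
      messages: [{ key: event.notification_id, value: raw }],
    });
    // 3. Confirm: remove from the queue ONLY after a successful publish.
    await redis.zrem("delay_queue", raw);
  }
}

async function main(workerId: string) {
  await producer.connect();
  setInterval(() => pollDueNotifications(workerId), 1000); // 1s poll (configurable)
}
```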
Technology: Node.js, Redis ZSET + Lua Scripts, Kafka Producer
Scaling: Multiple Delayed Processor instances (distributed claiming prevents duplicates)
Data Flow:
Recovery Service
Responsibility: Detects orphaned/stuck notifications and creates alerts
Key Features:
- Runs every 60 seconds (configurable)
- Detects notifications stuck in the `processing` state (timeout: 5 minutes)
- Detects notifications stuck in the `pending` state (timeout: 5 minutes)
- Creates alerts in MongoDB for manual intervention (see the sketch after this list)
- Cleanup of resolved alerts (retention: 24 hours)
- Cleanup of processed status outbox entries
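Stuck-notification detection boils down to a timestamp query plus an alert insert. In this sketch the 5-minute threshold comes from this page, while field names like `updated_at` and the alert document shape are assumptions.

```typescript
import { MongoClient } from "mongodb";

const STUCK_TIMEOUT_MS = 5 * 60 * 1000; // 5-minute threshold from the docs

// Sketch: find notifications stuck in `processing` and raise alerts.
// updated_at and the alert document shape are assumptions.
async function detectStuckProcessing(client: MongoClient) {
  const db = client.db("simplens");
  const cutoff = new Date(Date.now() - STUCK_TIMEOUT_MS);
  const stuck = await db
    .collection("notifications")
    .find({ status: "processing", updated_at: { $lt: cutoff } })
    .toArray();
  for (const n of stuck) {
    await db.collection("alerts").insertOne({
      type: "stuck_processing", // one of the alert types listed below
      notification_id: n._id,
      created_at: new Date(),
      resolved: false,
    });
  }
}
```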
Technology: Node.js, MongoDB, Cron-like polling
Scaling: Single instance (uses distributed locks for multi-instance support)
Alert Types:
- `ghost_delivery` - Status mismatch between Redis and MongoDB (notification delivered but status not updated)
- `stuck_processing` - Notification stuck in the processing state beyond the threshold
- `orphaned_pending` - Notification never picked up by a processor
Data Flow:

Admin Dashboard
Responsibility: Web-based monitoring and management
Key Features:
- Dashboard home with statistics
- Events explorer with search and filtering
- Send page to send single and batch notifications from the admin dashboard
- Failed events page with retry capabilities
- Alerts management
- Analytics and charts
- Plugins registry view
- Payload Studio for API schema exploration
- Authentication via NextAuth
Technology: Next.js, React, MongoDB direct queries, shadcn/ui
Scaling: Multiple instances (stateless, session in cookies)
Pages:
- `/` - Dashboard home
- `/events` - All notifications
- `/events/[id]` - Notification details
- `/send` - Send notifications from the admin dashboard
- `/failed` - Failed notifications
- `/alerts` - System alerts
- `/analytics` - Charts and graphs
- `/plugins` - Installed plugins
- `/payload-studio` - API schema builder
Infrastructure Components
MongoDB
Persistent storage for notifications, outbox, and alerts. Replica set required.
Kafka
Event streaming and message queue with channel-specific topics.
Redis
Caching, delay queue, rate limiting, and processing locks.
Loki + Grafana
Centralized logging and visualization across all services.
MongoDB
Purpose: Persistent storage for notifications, outbox, alerts
Configuration:
- Replica set required (minimum 1 node for dev, 3 for production)
- Collections:
  - `notifications` - Notification documents
  - `outbox` - Outbox pattern entries
  - `alerts` - Recovery alerts
  - `status_outbox` - Status update outbox
- Indexes for performance on `status`, `created_at`, `notification_id`
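For example, those indexes could be created with the Node.js `mongodb` driver. This is a sketch: the sort directions and compound grouping are assumptions.

```typescript
import { MongoClient } from "mongodb";

// Sketch: indexes on the fields the docs call out. Sort directions and
// the compound (status, created_at) choice are assumptions.
async function ensureIndexes(client: MongoClient) {
  const notifications = client.db("simplens").collection("notifications");
  await notifications.createIndex({ status: 1, created_at: -1 });
  await notifications.createIndex({ notification_id: 1 });
}
```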
Kafka
Purpose: Event streaming and message queue
Topics:
- `{channel}_notification` - Channel-specific (e.g., `email_notification`, `sms_notification`)
- `delayed_notification` - Scheduled notifications
- `notification_status` - Delivery status updates
Partitioning Strategy:
- Channel topics: Configurable partitions (env: `{CHANNEL}_PARTITION`)
- More partitions = more parallel consumers
- Partition by `notification_id` hash for ordering within the same notification (see the sketch below)
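With kafkajs, keying messages by `notification_id` is enough to get hash-based partition assignment and per-notification ordering. This is a sketch; the broker address and payload shape are assumptions.

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "simplens-worker", brokers: ["localhost:9092"] });
const producer = kafka.producer();

// Sketch: kafkajs hashes the message key to pick a partition, so every
// event for one notification_id lands on the same partition, in order.
async function publish(channel: string, notificationId: string, payload: object) {
  await producer.connect();
  await producer.send({
    topic: `${channel}_notification`,
    messages: [{ key: notificationId, value: JSON.stringify(payload) }],
  });
}
```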
Configuration:
```bash
EMAIL_PARTITION=5
SMS_PARTITION=3
DELAYED_PARTITION=1
NOTIFICATION_STATUS_PARTITION=1
```

Redis
Purpose: Caching, delay queue, rate limiting
Use Cases:
1. Delay Queue (ZSET)

```
ZADD delay_queue <scheduled_timestamp> <notification_json>
```

2. Idempotency Cache

```
SET idempotency:<notification_id> <result> EX 86400
```

3. Rate Limiting (Token Bucket)

```
HSET rate_limit:<provider_id> tokens <count> last_refill <timestamp>
```

4. Processing Locks

```
SET processing:<notification_id> <worker_id> EX 120
```

Loki + Grafana
Purpose: Centralized logging and visualization
Configuration:
- All services send logs to Loki via winston-loki
- Grafana datasource configured for Loki
- Labels: `service`, `level`, `notification_id`
- Query examples:

```
{service="api"}
{service="notification-processor", level="error"}
{notification_id="abc123"}
```

Data Flow
Immediate Notification Flow

Scheduled Notification Flow

Retry Flow

Recovery Flow

Scalability Model
Horizontal Scaling
API Server
- Run multiple instances behind load balancer (NGINX, ALB)
- Stateless (no in-memory session)
- Share same MongoDB and Kafka
Background Worker
- Multiple instances supported
- Distributed claiming via MongoDB (`worker_id` + `claimed_at`)
- Claim timeout handles crashed workers
Notification Processor
- Scale independently per channel
- Kafka consumer group ensures no duplicate processing
- Increase instances to reduce consumer lag
Delayed Processor
- Typically 1-2 instances
- Redis atomic operations prevent duplicates
- Low CPU usage, minimal scaling needed
Scaling Guidelines
Metric: Kafka Consumer Lag
```bash
# Check in Kafka UI or CLI
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group notification-processor-email
```

Action: Scale up if lag > 1000
```bash
docker-compose up -d --scale notification-processor=5
```

Increase Kafka Partitions:
```bash
# Update .env
EMAIL_PARTITION=10
# Use kafka-topics.sh to add partitions
bin/kafka-topics.sh --bootstrap-server <broker_host>:<port> \
  --topic <topic_name> --alter --partitions <new_total_number>
```

Processor per Channel:
```bash
# Instead of PROCESSOR_CHANNEL=all,
# run separate processors:
PROCESSOR_CHANNEL=email   # update in .env
docker-compose up -d notification-processor
```

Performance Considerations
Increase Throughput:
- Batch Size: Increase `OUTBOX_BATCH_SIZE` for high throughput
- Kafka Partitions: Increase partitions for parallel processing
- Worker Count: Scale processors horizontally
- MongoDB: Add indexes on frequently queried fields
Reduce Latency:
- Polling Interval: Reduce outbox polling interval (trade-off: DB load)
- Processing TTL: Reduce lock TTL for faster failure detection
- Network: Colocate services in same region/VPC
- Redis: Use cluster mode for high cache throughput
Optimize Resources:
- Rate Limits: Adjust per-provider `maxTokens` and `refillRate`
- Connection Pools: Tune MongoDB and Redis connection pools
- Memory: Monitor Kafka consumer memory for large message payloads
- Disk: Configure Kafka retention policies