Architecture
Understanding SimpleNS's system architecture, components, and design decisions.
High-Level Architecture

SimpleNS follows an event-driven, microservices-inspired architecture where each component has a specific responsibility. The system is built around the outbox pattern for reliability and uses Kafka for event streaming.
Deployment Architecture
SimpleNS uses split Docker Compose files for flexible deployment:
- `docker-compose.yaml` - Application services (uses pre-built images from GHCR)
- `docker-compose.infra.yaml` - Infrastructure services (MongoDB, Kafka, Redis, Loki)
- `docker-compose.dev.yaml` - All-in-one development environment
See the Self-Hosting Guide for deployment options.
Core Principles
Separation of Concerns
Orchestration (SimpleNS Core) handles retries, rate limiting, scheduling, and recovery. Delivery (Plugins) handles the actual notification sending via provider APIs.
Plugin-Based Extensibility
Swap providers without changing application code. Community-driven ecosystem with easy custom integrations.
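To make the plugin contract concrete, a provider plugin might look roughly like the following. This is a hypothetical sketch; the actual Plugin SDK interface may differ, and all type and field names here are assumptions.

```typescript
// Hypothetical sketch of a provider plugin contract -- the real
// SimpleNS Plugin SDK may define a different interface.
interface NotificationPayload {
  notificationId: string;
  recipient: string;
  subject?: string;
  body: string;
}

interface DeliveryResult {
  success: boolean;
  providerMessageId?: string;
  retryable?: boolean; // drives the retry-vs-fallback decision
}

interface NotificationPlugin {
  channel: string; // e.g. "email", "sms"
  // Validate the payload against this provider's schema before sending.
  validate(payload: NotificationPayload): boolean;
  // Deliver the notification via the provider's API.
  send(payload: NotificationPayload): Promise<DeliveryResult>;
}
```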
Horizontal Scalability
Scale each component independently with partition-based parallelism and load distribution across workers.
Fault Tolerance
Outbox pattern prevents message loss. Crash recovery for stuck notifications. Automatic retries with exponential backoff.
System Components
API Server
Responsibility: REST API for notification ingestion
Key Features:
- `/api/notification` - Send a single notification to a single recipient through multiple channels (see the example after this list)
- `/api/notification/batch` - Send the same notification to multiple recipients through multiple channels (max limit per request configurable in `.env`)
- Request validation using Zod schemas
- Bearer token authentication
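As an illustration, a client call might look like this. The endpoint and Bearer auth come from this page; the host, port, and payload field names are assumptions.

```typescript
// Sketch of a request to the ingestion API. The endpoint and Bearer
// auth scheme come from the docs; host, port, and payload fields are
// assumptions.
const res = await fetch("http://localhost:3000/api/notification", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.SIMPLENS_API_KEY}`,
  },
  body: JSON.stringify({
    recipient: "user@example.com",
    channels: ["email", "sms"],
    payload: { subject: "Welcome!", body: "Thanks for signing up." },
  }),
});
console.log(res.status, await res.json());
```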
Technology: Express.js, TypeScript
Scaling: Multiple instances behind load balancer
Data Flow:
Background Worker
Responsibility: Polls the `outbox` and `status_outbox` collections, publishes to Kafka, and updates notification status
Key Features:
- Polls the MongoDB `outbox` collection every 5 seconds (configurable)
- Publishes notifications from the `outbox` collection to the appropriate Kafka topics
- Consumes status updates from the `notification_status` Kafka topic
- Updates notification status in MongoDB
- Sends webhook callbacks
- Handles worker crashes via claim timeouts
- Polls the MongoDB `status_outbox` collection every 5 seconds (configurable) and publishes status updates to the `notification_status` Kafka topic (auto-resolution of ghost deliveries)
Technology: Node.js, MongoDB, Kafka Producer/Consumer
Scaling: Multiple worker instances (distributed claiming prevents duplicates)
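Distributed claiming can be pictured as an atomic claim-and-fetch against MongoDB. This is a minimal sketch assuming the Node.js `mongodb` driver; the `worker_id` and `claimed_at` fields come from this page, while the query shape and timeout value are assumptions.

```typescript
import { MongoClient } from "mongodb";

// Sketch: atomically claim one unclaimed (or stale-claimed) outbox entry.
// worker_id / claimed_at come from the docs; the query shape and the
// claim timeout value are assumptions.
const CLAIM_TIMEOUT_MS = 60_000;

async function claimOutboxEntry(client: MongoClient, workerId: string) {
  const outbox = client.db("simplens").collection("outbox");
  const staleBefore = new Date(Date.now() - CLAIM_TIMEOUT_MS);
  // findOneAndUpdate is atomic, so two workers can never claim the same entry.
  return outbox.findOneAndUpdate(
    {
      status: "pending",
      $or: [{ claimed_at: null }, { claimed_at: { $lt: staleBefore } }],
    },
    { $set: { worker_id: workerId, claimed_at: new Date() } },
    { returnDocument: "after" }
  );
}
```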
Data Flow:
Ghost Delivery and Permanently Failed Status Update Flow
Unified Notification Processor
Responsibility: Plugin-based notification delivery with rate limiting and automatic fallback
Key Features:
- Loads plugins from the `.plugins/` directory based on `simplens.config.yaml`
- Consumes from channel-specific Kafka topics
- Per-provider rate limiting (Token Bucket Algorithm; see the sketch after this list)
- Exponential backoff retries (configurable, default: 5 attempts)
- Automatic fallback to a secondary provider when the default provider fails with a non-retryable error
- Schema validation against the fallback provider before attempting delivery
- Idempotency using a Redis cache
- Processing locks via Redis (TTL: 2 minutes)
- Publishes delivery status to the `notification_status` topic
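The per-provider token bucket can be sketched as follows. This is an illustrative in-memory version, not the actual SimpleNS code; in production the bucket state lives in Redis, as shown under Infrastructure Components below.

```typescript
// Illustrative in-memory token bucket; SimpleNS keeps this state in Redis
// (HSET rate_limit:<provider_id> ...). maxTokens / refillRate mirror the
// per-provider settings mentioned under Performance Considerations.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private maxTokens: number,
    private refillRate: number // tokens per second
  ) {
    this.tokens = maxTokens;
  }

  tryConsume(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    // Refill proportionally to elapsed time, capped at maxTokens.
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSec * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // allowed to call the provider
    }
    return false; // rate limited: wait or requeue
  }
}
```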
Technology: Node.js, Kafka Consumer, Redis, Plugin SDK
Scaling:
- Multiple processor instances per channel
- Kafka partition-based parallelism
- Can run channel-specific processors (e.g., only email)
Configuration:
```bash
PROCESSOR_CHANNEL=all   # or 'email', 'sms', etc.
MAX_RETRY_COUNT=5
PROCESSING_TTL_SECONDS=120
```

Fallback Provider Logic:
1. Try the default provider for the channel
2. If it fails with a non-retryable error, validate the payload against the fallback provider's schema
3. If validation passes, try the fallback provider
4. If the fallback fails (or both fail), mark the notification as `failed`
If the error is retryable (e.g., rate limit, timeout), SimpleNS retries with the same provider using exponential backoff instead of falling back.
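Putting the retry and fallback rules together, the decision flow might look like this. This sketch reuses the hypothetical plugin types from Core Principles; the backoff base of 1 second is an assumption.

```typescript
// Sketch of the retry-then-fallback flow described above. NotificationPayload,
// NotificationPlugin, and DeliveryResult are the hypothetical types from the
// plugin sketch earlier; the 1s backoff base is an assumption.
const MAX_RETRY_COUNT = 5;

async function deliver(
  payload: NotificationPayload,
  primary: NotificationPlugin,
  fallback?: NotificationPlugin
): Promise<"delivered" | "failed"> {
  for (let attempt = 0; attempt < MAX_RETRY_COUNT; attempt++) {
    const result = await primary.send(payload);
    if (result.success) return "delivered";
    if (!result.retryable) break; // non-retryable: stop retrying, consider fallback
    // Exponential backoff: 1s, 2s, 4s, ...
    await new Promise((r) => setTimeout(r, 1000 * 2 ** attempt));
  }
  // Validate against the fallback provider's schema before attempting it.
  if (fallback && fallback.validate(payload)) {
    const result = await fallback.send(payload);
    if (result.success) return "delivered";
  }
  return "failed";
}
```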
Data Flow:
Delayed Processor
Responsibility: Handles scheduled notifications using two-phase commit
Key Features:
- Redis ZSET-based delay queue (score = Unix timestamp)
- Two-phase commit prevents message loss during crashes
- Polls every 1 second for due notifications (configurable)
- Fetches batch of due notifications (default: 10)
- Publishes to appropriate channel topics
- Handles poller failures with retries and exponential backoff
Two-Phase Commit Implementation:
- Claim Phase: Atomically lock events for this worker using `SET NX` (prevents duplicate processing)
- Process Phase: Publish to the target Kafka topic
- Confirm Phase: Remove from the queue ONLY after a successful publish
If a worker crashes between claim and confirm, the claim expires after 60 seconds and another worker can pick up the event.
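A minimal sketch of the claim/process/confirm cycle, assuming ioredis and kafkajs: the 60-second claim TTL and 1-second poll match this page, the key names are illustrative, and where the real implementation uses Lua scripts for atomicity this sketch uses plain commands for readability.

```typescript
import Redis from "ioredis";
import { Kafka } from "kafkajs";

// Key names (delay_queue, claim:*) are illustrative; the 60s claim TTL
// matches the docs. The real implementation uses Lua scripts for
// atomicity -- plain commands are used here for readability.
const redis = new Redis();
const producer = new Kafka({ clientId: "delayed-processor", brokers: ["localhost:9092"] }).producer();

async function pollDueNotifications(workerId: string) {
  // Fetch a batch of due notifications (default batch size: 10).
  const due = await redis.zrangebyscore("delay_queue", 0, Date.now(), "LIMIT", 0, 10);
  for (const raw of due) {
    const event = JSON.parse(raw);
    // 1. Claim: SET NX with a 60s TTL, so a crashed worker's claim
    //    expires and another worker can pick the event up.
    const claimed = await redis.set(`claim:${event.notification_id}`, workerId, "EX", 60, "NX");
    if (!claimed) continue; // another worker owns this event
    // 2. Process: publish to the target channel topic.
    await producer.send({
      topic: `${event.channel}_notification`,
      messages: [{ key: event.notification_id, value: raw }],
    });
    // 3. Confirm: remove from the queue ONLY after a successful publish.
    await redis.zrem("delay_queue", raw);
  }
}

async function main(workerId: string) {
  await producer.connect();
  setInterval(() => pollDueNotifications(workerId), 1000); // 1s poll (configurable)
}
```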
Technology: Node.js, Redis ZSET + Lua Scripts, Kafka Producer
Scaling: Multiple Delayed Processor instances (distributed claiming prevents duplicates)
Data Flow:
Recovery Service
Responsibility: Detects orphaned/stuck notifications and creates alerts
Key Features:
- Runs every 60 seconds (configurable)
- Detects notifications stuck in the `processing` state (timeout: 5 minutes)
- Detects notifications stuck in the `pending` state (timeout: 5 minutes)
- Creates alerts in MongoDB for manual intervention (see the sketch after this list)
- Cleanup of resolved alerts (retention: 24 hours)
- Cleanup of processed status outbox entries
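Stuck-notification detection boils down to a timestamp query plus an alert insert. In this sketch the 5-minute threshold comes from this page, while field names like `updated_at` and the alert document shape are assumptions.

```typescript
import { MongoClient } from "mongodb";

const STUCK_TIMEOUT_MS = 5 * 60 * 1000; // 5-minute threshold from the docs

// Sketch: find notifications stuck in `processing` and raise alerts.
// updated_at and the alert document shape are assumptions.
async function detectStuckProcessing(client: MongoClient) {
  const db = client.db("simplens");
  const cutoff = new Date(Date.now() - STUCK_TIMEOUT_MS);
  const stuck = await db
    .collection("notifications")
    .find({ status: "processing", updated_at: { $lt: cutoff } })
    .toArray();
  for (const n of stuck) {
    await db.collection("alerts").insertOne({
      type: "stuck_processing", // one of the alert types listed below
      notification_id: n._id,
      created_at: new Date(),
      resolved: false,
    });
  }
}
```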
Technology: Node.js, MongoDB, Cron-like polling
Scaling: Single instance (uses distributed locks for multi-instance support)
Alert Types:
- `ghost_delivery` - Status mismatch between Redis and MongoDB (notification delivered but status not updated)
- `stuck_processing` - Notification stuck in the processing state beyond the threshold
- `orphaned_pending` - Notification never picked up by a processor
Data Flow:

Admin Dashboard
Responsibility: Web-based monitoring and management
Key Features:
- Dashboard home with statistics
- Events explorer with search and filtering
- Send page to send single and batch notifications from the admin dashboard
- Failed events page with retry capabilities
- Alerts management
- Analytics and charts
- Plugins registry view
- Payload Studio for API schema exploration
- Authentication via NextAuth
Technology: Next.js, React, MongoDB direct queries, shadcn/ui
Scaling: Multiple instances (stateless, session in cookies)
Pages:
- `/` - Dashboard home
- `/events` - All notifications
- `/events/[id]` - Notification details
- `/send` - Send notifications from the admin dashboard
- `/failed` - Failed notifications
- `/alerts` - System alerts
- `/analytics` - Charts and graphs
- `/plugins` - Installed plugins
- `/payload-studio` - API schema builder
Infrastructure Components
MongoDB
Persistent storage for notifications, outbox, and alerts. Replica set required.
Kafka
Event streaming and message queue with channel-specific topics.
Redis
Caching, delay queue, rate limiting, and processing locks.
Loki + Grafana
Centralized logging and visualization across all services.
MongoDB
Purpose: Persistent storage for notifications, outbox, alerts
Configuration:
- Replica set required (minimum 1 node for dev, 3 for production)
- Collections:
  - `notifications` - Notification documents
  - `outbox` - Outbox pattern entries
  - `alerts` - Recovery alerts
  - `status_outbox` - Status update outbox
- Indexes for performance on `status`, `created_at`, `notification_id`
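For example, those indexes could be created with the Node.js `mongodb` driver. This is a sketch: the sort directions and compound grouping are assumptions.

```typescript
import { MongoClient } from "mongodb";

// Sketch: indexes on the fields the docs call out. Sort directions and
// the compound (status, created_at) choice are assumptions.
async function ensureIndexes(client: MongoClient) {
  const notifications = client.db("simplens").collection("notifications");
  await notifications.createIndex({ status: 1, created_at: -1 });
  await notifications.createIndex({ notification_id: 1 });
}
```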
Kafka
Purpose: Event streaming and message queue
Topics:
- `{channel}_notification` - Channel-specific (e.g., `email_notification`, `sms_notification`)
- `delayed_notification` - Scheduled notifications
- `notification_status` - Delivery status updates
Partitioning Strategy:
- Channel topics: Configurable partitions (env: `{CHANNEL}_PARTITION`)
- More partitions = more parallel consumers
- Partition by `notification_id` hash for ordering within the same notification (see the sketch below)
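With kafkajs, keying messages by `notification_id` is enough to get hash-based partition assignment and per-notification ordering. This is a sketch; the broker address and payload shape are assumptions.

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "simplens-worker", brokers: ["localhost:9092"] });
const producer = kafka.producer();

// Sketch: kafkajs hashes the message key to pick a partition, so every
// event for one notification_id lands on the same partition, in order.
async function publish(channel: string, notificationId: string, payload: object) {
  await producer.connect();
  await producer.send({
    topic: `${channel}_notification`,
    messages: [{ key: notificationId, value: JSON.stringify(payload) }],
  });
}
```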
Configuration:
```bash
EMAIL_PARTITION=5
SMS_PARTITION=3
DELAYED_PARTITION=1
NOTIFICATION_STATUS_PARTITION=1
```

Redis
Purpose: Caching, delay queue, rate limiting
Use Cases:
1. Delay Queue (ZSET)

```
ZADD delay_queue <scheduled_timestamp> <notification_json>
```

2. Idempotency Cache

```
SET idempotency:<notification_id> <result> EX 86400
```

3. Rate Limiting (Token Bucket)

```
HSET rate_limit:<provider_id> tokens <count> last_refill <timestamp>
```

4. Processing Locks

```
SET processing:<notification_id> <worker_id> EX 120
```

Loki + Grafana
Purpose: Centralized logging and visualization
Configuration:
- All services send logs to Loki via winston-loki
- Grafana datasource configured for Loki
- Labels: `service`, `level`, `notification_id`
- Query examples:

```
{service="api"}
{service="notification-processor", level="error"}
{notification_id="abc123"}
```

Data Flow
Immediate Notification Flow

Scheduled Notification Flow

Retry Flow

Recovery Flow

Scalability Model
Horizontal Scaling
API Server
- Run multiple instances behind load balancer (NGINX, ALB)
- Stateless (no in-memory session)
- Share same MongoDB and Kafka
Background Worker
- Multiple instances supported
- Distributed claiming via MongoDB (`worker_id` + `claimed_at`)
- Claim timeout handles crashed workers
Notification Processor
- Scale independently per channel
- Kafka consumer group ensures no duplicate processing
- Increase instances to reduce consumer lag
Delayed Processor
- Typically 1-2 instances
- Redis atomic operations prevent duplicates
- Low CPU usage, minimal scaling needed
Scaling Guidelines
Metric: Kafka Consumer Lag
```bash
# Check in Kafka UI or CLI
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group notification-processor-email
```

Action: Scale up if lag > 1000
```bash
docker-compose up -d --scale notification-processor=5
```

Increase Kafka Partitions:
```bash
# Update .env
EMAIL_PARTITION=10
# Use kafka-topics.sh to add partitions
bin/kafka-topics.sh --bootstrap-server <broker_host>:<port> \
  --topic <topic_name> --alter --partitions <new_total_number>
```

Processor per Channel:
```bash
# Instead of PROCESSOR_CHANNEL=all,
# run separate processors:
PROCESSOR_CHANNEL=email   # update in .env
docker-compose up -d notification-processor
```

Performance Considerations
Increase Throughput:
- Batch Size: Increase `OUTBOX_BATCH_SIZE` for high throughput
- Kafka Partitions: Increase partitions for parallel processing
- Worker Count: Scale processors horizontally
- MongoDB: Add indexes on frequently queried fields
Reduce Latency:
- Polling Interval: Reduce outbox polling interval (trade-off: DB load)
- Processing TTL: Reduce lock TTL for faster failure detection
- Network: Colocate services in same region/VPC
- Redis: Use cluster mode for high cache throughput
Optimize Resources:
- Rate Limits: Adjust per-provider `maxTokens` and `refillRate`
- Connection Pools: Tune MongoDB and Redis connection pools
- Memory: Monitor Kafka consumer memory for large message payloads
- Disk: Configure Kafka retention policies