
Architecture

Understanding SimpleNS's system architecture, components, and design decisions.


High-Level Architecture

SimpleNS HLD Diagram

SimpleNS follows an event-driven, microservices-inspired architecture where each component has a specific responsibility. The system is built around the outbox pattern for reliability and uses Kafka for event streaming.

Deployment Architecture

SimpleNS uses split Docker Compose files for flexible deployment:

  • docker-compose.yaml - Application services (uses pre-built images from GHCR)
  • docker-compose.infra.yaml - Infrastructure services (MongoDB, Kafka, Redis, Loki)
  • docker-compose.dev.yaml - All-in-one development environment

See the Self-Hosting Guide for deployment options.

Core Principles

Separation of Concerns

Orchestration (SimpleNS Core) handles retries, rate limiting, scheduling, and recovery. Delivery (Plugins) handles the actual notification sending via provider APIs.

Plugin-Based Extensibility

Swap providers without changing application code. Community-driven ecosystem with easy custom integrations.

Horizontal Scalability

Scale each component independently with partition-based parallelism and load distribution across workers.

Fault Tolerance

Outbox pattern prevents message loss. Crash recovery for stuck notifications. Automatic retries with exponential backoff.

System Components

API Server

Responsibility: REST API for notification ingestion

Key Features:

  • /api/notification - Send a single notification to a single recipient through multiple channels
  • /api/notification/batch - Send the same notification to multiple recipients through multiple channels (max limit per request configurable in .env)
  • Request validation using Zod schemas
  • Bearer token authentication
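A request body for /api/notification might look like the following. The field names here are illustrative only; the authoritative request shape is defined by the API's Zod schemas.

```json
{
  "recipient": "user@example.com",
  "channels": ["email", "sms"],
  "subject": "Welcome!",
  "body": "Thanks for signing up."
}
```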

Technology: Express.js, TypeScript

Scaling: Multiple instances behind load balancer

Data Flow:

Background Worker

Responsibility: Polls the outbox and status_outbox collections, publishes to Kafka, and updates notification status

Key Features:

  • Polls the MongoDB outbox collection every 5 seconds (configurable)
  • Publishes notifications from the outbox collection to the appropriate Kafka topics
  • Consumes status updates from the notification_status Kafka topic
  • Updates notification status in MongoDB
  • Sends webhook callbacks
  • Handles worker crashes via claim timeouts
  • Polls the MongoDB status_outbox collection every 5 seconds (configurable) and publishes status updates to the notification_status Kafka topic (auto-resolution of ghost deliveries)

Technology: Node.js, MongoDB, Kafka Producer/Consumer

Scaling: Multiple worker instances (distributed claiming prevents duplicates)
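The claim-timeout behavior that lets workers recover each other's crashed claims can be sketched as a pure function (the timeout value and field names are illustrative; the real worker reads them from its outbox documents and configuration):

```typescript
// Decide whether an outbox entry's claim has expired and may be re-claimed
// by another worker. `claimedAt` is null when the entry was never claimed.
function isClaimable(
  claimedAt: Date | null,
  now: Date,
  claimTimeoutMs: number,
): boolean {
  if (claimedAt === null) return true; // never claimed
  return now.getTime() - claimedAt.getTime() > claimTimeoutMs;
}
```

In the real system this check is performed atomically in MongoDB (a conditional findOneAndUpdate that sets worker_id and claimed_at), so two workers cannot claim the same entry.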

Data Flow:

Ghost Delivery and Permanently Failed Status Update Flow

Unified Notification Processor

Responsibility: Plugin-based notification delivery with rate limiting and automatic fallback

Key Features:

  • Loads plugins from .plugins/ directory based on simplens.config.yaml
  • Consumes from channel-specific Kafka topics
  • Per-provider rate limiting (Token Bucket Algorithm)
  • Exponential backoff retries (configurable, default: 5 attempts)
  • Automatic fallback to secondary provider when default provider fails with non-retryable error
  • Schema validation against fallback provider before attempting
  • Idempotency using Redis cache
  • Processing locks via Redis (TTL: 2 minutes)
  • Publishes delivery status to notification_status topic
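The exponential-backoff retry schedule can be sketched as follows; the base delay and cap are illustrative values, not SimpleNS's actual defaults:

```typescript
// Delay before retry attempt N: doubles each attempt, capped at maxMs.
// With baseMs = 1000: attempt 1 -> 1s, 2 -> 2s, 3 -> 4s, ...
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(baseMs * 2 ** (attempt - 1), maxMs);
}
```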

Technology: Node.js, Kafka Consumer, Redis, Plugin SDK

Scaling:

  • Multiple processor instances per channel
  • Kafka partition-based parallelism
  • Can run channel-specific processors (e.g., only email)

Configuration:

PROCESSOR_CHANNEL=all  # or 'email', 'sms', etc.
MAX_RETRY_COUNT=5
PROCESSING_TTL_SECONDS=120

Fallback Provider Logic:

  1. Try default provider for channel
  2. If fails with non-retryable error, validate against fallback provider schema
  3. If validation passes, try fallback provider
  4. If fallback fails or both fail, mark as failed

If the error is retryable (e.g., rate limit, timeout), SimpleNS retries with the same provider using exponential backoff instead of falling back.
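The retry-versus-fallback decision described above can be sketched as a pure function (parameter names such as `fallbackSchemaValid` are illustrative; the real inputs come from the provider error and the Plugin SDK's schema validation):

```typescript
type DeliveryDecision = "retry_same_provider" | "try_fallback" | "fail";

// Retryable errors stay with the same provider until retries are exhausted;
// non-retryable errors fall back only if the payload validates against the
// fallback provider's schema.
function nextAction(
  retryable: boolean,
  attempt: number,
  maxRetries: number,
  hasFallback: boolean,
  fallbackSchemaValid: boolean,
): DeliveryDecision {
  if (retryable) {
    return attempt < maxRetries ? "retry_same_provider" : "fail";
  }
  if (hasFallback && fallbackSchemaValid) return "try_fallback";
  return "fail";
}
```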

Data Flow:

Delayed Processor

Responsibility: Handles scheduled notifications using two-phase commit

Key Features:

  • Redis ZSET-based delay queue (score = Unix timestamp)
  • Two-phase commit prevents message loss during crashes
  • Polls every 1 second for due notifications (configurable)
  • Fetches batch of due notifications (default: 10)
  • Publishes to appropriate channel topics
  • Handles poller failures with retries and exponential backoff

Two-Phase Commit Implementation:

  1. Claim Phase: Atomically lock events for this worker using SET NX (prevents duplicate processing)
  2. Process Phase: Publish to target Kafka topic
  3. Confirm Phase: Remove from queue ONLY after successful publish

If a worker crashes between claim and confirm, the claim expires after 60 seconds and another worker can pick up the event.
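The three phases above can be sketched with an in-memory model. Redis's SET NX EX is modeled as a map of claim expiry times, and the Kafka publish is left out; this is an illustration of the claim semantics, not the actual implementation (which uses Redis ZSET + Lua scripts):

```typescript
// In-memory sketch of claim/process/confirm for the delayed queue.
class DelayQueueSketch {
  private due = new Map<string, string>();    // eventId -> payload
  private claims = new Map<string, number>(); // eventId -> claim expiry (ms)

  add(id: string, payload: string): void {
    this.due.set(id, payload);
  }

  // Claim phase: succeeds only if no unexpired claim exists (SET NX semantics).
  claim(id: string, now: number, ttlMs: number): boolean {
    const expiry = this.claims.get(id);
    if (expiry !== undefined && expiry > now) return false;
    this.claims.set(id, now + ttlMs);
    return true;
  }

  // Confirm phase: remove from the queue only after a successful publish.
  confirm(id: string): void {
    this.due.delete(id);
    this.claims.delete(id);
  }

  has(id: string): boolean {
    return this.due.has(id);
  }
}
```

If a worker claims an event and crashes before confirming, the claim expiry passes and another worker's claim() succeeds, matching the 60-second recovery behavior described above.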

Technology: Node.js, Redis ZSET + Lua Scripts, Kafka Producer

Scaling: Multiple Delayed Processor instances (distributed claiming prevents duplicates)

Data Flow:

Recovery Service

Responsibility: Detects orphaned/stuck notifications and creates alerts

Key Features:

  • Runs every 60 seconds (configurable)
  • Detects notifications stuck in processing state (timeout: 5 minutes)
  • Detects notifications stuck in pending state (timeout: 5 minutes)
  • Creates alerts in MongoDB for manual intervention
  • Cleanup of resolved alerts (retention: 24 hours)
  • Cleanup of processed status outbox entries

Technology: Node.js, MongoDB, Cron-like polling

Scaling: A single instance is typical; distributed locks make running multiple instances safe

Alert Types:

  • ghost_delivery - Status mismatch between Redis and MongoDB (notification delivered but status not updated)
  • stuck_processing - Notification stuck in processing state beyond threshold
  • orphaned_pending - Notification never picked up by processor
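The stuck/orphaned detection can be sketched as a classification over a notification's status and last-update time (field names and the default threshold mirror the 5-minute timeouts above but are illustrative):

```typescript
type StuckAlert = "stuck_processing" | "orphaned_pending" | null;

// Classify a notification as stuck if it has sat in the same state longer
// than the timeout; healthy or recently-updated notifications return null.
function classifyStuck(
  status: string,
  updatedAt: Date,
  now: Date,
  timeoutMs = 5 * 60 * 1000,
): StuckAlert {
  const ageMs = now.getTime() - updatedAt.getTime();
  if (ageMs <= timeoutMs) return null;
  if (status === "processing") return "stuck_processing";
  if (status === "pending") return "orphaned_pending";
  return null;
}
```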

Data Flow:

Recovery Flow

Admin Dashboard

Responsibility: Web-based monitoring and management

Key Features:

  • Dashboard home with statistics
  • Events explorer with search and filtering
  • Send page to send single and batch notifications from the admin dashboard
  • Failed events page with retry capabilities
  • Alerts management
  • Analytics and charts
  • Plugins registry view
  • Payload Studio for API schema exploration
  • Authentication via NextAuth

Technology: Next.js, React, MongoDB direct queries, shadcn/ui

Scaling: Multiple instances (stateless, session in cookies)

Pages:

  • / - Dashboard home
  • /events - All notifications
  • /events/[id] - Notification details
  • /send - Send notifications from admin dashboard
  • /failed - Failed notifications
  • /alerts - System alerts
  • /analytics - Charts and graphs
  • /plugins - Installed plugins
  • /payload-studio - API schema builder

Infrastructure Components

MongoDB

Persistent storage for notifications, outbox entries, and alerts. A replica set is required (MongoDB multi-document transactions, which the outbox pattern typically relies on, are only available on replica sets).

Kafka

Event streaming and message queue with channel-specific topics.

Redis

Caching, delay queue, rate limiting, and processing locks.

Loki + Grafana

Centralized logging and visualization across all services.

MongoDB

Purpose: Persistent storage for notifications, outbox, alerts

Configuration:

  • Replica set required (minimum 1 node for dev, 3 for production)
  • Collections:
    • notifications - Notification documents
    • outbox - Outbox pattern entries
    • alerts - Recovery alerts
    • status_outbox - Status update outbox
  • Indexes for performance on status, created_at, notification_id

Kafka

Purpose: Event streaming and message queue

Topics:

  • {channel}_notification - Channel-specific (e.g., email_notification, sms_notification)
  • delayed_notification - Scheduled notifications
  • notification_status - Delivery status updates

Partitioning Strategy:

  • Channel topics: Configurable partitions (env: {CHANNEL}_PARTITION)
  • More partitions = more parallel consumers
  • Partition by notification_id hash for ordering within same notification
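Hash-based partition selection can be sketched as below. Kafka clients use murmur2 for key hashing in practice; this simple hash only illustrates the property that matters here: the same notification_id always maps to the same partition, preserving per-notification ordering.

```typescript
// Map a notification_id to a partition deterministically.
function partitionFor(notificationId: string, partitionCount: number): number {
  let hash = 0;
  for (const ch of notificationId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // keep unsigned 32-bit
  }
  return hash % partitionCount;
}
```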

Configuration:

EMAIL_PARTITION=5
SMS_PARTITION=3
DELAYED_PARTITION=1
NOTIFICATION_STATUS_PARTITION=1

Redis

Purpose: Caching, delay queue, rate limiting

Use Cases:

1. Delay Queue (ZSET)

ZADD delay_queue <scheduled_timestamp> <notification_json>

2. Idempotency Cache

SET idempotency:<notification_id> <result> EX 86400

3. Rate Limiting (Token Bucket)

HSET rate_limit:<provider_id> tokens <count> last_refill <timestamp>

4. Processing Locks

SET processing:<notification_id> <worker_id> EX 120
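The token-bucket algorithm backing the rate-limit hash above can be sketched as refill-then-consume (maxTokens and refillRate are per-provider settings; the values in the test are illustrative):

```typescript
interface BucketState {
  tokens: number;
  lastRefill: number; // ms timestamp
}

// Refill tokens based on elapsed time, then try to consume one.
// Returns true if a send is allowed, false if rate-limited.
function tryConsume(
  state: BucketState,
  now: number,
  maxTokens: number,
  refillRatePerSec: number,
): boolean {
  const elapsedSec = (now - state.lastRefill) / 1000;
  state.tokens = Math.min(maxTokens, state.tokens + elapsedSec * refillRatePerSec);
  state.lastRefill = now;
  if (state.tokens >= 1) {
    state.tokens -= 1;
    return true;
  }
  return false;
}
```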

Loki + Grafana

Purpose: Centralized logging and visualization

Configuration:

  • All services send logs to Loki via winston-loki
  • Grafana datasource configured for Loki
  • Labels: service, level, notification_id
  • Query examples:
{service="api"}
{service="notification-processor", level="error"}
{notification_id="abc123"}

Data Flow

Immediate Notification Flow

Scheduled Notification Flow

Retry Flow

Recovery Flow

Scalability Model

Horizontal Scaling

API Server

  • Run multiple instances behind load balancer (NGINX, ALB)
  • Stateless (no in-memory session)
  • Share same MongoDB and Kafka

Background Worker

  • Multiple instances supported
  • Distributed claiming via MongoDB (worker_id + claimed_at)
  • Claim timeout handles crashed workers

Notification Processor

  • Scale independently per channel
  • Kafka consumer group ensures no duplicate processing
  • Increase instances to reduce consumer lag

Delayed Processor

  • Typically 1-2 instances
  • Redis atomic operations prevent duplicates
  • Low CPU usage, minimal scaling needed

Scaling Guidelines

Metric: Kafka Consumer Lag

# Check in Kafka UI or CLI
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group notification-processor-email

Action: Scale up if lag > 1000

docker-compose up -d --scale notification-processor=5

Increase Kafka Partitions:

# Update .env
EMAIL_PARTITION=10

# Use kafka-topics.sh to add partitions
bin/kafka-topics.sh --bootstrap-server <broker_host>:<port> --topic <topic_name> --alter --partitions <new_total_number>

Processor per Channel:

# Instead of PROCESSOR_CHANNEL=all
# Run separate processors:
PROCESSOR_CHANNEL=email  # update in .env
docker-compose up -d notification-processor

Performance Considerations

Increase Throughput:

  • Batch Size: Increase OUTBOX_BATCH_SIZE for high throughput
  • Kafka Partitions: Increase partitions for parallel processing
  • Worker Count: Scale processors horizontally
  • MongoDB: Add indexes on frequently queried fields

Reduce Latency:

  • Polling Interval: Reduce outbox polling interval (trade-off: DB load)
  • Processing TTL: Reduce lock TTL for faster failure detection
  • Network: Colocate services in same region/VPC
  • Redis: Use cluster mode for high cache throughput

Optimize Resources:

  • Rate Limits: Adjust per-provider maxTokens and refillRate
  • Connection Pools: Tune MongoDB and Redis connection pools
  • Memory: Monitor Kafka consumer memory for large message payloads
  • Disk: Configure Kafka retention policies
