Monitoring

This guide covers the observability and monitoring stack for the eTeamups Platform (Zeswa Platform), including distributed tracing, health check endpoints, Docker health checks, and resource monitoring.

OpenTelemetry Integration

The platform uses OpenTelemetry (OTEL) to collect distributed traces and logs from all application services. The telemetry data is exported to a SigNoz backend for visualization and alerting.

How It Works

The instrumentation is initialized in instrumentation.ts at the root of the repository. When a service starts, the OpenTelemetry Node SDK is loaded and begins collecting:

  • Distributed traces – HTTP requests, database queries, and inter-service calls are automatically instrumented.
  • Logs – Winston log records are correlated with trace context and exported via OTLP.

The SDK configures the following exporters:

Exporter | Protocol | Endpoint Environment Variable
Trace exporter | OTLP/HTTP | OTEL_EXPORTER_OTLP_ENDPOINT (appends /v1/traces)
Log exporter | OTLP/HTTP | OTEL_EXPORTER_OTLP_LOGS_ENDPOINT (appends /v1/logs)

Each service identifies itself using the OTEL_SERVICE_NAME environment variable (defaults to zeswa-platform).
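The exporter settings above can be sketched as a small resolver. This is an illustration only, not the contents of instrumentation.ts: the helper names resolveTelemetryConfig and joinUrl are hypothetical, and throwing when an endpoint variable is missing is an assumption (the real file may fall back to a default instead).

```typescript
// Hypothetical helper that mirrors the exporter table; not the actual instrumentation.ts.
type Env = Record<string, string | undefined>;

interface TelemetryConfig {
  serviceName: string;
  tracesUrl: string;
  logsUrl: string;
}

// Join a base endpoint and a signal path without doubling the slash.
function joinUrl(base: string, path: string): string {
  return base.replace(/\/+$/, "") + path;
}

function resolveTelemetryConfig(env: Env): TelemetryConfig {
  const tracesBase = env["OTEL_EXPORTER_OTLP_ENDPOINT"];
  const logsBase = env["OTEL_EXPORTER_OTLP_LOGS_ENDPOINT"];
  if (!tracesBase || !logsBase) {
    // Assumption: fail fast rather than guess an endpoint.
    throw new Error("OTLP endpoint variables are not set");
  }
  return {
    // Documented default when OTEL_SERVICE_NAME is unset.
    serviceName: env["OTEL_SERVICE_NAME"] || "zeswa-platform",
    tracesUrl: joinUrl(tracesBase, "/v1/traces"),
    logsUrl: joinUrl(logsBase, "/v1/logs"),
  };
}
```

In a service this would typically be called once at startup with process.env, and the resulting URLs passed to the trace and log exporters.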

Auto-Instrumentation

The SDK uses @opentelemetry/auto-instrumentations-node which automatically instruments:

  • Express HTTP handlers
  • Outgoing HTTP/HTTPS requests
  • MongoDB driver calls
  • Redis client calls

File system instrumentation is disabled by default to reduce noise.

Winston Log Correlation

The WinstonInstrumentation from @opentelemetry/instrumentation-winston injects trace and span IDs into every Winston log record. This allows you to jump from a log entry in SigNoz directly to the associated trace.

Each log record is annotated with resource.service.name matching the value of OTEL_SERVICE_NAME.

Graceful Shutdown

The SDK listens for SIGTERM and flushes all pending telemetry data before the process exits. This ensures that traces and logs from the final requests are not lost during deployments or container restarts.
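A minimal sketch of this shutdown pattern, assuming the handler calls the SDK's shutdown() method and then exits. The function name flushOnSigterm and the two small interfaces are hypothetical; they stand in for Node's process object and the OpenTelemetry Node SDK so that the example has no dependencies.

```typescript
// Minimal shapes so the sketch does not depend on Node or OpenTelemetry typings.
interface SignalSource {
  once(signal: "SIGTERM", handler: () => void): unknown;
}

interface Flushable {
  // The Node SDK exposes shutdown(), which flushes pending telemetry.
  shutdown(): Promise<void>;
}

// Register a one-shot SIGTERM handler that flushes telemetry, then exits.
function flushOnSigterm(
  proc: SignalSource,
  sdk: Flushable,
  exit: (code: number) => void,
): void {
  proc.once("SIGTERM", () => {
    sdk
      .shutdown()
      .then(() => exit(0))
      // Assumption: a failed flush should still end the process, with a non-zero code.
      .catch(() => exit(1));
  });
}
```

Exiting only after shutdown() settles is the point of the pattern: the container runtime sends SIGTERM first and waits for a grace period, which gives the exporters time to send the final batch.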

SigNoz Dashboards

SigNoz serves as the observability backend. It receives OTLP data and provides:

  • Trace Explorer – Search and filter distributed traces across all microservices.
  • Log Explorer – Query structured logs with full-text search and attribute filtering.
  • Service Map – Visualize dependencies between auth, profile, organisation, media, and message-queue services.
  • Dashboards – Build custom dashboards for latency percentiles, error rates, and throughput.

Create the following dashboards in SigNoz for production monitoring:

  1. Service Overview – P50/P95/P99 latency, request rate, and error rate per service.
  2. Database Performance – MongoDB query latency, connection pool usage, slow queries.
  3. API Gateway – Request volume by endpoint (/auth, /profile, /organisation, /media), status code distribution.
  4. Error Tracking – Error rate trends, top error messages grouped by service.

Health Check Endpoints

The platform exposes health check endpoints at multiple layers.

Application Health Checks

Every microservice registers two health endpoints in libs/server/src/Server.ts:

Endpoint | Response | Purpose
GET /_ping | 200 (empty body) | Private liveness probe. Returns only a status code with no server metadata.
GET /health | 200 {"status": "ok"} | Container health probe. Used by Docker and load balancers.
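The contract of the two endpoints can be sketched as a pure function. This is not the code in libs/server/src/Server.ts; handleProbe and ProbeResponse are hypothetical names used only to make the expected responses concrete.

```typescript
interface ProbeResponse {
  status: number;
  body: string;
}

// Return the probe response for a request, or undefined if it is not a probe.
function handleProbe(method: string, path: string): ProbeResponse | undefined {
  if (method !== "GET") return undefined;
  if (path === "/_ping") {
    // Liveness: status code only, no body and no server metadata.
    return { status: 200, body: "" };
  }
  if (path === "/health") {
    // Container health: small JSON document.
    return { status: 200, body: JSON.stringify({ status: "ok" }) };
  }
  return undefined;
}
```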

Nginx Health Check

The Nginx reverse proxy exposes its own health endpoint on port 80:

GET /health -> 200 "healthy\n"

This endpoint is defined in the HTTP server block in nginx/conf.d/eteamups.conf and has access_log off to avoid polluting logs with probe requests.

Each HTTPS virtual host (Hub, Admin, API) also exposes /health returning the same response.

API Gateway Service Health

Through the Nginx API gateway at api.zeswa.com, each service's health can be checked at:

URL | Upstream
https://api.zeswa.com/auth/health | auth-service:9000
https://api.zeswa.com/profile/health | profile-service:9100
https://api.zeswa.com/organisation/health | organisation-service:9107
https://api.zeswa.com/media/health | media-service:9102

Health Check Script

The scripts/health-check.sh script performs a comprehensive check of the entire platform:

./scripts/health-check.sh

This script checks:

  1. Docker service status – Verifies all 10 containers are running and reports their health status (healthy, running, unhealthy).
  2. HTTP endpoints – Curls each public-facing URL and verifies the expected HTTP status code.
  3. Database connectivity – Runs mongosh ping and redis-cli ping inside the respective containers.
  4. Resource usage – Outputs docker stats for CPU and memory consumption.
  5. Disk usage – Runs docker system df to show image, container, and volume disk usage.

The script exits with code 0 if all checks pass, or 1 with a count of failures and troubleshooting hints.
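The exit-code rule can be sketched as follows. The real script is a shell script; this TypeScript version only illustrates the aggregation logic, and the names CheckResult and summarize are hypothetical.

```typescript
interface CheckResult {
  name: string;
  ok: boolean;
}

interface Summary {
  exitCode: 0 | 1;
  message: string;
}

// Collapse individual check results into one exit code and a short report.
function summarize(results: CheckResult[]): Summary {
  const failed = results.filter((r) => !r.ok);
  if (failed.length === 0) {
    return { exitCode: 0, message: "All checks passed" };
  }
  const names = failed.map((r) => r.name).join(", ");
  return {
    exitCode: 1,
    message: `${failed.length} check(s) failed: ${names}`,
  };
}
```

Returning a single non-zero code for any failure keeps the script easy to use from CI jobs and cron, where only the exit status is inspected.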

Configuration defaults can be overridden with environment variables:

API_GATEWAY_URL=https://api.zeswa.com
ADMIN_URL=https://admin.zeswa.com
HUB_URL=https://www.zeswa.com

These values are loaded from docker.env if present.

Docker Health Checks

Every service in docker-compose.yml has a health check configured. These health checks are used by Docker to determine container health and by depends_on conditions to control startup order.

Health Check Configuration by Service

Service | Test Command | Interval | Timeout | Retries | Start Period
MongoDB | mongosh --eval "db.adminCommand('ping')" | 30s | 10s | 3 | 40s
Redis | redis-cli --raw incr ping | 30s | 10s | 3 | 20s
Auth Service | wget --spider http://127.0.0.1:9000/health | 30s | 10s | 3 | 40s
Profile Service | wget --spider http://127.0.0.1:9100/health | 30s | 10s | 3 | 40s
Organisation Service | wget --spider http://127.0.0.1:9107/health | 30s | 10s | 3 | 40s
Media Service | wget --spider http://127.0.0.1:9102/health | 30s | 10s | 3 | 40s
Message Queue | pgrep -f node | 30s | 10s | 3 | 40s
Admin Portal | wget --spider http://127.0.0.1:3000/ | 30s | 10s | 3 | 40s
Zeswa Hub | wget --spider http://127.0.0.1:4000/ | 30s | 10s | 3 | 40s
Nginx | wget --spider http://127.0.0.1/health | 30s | 10s | 3 | 20s

Startup Order

Docker Compose uses depends_on with condition: service_healthy to enforce startup order:

  1. MongoDB and Redis start first.
  2. Application services (auth, profile, organisation, media, message-queue) start after MongoDB and Redis are healthy.
  3. Frontend apps (admin-portal, zeswa-hub) start after all application services are healthy.
  4. Nginx starts last, after all upstream services are healthy.
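The ordering above follows from the dependency graph. As a sketch, the function below groups services into startup "waves", where each wave contains the services whose dependencies are all already started. The service keys in exampleGraph are assumptions based on the names used in this guide, not a copy of docker-compose.yml, and Compose itself starts each service as soon as its own dependencies are healthy rather than in strict waves.

```typescript
// Map of service name to the services it waits for (condition: service_healthy).
type DependsOn = Record<string, string[]>;

// Group services into waves: a service joins a wave once all its dependencies started.
function startupWaves(dependsOn: DependsOn): string[][] {
  const started: string[] = [];
  let pending = Object.keys(dependsOn).sort();
  const waves: string[][] = [];
  while (pending.length > 0) {
    const wave = pending.filter((svc) =>
      dependsOn[svc].every((dep) => started.indexOf(dep) !== -1),
    );
    if (wave.length === 0) {
      throw new Error("dependency cycle or unknown dependency");
    }
    started.push(...wave);
    pending = pending.filter((svc) => wave.indexOf(svc) === -1);
    waves.push(wave);
  }
  return waves;
}

// Hypothetical graph matching the four steps described above.
const backends = ["auth", "profile", "organisation", "media", "message-queue"];
const exampleGraph: DependsOn = {
  mongodb: [],
  redis: [],
  auth: ["mongodb", "redis"],
  profile: ["mongodb", "redis"],
  organisation: ["mongodb", "redis"],
  media: ["mongodb", "redis"],
  "message-queue": ["mongodb", "redis"],
  "admin-portal": backends,
  "zeswa-hub": backends,
  nginx: backends.concat(["admin-portal", "zeswa-hub"]),
};
```

Calling startupWaves(exampleGraph) yields four waves: the databases, the application services, the frontend apps, and finally nginx.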

Restart Policies

Application services use an on-failure restart policy with a fixed delay between attempts and a cap on the number of attempts:

restart_policy:
  condition: on-failure
  delay: 5s
  max_attempts: 3
  window: 120s

Infrastructure services (MongoDB, Redis, Nginx) use restart: always.

Resource Monitoring

Docker Stats

Monitor real-time resource consumption across all containers:

docker stats

For a one-time snapshot:

docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"

Resource Limits

Each service has defined resource limits and reservations in docker-compose.yml:

Service | CPU Limit | Memory Limit | CPU Reservation | Memory Reservation
MongoDB | 2.0 | 2 GB | 0.5 | 512 MB
Redis | 1.0 | 768 MB | 0.25 | 256 MB
Auth Service | 1.0 | 512 MB | 0.25 | 128 MB
Profile Service | 1.0 | 512 MB | 0.25 | 128 MB
Organisation Service | 1.0 | 512 MB | 0.25 | 128 MB
Media Service | 1.0 | 1 GB | 0.25 | 256 MB
Message Queue | 0.5 | 512 MB | 0.1 | 128 MB
Admin Portal | 1.0 | 512 MB | 0.25 | 128 MB
Zeswa Hub | 1.0 | 512 MB | 0.25 | 128 MB
Nginx | 0.5 | 256 MB | 0.1 | 64 MB

Docker Log Size Limits

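All containers use the json-file logging driver with size limits. The driver keeps at most max-file files of up to max-size each, so the settings cap log storage per container:

  • Infrastructure services (MongoDB, Redis): max-size: 10m, max-file: 3 (about 30 MB per container)
  • Application services: max-size: 10m, max-file: 5 (about 50 MB per container)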

Disk Usage

Check Docker disk usage:

docker system df

For detailed volume usage:

docker system df -v

Key Metrics to Watch

Application Metrics (via SigNoz)

  • Request latency – P95 latency per service. Alert if consistently above 500ms.
  • Error rate – Percentage of 5xx responses. Alert if above 1%.
  • Request throughput – Requests per second per service. Watch for unexpected drops.
  • Database query latency – MongoDB operation duration. Alert if above 100ms for reads.
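SigNoz computes latency percentiles itself, but a small worked example shows what a P95 threshold means. The sketch below uses the nearest-rank method, which is one common definition and not necessarily the one SigNoz applies; the function name percentile and the sample values are illustrative.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Ten request latencies in milliseconds; one slow outlier.
const latenciesMs = [120, 80, 95, 700, 110, 130, 90, 100, 105, 115];

const p50 = percentile(latenciesMs, 50); // 105
const p95 = percentile(latenciesMs, 95); // 700

// Threshold from the list above: alert when P95 stays above 500 ms.
const p95AboveThreshold = p95 > 500; // true
```

The median here looks healthy at 105 ms while P95 is 700 ms, which is why the alert is defined on a high percentile rather than on the average.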

Infrastructure Metrics

  • Container CPU usage – Alert if any service sustains above 80% of its limit.
  • Container memory usage – Alert if any service exceeds 80% of its allocated memory.
  • Disk space – Monitor Docker volumes (especially mongodb_data and media_uploads). Alert below 10% free.
  • Redis memory – Redis is configured with maxmemory 512mb and allkeys-lru eviction. Monitor memory usage approaching the limit.
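The 80% memory rule can be checked against the MemUsage column printed by the docker stats snapshot command shown earlier. The sketch below assumes values in the usual "used / limit" form with B, KiB, MiB, or GiB units; the helper names parseSize and memoryPercent are hypothetical.

```typescript
const UNIT_BYTES: Record<string, number> = {
  B: 1,
  KiB: 1024,
  MiB: 1024 * 1024,
  GiB: 1024 * 1024 * 1024,
};

// Convert a size such as "410MiB" into bytes.
function parseSize(text: string): number {
  const match = /^([\d.]+)\s*(B|KiB|MiB|GiB)$/.exec(text.trim());
  if (!match) throw new Error(`unrecognised size: ${text}`);
  return parseFloat(match[1]) * UNIT_BYTES[match[2]];
}

// Convert a MemUsage value such as "410MiB / 512MiB" into percent of the limit.
function memoryPercent(memUsage: string): number {
  const parts = memUsage.split("/");
  if (parts.length !== 2) throw new Error(`unrecognised MemUsage: ${memUsage}`);
  return (parseSize(parts[0]) / parseSize(parts[1])) * 100;
}

// 410 MiB of a 512 MiB limit is just over the 80% alert threshold.
const overThreshold = memoryPercent("410MiB / 512MiB") > 80; // true
const quarter = memoryPercent("256MiB / 1GiB"); // 25
```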

Nginx Metrics

  • Request rate – Extracted from access logs. Use scripts/log-monitor.sh -p for a per-hour breakdown.
  • 4xx/5xx rates – Monitor 404-errors.log and server-errors.log for spikes.
  • Upstream response time – Available in the detailed log format via the urt field.
  • Rate limit rejections – 429 status codes indicate rate limiting is being triggered.

Alerting Recommendations

Suggested Alert Rules

Alert | Condition | Severity | Channel
Service Down | Health check fails 3 consecutive times | Critical | PagerDuty / Slack
High Error Rate | 5xx rate > 1% for 5 minutes | Critical | Slack
High Latency | P95 > 2s for 5 minutes | Warning | Slack
Memory Pressure | Container memory > 85% of limit | Warning | Slack
Disk Space Low | Volume usage > 90% | Warning | Email
MongoDB Slow Query | Query duration > 500ms | Warning | Slack
Redis Memory High | Memory usage > 400 MB (of 512 MB limit) | Warning | Slack
Rate Limit Spike | 429 responses > 100/min | Info | Slack
SSL Certificate Expiry | Certificate expires within 14 days | Warning | Email
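Conditions of the form "above X for 5 minutes" fire only when every evaluation in the window breaches the threshold, which filters out short spikes. A minimal sketch of the High Error Rate rule, assuming one bucket of request counts per minute; the names MinuteBucket and errorRateFiring are hypothetical, and SigNoz evaluates such rules internally.

```typescript
interface MinuteBucket {
  total: number;     // requests seen in this minute
  errors5xx: number; // of which returned a 5xx status
}

// True when the 5xx rate exceeded thresholdPct in each of the last forMinutes buckets.
function errorRateFiring(
  buckets: MinuteBucket[],
  thresholdPct: number = 1,
  forMinutes: number = 5,
): boolean {
  if (buckets.length < forMinutes) return false; // not enough history yet
  const recent = buckets.slice(-forMinutes);
  return recent.every(
    // Assumption: a minute with no traffic does not count as a breach.
    (b) => b.total > 0 && (b.errors5xx / b.total) * 100 > thresholdPct,
  );
}

const sustained: MinuteBucket[] = [
  { total: 1000, errors5xx: 20 },
  { total: 1000, errors5xx: 25 },
  { total: 1000, errors5xx: 30 },
  { total: 1000, errors5xx: 22 },
  { total: 1000, errors5xx: 18 },
];
const firing = errorRateFiring(sustained); // true: 1.8% to 3% in every minute
```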

Setting Up Alerts in SigNoz

  1. Navigate to the Alerts section in the SigNoz UI.
  2. Create a new alert rule using the query builder, a PromQL expression, or a ClickHouse query.
  3. Set the threshold, evaluation interval, and notification channel.
  4. Configure notification channels (Slack webhook, PagerDuty, email) under Settings > Alert Channels.

External Health Monitoring

For external uptime monitoring, configure an external service (e.g., UptimeRobot, Pingdom) to periodically check:

  • https://api.zeswa.com/health
  • https://www.zeswa.com/health
  • https://admin.zeswa.com/health

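These endpoints return 200 with "healthy\n" when Nginx is up and serving the virtual host. Because the response is produced by Nginx itself, also check the per-service gateway URLs (for example https://api.zeswa.com/auth/health) to confirm that the upstream services are operational.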

Periodic Monitoring Tasks

Task | Frequency | Command / Action
Run full health check | Every deployment, daily | ./scripts/health-check.sh
Review error logs | Daily | ./scripts/log-monitor.sh -e
Check resource usage | Daily | docker stats --no-stream
Review security events | Daily | ./scripts/log-monitor.sh -t
Verify backup integrity | Weekly | Check backups/ directory contents
Review SigNoz dashboards | Weekly | Check latency trends, error rate trends
Clean up Docker resources | Monthly | docker system prune (with caution)
Review disk usage | Monthly | docker system df -v