Monitoring

This guide covers the observability and monitoring stack for the eTeamups Platform (Zeswa Platform), including distributed tracing, health check endpoints, Docker health checks, and resource monitoring.

OpenTelemetry Integration

The platform uses OpenTelemetry (OTEL) to collect distributed traces and logs from all application services. The telemetry data is exported to a SigNoz backend for visualization and alerting.

How It Works

The instrumentation is initialized in instrumentation.ts at the root of the repository. When a service starts, the OpenTelemetry Node SDK is loaded and begins collecting:

Distributed traces – HTTP requests, database queries, and inter-service calls are automatically instrumented.
Logs – Winston log records are correlated with trace context and exported via OTLP.

The SDK configures the following exporters:

Exporter	Protocol	Endpoint Environment Variable
Trace exporter	OTLP/HTTP	`OTEL_EXPORTER_OTLP_ENDPOINT` (appends `/v1/traces`)
Log exporter	OTLP/HTTP	`OTEL_EXPORTER_OTLP_LOGS_ENDPOINT` (appends `/v1/logs`)

Each service identifies itself using the OTEL_SERVICE_NAME environment variable (defaults to zeswa-platform).
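The endpoint rules above can be sketched as a small pure function. This is an illustration only, not the contents of instrumentation.ts: the helper names (`resolveTelemetryTargets`, `joinUrl`) and the example collector address are hypothetical, and the sketch assumes an exporter is simply left unconfigured when its variable is unset.

```typescript
// Hypothetical helper mirroring the endpoint table above.
// Only the environment variable names and the default service name come from this guide.
type Env = Record<string, string | undefined>;

interface TelemetryTargets {
  serviceName: string;
  tracesUrl: string | undefined;
  logsUrl: string | undefined;
}

// Join a base URL and a signal path without producing a double slash.
function joinUrl(base: string, path: string): string {
  return base.replace(/\/+$/, "") + path;
}

function resolveTelemetryTargets(env: Env): TelemetryTargets {
  const traceBase = env["OTEL_EXPORTER_OTLP_ENDPOINT"];
  const logBase = env["OTEL_EXPORTER_OTLP_LOGS_ENDPOINT"];
  return {
    // Falls back to the documented default when the variable is unset or empty.
    serviceName: env["OTEL_SERVICE_NAME"] || "zeswa-platform",
    tracesUrl: traceBase ? joinUrl(traceBase, "/v1/traces") : undefined,
    logsUrl: logBase ? joinUrl(logBase, "/v1/logs") : undefined,
  };
}

// Example with a placeholder collector address (not a real platform value).
const exampleTargets = resolveTelemetryTargets({
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://collector.example:4318",
});
console.log(exampleTargets.tracesUrl); // http://collector.example:4318/v1/traces
console.log(exampleTargets.serviceName); // zeswa-platform
```

Keeping the resolution logic separate from SDK start-up makes it easy to check a deployment's environment before any telemetry is sent.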

Auto-Instrumentation

The SDK uses @opentelemetry/auto-instrumentations-node, which automatically instruments:

Express HTTP handlers
Outgoing HTTP/HTTPS requests
MongoDB driver calls
Redis client calls

File system instrumentation is disabled by default to reduce noise.

Winston Log Correlation

The WinstonInstrumentation from @opentelemetry/instrumentation-winston injects trace and span IDs into every Winston log record. This allows you to jump from a log entry in SigNoz directly to the associated trace.

Each log record is annotated with resource.service.name matching the value of OTEL_SERVICE_NAME.

Graceful Shutdown

The SDK listens for SIGTERM and flushes all pending telemetry data before the process exits. This ensures that traces and logs from the final requests are not lost during deployments or container restarts.
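A minimal sketch of this shutdown behaviour follows. It is written against two small interfaces rather than a specific SDK version, so the names `Flushable`, `SignalSource`, and `installGracefulShutdown` are hypothetical. The only assumption about the SDK is that it exposes a `shutdown()` method returning a promise, as the OpenTelemetry Node SDK does.

```typescript
// Anything that can flush pending telemetry and stop, e.g. the OpenTelemetry Node SDK.
interface Flushable {
  shutdown(): Promise<void>;
}

// The subset of the Node.js process object this sketch needs.
interface SignalSource {
  once(signal: string, handler: () => void): unknown;
  exit(code: number): void;
}

// Register a one-shot SIGTERM handler that flushes telemetry, then exits.
function installGracefulShutdown(sdk: Flushable, proc: SignalSource): void {
  proc.once("SIGTERM", () => {
    sdk
      .shutdown()
      .then(() => proc.exit(0))
      .catch((err: unknown) => {
        // A failed flush should not keep the container alive.
        console.error("Telemetry flush failed during shutdown", err);
        proc.exit(1);
      });
  });
}
```

In a service this would be called once at start-up as `installGracefulShutdown(sdk, process)`. Passing the process in as a parameter keeps the handler testable without sending real signals.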

SigNoz Dashboards

SigNoz serves as the observability backend. It receives OTLP data and provides:

Trace Explorer – Search and filter distributed traces across all microservices.
Log Explorer – Query structured logs with full-text search and attribute filtering.
Service Map – Visualize dependencies between auth, profile, organisation, media, and message-queue services.
Dashboards – Build custom dashboards for latency percentiles, error rates, and throughput.

Recommended Dashboards

Create the following dashboards in SigNoz for production monitoring:

Service Overview – P50/P95/P99 latency, request rate, and error rate per service.
Database Performance – MongoDB query latency, connection pool usage, slow queries.
API Gateway – Request volume by endpoint (/auth, /profile, /organisation, /media), status code distribution.
Error Tracking – Error rate trends, top error messages grouped by service.

Health Check Endpoints

The platform exposes health check endpoints at multiple layers.

Application Health Checks

Every microservice registers two health endpoints in libs/server/src/Server.ts:

Endpoint	Response	Purpose
`GET /_ping`	`200` (empty body)	Private liveness probe. Returns only a status code with no server metadata.
`GET /health`	`200 {"status": "ok"}`	Container health probe. Used by Docker and load balancers.
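The contract of the two endpoints can be sketched as a framework-agnostic function. This is not the code in libs/server/src/Server.ts: the `handleProbe` name, the `ProbeResponse` shape, and the `application/json` content type are assumptions made for illustration.

```typescript
interface ProbeResponse {
  status: number;
  contentType: string | null;
  body: string;
}

// Hypothetical router: returns the probe response for a request, or null
// when the request is not a health probe and should fall through.
function handleProbe(method: string, path: string): ProbeResponse | null {
  if (method !== "GET") return null;
  switch (path) {
    case "/_ping":
      // Liveness probe: status code only, no body and no server metadata.
      return { status: 200, contentType: null, body: "" };
    case "/health":
      // Container health probe: a small JSON document.
      return {
        status: 200,
        contentType: "application/json",
        body: JSON.stringify({ status: "ok" }),
      };
    default:
      return null;
  }
}
```

Both handlers avoid touching databases or downstream services, so a slow dependency cannot make the probe itself time out.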

Nginx Health Check

The Nginx reverse proxy exposes its own health endpoint on port 80:

GET /health -> 200 "healthy\n"

This endpoint is defined in the HTTP server block in nginx/conf.d/eteamups.conf and has access_log off to avoid polluting logs with probe requests.

Each HTTPS virtual host (Hub, Admin, API) also exposes /health returning the same response.

API Gateway Service Health

Through the Nginx API gateway at api.zeswa.com, each service's health can be checked at:

URL	Upstream
`https://api.zeswa.com/auth/health`	auth-service:9000
`https://api.zeswa.com/profile/health`	profile-service:9100
`https://api.zeswa.com/organisation/health`	organisation-service:9107
`https://api.zeswa.com/media/health`	media-service:9102

Health Check Script

The scripts/health-check.sh script performs a comprehensive check of the entire platform:

./scripts/health-check.sh

This script checks:

Docker service status – Verifies all 10 containers are running and reports their health status (healthy, running, unhealthy).
HTTP endpoints – Curls each public-facing URL and verifies the expected HTTP status code.
Database connectivity – Runs mongosh ping and redis-cli ping inside the respective containers.
Resource usage – Outputs docker stats for CPU and memory consumption.
Disk usage – Runs docker system df to show image, container, and volume disk usage.

The script exits with code 0 if all checks pass, or 1 with a count of failures and troubleshooting hints.
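The exit-code rule can be illustrated with a short sketch. The script itself is a shell script; the TypeScript below only models how individual check results roll up into a single exit code, and the `CheckResult` and `CheckSummary` types and the message wording are hypothetical.

```typescript
interface CheckResult {
  name: string; // e.g. a container name or an endpoint URL
  passed: boolean;
}

interface CheckSummary {
  exitCode: number; // 0 when everything passed, 1 otherwise
  failures: string[];
  message: string;
}

// Collapse individual check results into one exit code and a failure count.
function summarizeChecks(results: CheckResult[]): CheckSummary {
  const failures = results.filter((r) => !r.passed).map((r) => r.name);
  if (failures.length === 0) {
    return { exitCode: 0, failures, message: "All checks passed" };
  }
  return {
    exitCode: 1,
    failures,
    message: `${failures.length} check(s) failed: ${failures.join(", ")}`,
  };
}
```

Because any single failure produces exit code 1, the script can gate a deployment pipeline step without parsing its output.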

Configuration defaults can be overridden with environment variables:

API_GATEWAY_URL=https://api.zeswa.com
ADMIN_URL=https://admin.zeswa.com
HUB_URL=https://www.zeswa.com

These values are loaded from docker.env if present.

Docker Health Checks

Every service in docker-compose.yml has a health check configured. These health checks are used by Docker to determine container health and by depends_on conditions to control startup order.

Health Check Configuration by Service

Service	Test Command	Interval	Timeout	Retries	Start Period
MongoDB	`mongosh --eval "db.adminCommand('ping')"`	30s	10s	3	40s
Redis	`redis-cli --raw incr ping`	30s	10s	3	20s
Auth Service	`wget --spider http://127.0.0.1:9000/health`	30s	10s	3	40s
Profile Service	`wget --spider http://127.0.0.1:9100/health`	30s	10s	3	40s
Organisation Service	`wget --spider http://127.0.0.1:9107/health`	30s	10s	3	40s
Media Service	`wget --spider http://127.0.0.1:9102/health`	30s	10s	3	40s
Message Queue	`pgrep -f node`	30s	10s	3	40s
Admin Portal	`wget --spider http://127.0.0.1:3000/`	30s	10s	3	40s
Zeswa Hub	`wget --spider http://127.0.0.1:4000/`	30s	10s	3	40s
Nginx	`wget --spider http://127.0.0.1/health`	30s	10s	3	20s
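These timing values determine how quickly Docker notices a failed service. As a rough estimate, Docker marks a container unhealthy after `retries` consecutive failed probes, and each probe can take up to `interval` plus `timeout` to fail. Failures during the start period do not count toward the retry limit, so the sketch below applies only once a container has started. The type and function names are hypothetical.

```typescript
interface HealthCheckTiming {
  intervalSec: number;
  timeoutSec: number;
  retries: number;
}

// Approximate upper bound, in seconds, between a service breaking just after a
// passing probe and Docker reporting the container as unhealthy.
function worstCaseDetectionSec(t: HealthCheckTiming): number {
  return t.retries * (t.intervalSec + t.timeoutSec);
}

// Values shared by every row in the table above.
const tableTiming: HealthCheckTiming = { intervalSec: 30, timeoutSec: 10, retries: 3 };
console.log(worstCaseDetectionSec(tableTiming)); // 120
```

With the configured values, a hung service can therefore go unreported for up to about two minutes, which is worth keeping in mind when choosing alert windows.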

Startup Order

Docker Compose uses depends_on with condition: service_healthy to enforce startup order:

MongoDB and Redis start first.
Application services (auth, profile, organisation, media, message-queue) start after MongoDB and Redis are healthy.
Frontend apps (admin-portal, zeswa-hub) start after all application services are healthy.
Nginx starts last, after all upstream services are healthy.
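The startup order above amounts to layering a dependency graph. The following sketch derives the tiers from a graph written by hand. The Compose service names and the exact dependency lists are assumptions based on the description above, not a copy of docker-compose.yml.

```typescript
// Maps each service to the services that must be healthy before it starts.
type DependencyGraph = Record<string, string[]>;

// Group services into tiers: each tier depends only on earlier tiers.
function startupTiers(graph: DependencyGraph): string[][] {
  let remaining = Object.keys(graph).sort();
  const started: Record<string, boolean> = {};
  const tiers: string[][] = [];
  while (remaining.length > 0) {
    const ready = remaining.filter((svc) =>
      (graph[svc] || []).every((dep) => started[dep] === true),
    );
    if (ready.length === 0) {
      throw new Error("Dependency cycle or unknown dependency detected");
    }
    for (const svc of ready) started[svc] = true;
    remaining = remaining.filter((svc) => started[svc] !== true);
    tiers.push(ready);
  }
  return tiers;
}

// Assumed service names and dependencies, for illustration only.
const dataStores = ["mongodb", "redis"];
const appServices = [
  "auth-service",
  "profile-service",
  "organisation-service",
  "media-service",
  "message-queue",
];
const frontends = ["admin-portal", "zeswa-hub"];

const composeGraph: DependencyGraph = { nginx: appServices.concat(frontends) };
for (const svc of dataStores) composeGraph[svc] = [];
for (const svc of appServices) composeGraph[svc] = dataStores;
for (const svc of frontends) composeGraph[svc] = appServices;

console.log(startupTiers(composeGraph));
// Four tiers: data stores, application services, frontends, then nginx.
```

A cycle or a misspelled dependency raises an error instead of looping forever.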

Restart Policies

Application services restart on failure, with a fixed 5-second delay between attempts and at most three attempts:

restart_policy:
  condition: on-failure
  delay: 5s
  max_attempts: 3
  window: 120s

Infrastructure services (MongoDB, Redis, Nginx) use restart: always.

Resource Monitoring

Docker Stats

Monitor real-time resource consumption across all containers:

docker stats

For a one-time snapshot:

docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"

Resource Limits

Each service has defined resource limits and reservations in docker-compose.yml:

Service	CPU Limit	Memory Limit	CPU Reservation	Memory Reservation
MongoDB	2.0	2 GB	0.5	512 MB
Redis	1.0	768 MB	0.25	256 MB
Auth Service	1.0	512 MB	0.25	128 MB
Profile Service	1.0	512 MB	0.25	128 MB
Organisation Service	1.0	512 MB	0.25	128 MB
Media Service	1.0	1 GB	0.25	256 MB
Message Queue	0.5	512 MB	0.1	128 MB
Admin Portal	1.0	512 MB	0.25	128 MB
Zeswa Hub	1.0	512 MB	0.25	128 MB
Nginx	0.5	256 MB	0.1	64 MB

Docker Log Size Limits

All containers use the json-file logging driver with size limits:

Infrastructure services (MongoDB, Redis): max-size: 10m, max-file: 3
Application services: max-size: 10m, max-file: 5

Disk Usage

Check Docker disk usage:

docker system df

For detailed volume usage:

docker system df -v

Key Metrics to Watch

Application Metrics (via SigNoz)

Request latency – P95 latency per service. Alert if consistently above 500ms.
Error rate – Percentage of 5xx responses. Alert if above 1%.
Request throughput – Requests per second per service. Watch for unexpected drops.
Database query latency – MongoDB operation duration. Alert if above 100ms for reads.

Infrastructure Metrics

Container CPU usage – Alert if any service sustains above 80% of its limit.
Container memory usage – Alert if any service exceeds 80% of its allocated memory.
Disk space – Monitor Docker volumes (especially mongodb_data and media_uploads). Alert below 10% free.
Redis memory – Redis is configured with maxmemory 512mb and allkeys-lru eviction. Monitor memory usage approaching the limit.
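The 80% memory rule can be checked against the `MemUsage` column of `docker stats`. The sketch below assumes that column has the form `used / limit` with `B`, `KiB`, `MiB`, or `GiB` units. The function names are hypothetical and the sample strings are made-up values.

```typescript
const UNIT_BYTES: Record<string, number> = {
  B: 1,
  KiB: 1024,
  MiB: 1024 * 1024,
  GiB: 1024 * 1024 * 1024,
};

// Parse a size such as "430MiB" or "1.5GiB" into bytes.
function parseSize(text: string): number {
  const match = /^([\d.]+)\s*(B|KiB|MiB|GiB)$/.exec(text.trim());
  if (!match) throw new Error(`Unrecognized size: ${text}`);
  const [, value = "", unit = ""] = match;
  const factor = UNIT_BYTES[unit];
  if (factor === undefined) throw new Error(`Unknown unit: ${unit}`);
  return parseFloat(value) * factor;
}

// Fraction of the memory limit in use, from a string like "430MiB / 512MiB".
function memoryUsageFraction(memUsage: string): number {
  const [used = "", limit = ""] = memUsage.split("/");
  return parseSize(used) / parseSize(limit);
}

// True when usage is above the given fraction of the limit (80% by default).
function exceedsMemoryThreshold(memUsage: string, fraction: number = 0.8): boolean {
  return memoryUsageFraction(memUsage) > fraction;
}

console.log(exceedsMemoryThreshold("430MiB / 512MiB")); // true
console.log(exceedsMemoryThreshold("1.5GiB / 2GiB")); // false
```

The same helper works for any threshold by passing a different fraction as the second argument.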

Nginx Metrics

Request rate – Extracted from access logs. Use scripts/log-monitor.sh -p for a per-hour breakdown.
4xx/5xx rates – Monitor 404-errors.log and server-errors.log for spikes.
Upstream response time – Available in the detailed log format via the urt field.
Rate limit rejections – 429 status codes indicate rate limiting is being triggered.

Alerting Recommendations

Suggested Alert Rules

Alert	Condition	Severity	Channel
Service Down	Health check fails 3 consecutive times	Critical	PagerDuty / Slack
High Error Rate	5xx rate > 1% for 5 minutes	Critical	Slack
High Latency	P95 > 2s for 5 minutes	Warning	Slack
Memory Pressure	Container memory > 85% of limit	Warning	Slack
Disk Space Low	Volume usage > 90%	Warning	Email
MongoDB Slow Query	Query duration > 500ms	Warning	Slack
Redis Memory High	Memory usage > 400 MB (of 512 MB limit)	Warning	Slack
Rate Limit Spike	429 responses > 100/min	Info	Slack
SSL Certificate Expiry	Certificate expires within 14 days	Warning	Email
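Conditions of the form "above a threshold for 5 minutes" fire only when the breach is sustained. The sketch below shows one simple way to evaluate such a condition over a series of samples. It illustrates the idea only: SigNoz applies its own evaluation semantics, and the `RateSample` type and `breachedFor` function are hypothetical.

```typescript
interface RateSample {
  timestampSec: number; // Unix time of the measurement
  value: number; // e.g. 0.02 for a 2% error rate
}

// True when at least one sample falls in the trailing window and every
// sample in that window is above the threshold.
function breachedFor(
  samples: RateSample[],
  threshold: number,
  windowSec: number,
  nowSec: number,
): boolean {
  const windowStart = nowSec - windowSec;
  const inWindow = samples.filter(
    (s) => s.timestampSec > windowStart && s.timestampSec <= nowSec,
  );
  if (inWindow.length === 0) return false;
  return inWindow.every((s) => s.value > threshold);
}

// One sample per minute for five minutes, all above 1%.
const sustainedErrors: RateSample[] = [60, 120, 180, 240, 300].map((t) => ({
  timestampSec: t,
  value: 0.02,
}));
console.log(breachedFor(sustainedErrors, 0.01, 300, 300)); // true
```

A single sample at or below the threshold inside the window resets the condition, which avoids paging on short spikes.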

Setting Up Alerts in SigNoz

Navigate to the Alerts section in the SigNoz UI.
Create a new alert rule using the query builder, a PromQL query, or a ClickHouse query.
Set the threshold, evaluation interval, and notification channel.
Configure notification channels (Slack webhook, PagerDuty, email) under Settings > Alert Channels.

External Health Monitoring

For external uptime monitoring, configure an external service (e.g., UptimeRobot, Pingdom) to periodically check:

https://api.zeswa.com/health
https://www.zeswa.com/health
https://admin.zeswa.com/health

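These endpoints return 200 with "healthy\n" whenever Nginx is serving requests. Because Nginx generates the response itself, a passing check does not prove that the upstream services are healthy; use the per-service URLs listed under API Gateway Service Health to verify those.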

Periodic Monitoring Tasks

Task	Frequency	Command / Action
Run full health check	After every deployment and daily	`./scripts/health-check.sh`
Review error logs	Daily	`./scripts/log-monitor.sh -e`
Check resource usage	Daily	`docker stats --no-stream`
Review security events	Daily	`./scripts/log-monitor.sh -t`
Verify backup integrity	Weekly	Check `backups/` directory contents
Review SigNoz dashboards	Weekly	Check latency trends, error rate trends
Clean up Docker resources	Monthly	`docker system prune` (with caution: removes stopped containers, unused networks, dangling images, and build cache)
Review disk usage	Monthly	`docker system df -v`