Monitoring
This guide covers the observability and monitoring stack for the eTeamups Platform (Zeswa Platform), including distributed tracing, health check endpoints, Docker health checks, and resource monitoring.
OpenTelemetry Integration
The platform uses OpenTelemetry (OTEL) to collect distributed traces and logs from all application services. The telemetry data is exported to a SigNoz backend for visualization and alerting.
How It Works
The instrumentation is initialized in instrumentation.ts at the root of the repository. When a service starts, the OpenTelemetry Node SDK is loaded and begins collecting:
- Distributed traces – HTTP requests, database queries, and inter-service calls are automatically instrumented.
- Logs – Winston log records are correlated with trace context and exported via OTLP.
The SDK configures the following exporters:
| Exporter | Protocol | Endpoint Environment Variable |
|---|---|---|
| Trace exporter | OTLP/HTTP | OTEL_EXPORTER_OTLP_ENDPOINT (appends /v1/traces) |
| Log exporter | OTLP/HTTP | OTEL_EXPORTER_OTLP_LOGS_ENDPOINT (appends /v1/logs) |
Each service identifies itself using the OTEL_SERVICE_NAME environment variable (defaults to zeswa-platform).
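As an illustration of the endpoint handling described in the table, the following minimal sketch derives the two export URLs and the service name from the environment. The helper name resolveOtlpEndpoints, the fallback base URL http://localhost:4318, and the fallback of the log endpoint to the trace endpoint are assumptions made for this example; they are not taken from instrumentation.ts.

```typescript
// Hypothetical helper for illustration; not the actual contents of instrumentation.ts.
type Env = Record<string, string | undefined>;

interface OtlpEndpoints {
  serviceName: string;
  tracesUrl: string;
  logsUrl: string;
}

// Assumed fallback when no endpoint is configured (conventional OTLP/HTTP port).
const DEFAULT_BASE = "http://localhost:4318";

// Join a base URL and a signal path without producing a double slash.
function joinPath(base: string, path: string): string {
  return base.replace(/\/+$/, "") + path;
}

function resolveOtlpEndpoints(env: Env): OtlpEndpoints {
  const traceBase = env.OTEL_EXPORTER_OTLP_ENDPOINT ?? DEFAULT_BASE;
  // Assumption: logs reuse the trace base when no dedicated variable is set.
  const logBase = env.OTEL_EXPORTER_OTLP_LOGS_ENDPOINT ?? traceBase;
  return {
    serviceName: env.OTEL_SERVICE_NAME ?? "zeswa-platform",
    tracesUrl: joinPath(traceBase, "/v1/traces"),
    logsUrl: joinPath(logBase, "/v1/logs"),
  };
}
```

In a Node service this would typically be called as resolveOtlpEndpoints(process.env), with the resulting URLs passed to the trace and log exporters.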
Auto-Instrumentation
The SDK uses @opentelemetry/auto-instrumentations-node which automatically instruments:
- Express HTTP handlers
- Outgoing HTTP/HTTPS requests
- MongoDB driver calls
- Redis client calls
File system instrumentation is disabled by default to reduce noise.
Winston Log Correlation
The WinstonInstrumentation from @opentelemetry/instrumentation-winston injects trace and span IDs into every Winston log record. This allows you to jump from a log entry in SigNoz directly to the associated trace.
Each log record is annotated with resource.service.name matching the value of OTEL_SERVICE_NAME.
Graceful Shutdown
The SDK listens for SIGTERM and flushes all pending telemetry data before the process exits. This ensures that traces and logs from the final requests are not lost during deployments or container restarts.
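The shutdown behaviour can be sketched roughly as follows. This is a simplified illustration, not the code in instrumentation.ts: registerGracefulShutdown, the flush callback, and the injected exit function are hypothetical names, chosen so that the logic can be exercised without a real process or SDK.

```typescript
// Hypothetical sketch: `flush` stands in for the SDK call that exports pending telemetry.
type SignalSource = {
  once(event: "SIGTERM", listener: () => void): unknown;
};

function registerGracefulShutdown(
  source: SignalSource,
  flush: () => Promise<void>,
  exit: (code: number) => void,
): void {
  source.once("SIGTERM", () => {
    // Export pending traces and logs first, then terminate.
    flush()
      .then(() => exit(0))
      // Exit non-zero if the flush fails so the failure is visible to the orchestrator.
      .catch(() => exit(1));
  });
}
```

In a service, source would be the Node process object, flush would wrap the SDK's shutdown call, and exit would call process.exit.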
SigNoz Dashboards
SigNoz serves as the observability backend. It receives OTLP data and provides:
- Trace Explorer – Search and filter distributed traces across all microservices.
- Log Explorer – Query structured logs with full-text search and attribute filtering.
- Service Map – Visualize dependencies between auth, profile, organisation, media, and message-queue services.
- Dashboards – Build custom dashboards for latency percentiles, error rates, and throughput.
Recommended Dashboards
Create the following dashboards in SigNoz for production monitoring:
- Service Overview – P50/P95/P99 latency, request rate, and error rate per service.
- Database Performance – MongoDB query latency, connection pool usage, slow queries.
- API Gateway – Request volume by endpoint (/auth, /profile, /organisation, /media), status code distribution.
- Error Tracking – Error rate trends, top error messages grouped by service.
Health Check Endpoints
The platform exposes health check endpoints at multiple layers.
Application Health Checks
Every microservice registers two health endpoints in libs/server/src/Server.ts:
| Endpoint | Response | Purpose |
|---|---|---|
| GET /_ping | 200 (empty body) | Private liveness probe. Returns only a status code with no server metadata. |
| GET /health | 200 {"status": "ok"} | Container health probe. Used by Docker and load balancers. |
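The behaviour of the two endpoints can be summarised in a small framework-independent sketch. The function name handleHealthRoute and the HealthResponse shape are hypothetical; the actual registration in libs/server/src/Server.ts may look different.

```typescript
// Hypothetical sketch of the two health routes; not the code in Server.ts.
interface HealthResponse {
  status: number;
  body: string;
  contentType?: string;
}

function handleHealthRoute(method: string, path: string): HealthResponse | undefined {
  if (method !== "GET") return undefined;
  switch (path) {
    case "/_ping":
      // Liveness probe: status code only, no body and no server metadata.
      return { status: 200, body: "" };
    case "/health":
      // Container health probe consumed by Docker and load balancers.
      return {
        status: 200,
        body: JSON.stringify({ status: "ok" }),
        contentType: "application/json",
      };
    default:
      // Not a health route; the regular router should handle it.
      return undefined;
  }
}
```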
Nginx Health Check
The Nginx reverse proxy exposes its own health endpoint on port 80:
GET /health -> 200 "healthy\n"
This endpoint is defined in the HTTP server block in nginx/conf.d/eteamups.conf and has access_log off to avoid polluting logs with probe requests.
Each HTTPS virtual host (Hub, Admin, API) also exposes /health returning the same response.
API Gateway Service Health
Through the Nginx API gateway at api.zeswa.com, each service's health can be checked via:
| URL | Upstream |
|---|---|
| https://api.zeswa.com/auth/health | auth-service:9000 |
| https://api.zeswa.com/profile/health | profile-service:9100 |
| https://api.zeswa.com/organisation/health | organisation-service:9107 |
| https://api.zeswa.com/media/health | media-service:9102 |
Health Check Script
The scripts/health-check.sh script performs a comprehensive check of the entire platform:
./scripts/health-check.sh
This script checks:
- Docker service status – Verifies all 10 containers are running and reports their health status (healthy, running, unhealthy).
- HTTP endpoints – Curls each public-facing URL and verifies the expected HTTP status code.
- Database connectivity – Runs mongosh ping and redis-cli ping inside the respective containers.
- Resource usage – Outputs docker stats for CPU and memory consumption.
- Disk usage – Runs docker system df to show image, container, and volume disk usage.
The script exits with code 0 if all checks pass, or 1 with a count of failures and troubleshooting hints.
Configuration defaults can be overridden with environment variables:
API_GATEWAY_URL=https://api.zeswa.com
ADMIN_URL=https://admin.zeswa.com
HUB_URL=https://www.zeswa.com
These values are loaded from docker.env if present.
Docker Health Checks
Every service in docker-compose.yml has a health check configured. These health checks are used by Docker to determine container health and by depends_on conditions to control startup order.
Health Check Configuration by Service
| Service | Test Command | Interval | Timeout | Retries | Start Period |
|---|---|---|---|---|---|
| MongoDB | mongosh --eval "db.adminCommand('ping')" | 30s | 10s | 3 | 40s |
| Redis | redis-cli --raw incr ping | 30s | 10s | 3 | 20s |
| Auth Service | wget --spider http://127.0.0.1:9000/health | 30s | 10s | 3 | 40s |
| Profile Service | wget --spider http://127.0.0.1:9100/health | 30s | 10s | 3 | 40s |
| Organisation Service | wget --spider http://127.0.0.1:9107/health | 30s | 10s | 3 | 40s |
| Media Service | wget --spider http://127.0.0.1:9102/health | 30s | 10s | 3 | 40s |
| Message Queue | pgrep -f node | 30s | 10s | 3 | 40s |
| Admin Portal | wget --spider http://127.0.0.1:3000/ | 30s | 10s | 3 | 40s |
| Zeswa Hub | wget --spider http://127.0.0.1:4000/ | 30s | 10s | 3 | 40s |
| Nginx | wget --spider http://127.0.0.1/health | 30s | 10s | 3 | 20s |
Startup Order
Docker Compose uses depends_on with condition: service_healthy to enforce startup order:
- MongoDB and Redis start first.
- Application services (auth, profile, organisation, media, message-queue) start after MongoDB and Redis are healthy.
- Frontend apps (admin-portal, zeswa-hub) start after all application services are healthy.
- Nginx starts last, after all upstream services are healthy.
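To make the resulting order concrete, the sketch below groups services into startup stages from a dependency map. The service names follow the list above, but the exact depends_on entries and the helper startupStages are assumptions for illustration; docker-compose.yml remains the source of truth.

```typescript
// Illustrative only: the dependency entries are assumed, not read from docker-compose.yml.
type DependencyMap = Record<string, string[]>;

// Group services into stages: a service is placed once all of its
// dependencies were placed in an earlier stage (Kahn-style layering).
function startupStages(deps: DependencyMap): string[][] {
  const placed = new Set<string>();
  const stages: string[][] = [];
  let remaining = Object.keys(deps);
  while (remaining.length > 0) {
    const ready = remaining.filter((svc) =>
      (deps[svc] ?? []).every((dep) => placed.has(dep)),
    );
    if (ready.length === 0) throw new Error("Dependency cycle or unknown service");
    ready.forEach((svc) => placed.add(svc));
    stages.push(ready);
    remaining = remaining.filter((svc) => !placed.has(svc));
  }
  return stages;
}

const backing = ["mongodb", "redis"];
const apps = ["auth", "profile", "organisation", "media", "message-queue"];
const frontends = ["admin-portal", "zeswa-hub"];

const deps: DependencyMap = { mongodb: [], redis: [] };
for (const app of apps) deps[app] = backing;
for (const frontend of frontends) deps[frontend] = apps;
deps["nginx"] = [...apps, ...frontends];

// Four stages: backing stores, application services, frontends, then nginx.
const stages = startupStages(deps);
```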
Restart Policies
Application services use a restart policy that retries on failure with a fixed delay between attempts:
restart_policy:
  condition: on-failure
  delay: 5s
  max_attempts: 3
  window: 120s
Infrastructure services (MongoDB, Redis, Nginx) use restart: always.
Resource Monitoring
Docker Stats
Monitor real-time resource consumption across all containers:
docker stats
For a one-time snapshot:
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
Resource Limits
Each service has defined resource limits and reservations in docker-compose.yml:
| Service | CPU Limit | Memory Limit | CPU Reservation | Memory Reservation |
|---|---|---|---|---|
| MongoDB | 2.0 | 2 GB | 0.5 | 512 MB |
| Redis | 1.0 | 768 MB | 0.25 | 256 MB |
| Auth Service | 1.0 | 512 MB | 0.25 | 128 MB |
| Profile Service | 1.0 | 512 MB | 0.25 | 128 MB |
| Organisation Service | 1.0 | 512 MB | 0.25 | 128 MB |
| Media Service | 1.0 | 1 GB | 0.25 | 256 MB |
| Message Queue | 0.5 | 512 MB | 0.1 | 128 MB |
| Admin Portal | 1.0 | 512 MB | 0.25 | 128 MB |
| Zeswa Hub | 1.0 | 512 MB | 0.25 | 128 MB |
| Nginx | 0.5 | 256 MB | 0.1 | 64 MB |
Docker Log Size Limits
All containers use the json-file logging driver with size limits:
- Infrastructure services (MongoDB, Redis): max-size: 10m, max-file: 3
- Application services: max-size: 10m, max-file: 5
Disk Usage
Check Docker disk usage:
docker system df
For detailed volume usage:
docker system df -v
Key Metrics to Watch
Application Metrics (via SigNoz)
- Request latency – P95 latency per service. Alert if consistently above 500ms.
- Error rate – Percentage of 5xx responses. Alert if above 1%.
- Request throughput – Requests per second per service. Watch for unexpected drops.
- Database query latency – MongoDB operation duration. Alert if above 100ms for reads.
Infrastructure Metrics
- Container CPU usage – Alert if any service sustains above 80% of its limit.
- Container memory usage – Alert if any service exceeds 80% of its allocated memory.
- Disk space – Monitor Docker volumes (especially mongodb_data and media_uploads). Alert below 10% free.
- Redis memory – Redis is configured with maxmemory 512mb and allkeys-lru eviction. Monitor memory usage approaching the limit.
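The percentage-of-limit checks above reduce to a simple ratio comparison. The sketch below is a hypothetical helper (exceedsThreshold is not part of the platform), using the 512 MB Auth Service memory limit from the resource limits table as sample input.

```typescript
// Hypothetical helper: true when usage is above the given fraction of the limit.
function exceedsThreshold(used: number, limit: number, fraction: number): boolean {
  if (limit <= 0) throw new RangeError("limit must be positive");
  return used / limit > fraction;
}

const MB = 1024 * 1024;

// With a 512 MB limit, the 80% threshold corresponds to 409.6 MB.
const overThreshold = exceedsThreshold(420 * MB, 512 * MB, 0.8); // true, about 82% of the limit
const underThreshold = exceedsThreshold(400 * MB, 512 * MB, 0.8); // false, about 78% of the limit
```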
Nginx Metrics
- Request rate – Extracted from access logs. Use scripts/log-monitor.sh -p for a per-hour breakdown.
- 4xx/5xx rates – Monitor 404-errors.log and server-errors.log for spikes.
- Upstream response time – Available in the detailed log format via the urt field.
- Rate limit rejections – 429 status codes indicate rate limiting is being triggered.
Alerting Recommendations
Suggested Alert Rules
| Alert | Condition | Severity | Channel |
|---|---|---|---|
| Service Down | Health check fails 3 consecutive times | Critical | PagerDuty / Slack |
| High Error Rate | 5xx rate > 1% for 5 minutes | Critical | Slack |
| High Latency | P95 > 2s for 5 minutes | Warning | Slack |
| Memory Pressure | Container memory > 85% of limit | Warning | Slack |
| Disk Space Low | Volume usage > 90% | Warning | |
| MongoDB Slow Query | Query duration > 500ms | Warning | Slack |
| Redis Memory High | Memory usage > 400 MB (of 512 MB limit) | Warning | Slack |
| Rate Limit Spike | 429 responses > 100/min | Info | Slack |
| SSL Certificate Expiry | Certificate expires within 14 days | Warning | |
Setting Up Alerts in SigNoz
- Navigate to the Alerts section in the SigNoz UI.
- Create a new alert rule using the PromQL or ClickHouse query builder.
- Set the threshold, evaluation interval, and notification channel.
- Configure notification channels (Slack webhook, PagerDuty, email) under Settings > Alert Channels.
External Health Monitoring
For external uptime monitoring, configure an external service (e.g., UptimeRobot, Pingdom) to periodically check:
- https://api.zeswa.com/health
- https://www.zeswa.com/health
- https://admin.zeswa.com/health
These endpoints return 200 with "healthy\n" when Nginx and its upstream services are operational.
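For teams that prefer a self-hosted probe, the check an external monitor performs can be approximated as below. This is a minimal sketch: findUnhealthy and the injected fetchText function are hypothetical, and the HTTP client is passed in so the logic can be tested without network access.

```typescript
// Hypothetical probe; `fetchText` is injected so the check can run without network access.
interface ProbeResponse {
  status: number;
  body: string;
}
type FetchText = (url: string) => Promise<ProbeResponse>;

const HEALTH_URLS = [
  "https://api.zeswa.com/health",
  "https://www.zeswa.com/health",
  "https://admin.zeswa.com/health",
];

// Returns the URLs that did not answer 200 with the body "healthy\n".
async function findUnhealthy(urls: string[], fetchText: FetchText): Promise<string[]> {
  const results = await Promise.all(
    urls.map(async (url) => {
      try {
        const res = await fetchText(url);
        return res.status === 200 && res.body === "healthy\n" ? null : url;
      } catch {
        return url; // Network errors count as failures.
      }
    }),
  );
  return results.filter((u): u is string => u !== null);
}
```

On Node 18 or later, fetchText could wrap the built-in fetch and return the response status together with the text body.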
Periodic Monitoring Tasks
| Task | Frequency | Command / Action |
|---|---|---|
| Run full health check | Every deployment, daily | ./scripts/health-check.sh |
| Review error logs | Daily | ./scripts/log-monitor.sh -e |
| Check resource usage | Daily | docker stats --no-stream |
| Review security events | Daily | ./scripts/log-monitor.sh -t |
| Verify backup integrity | Weekly | Check backups/ directory contents |
| Review SigNoz dashboards | Weekly | Check latency trends, error rate trends |
| Clean up Docker resources | Monthly | docker system prune (with caution) |
| Review disk usage | Monthly | docker system df -v |