Automated health checks every 60 seconds, self-healing via auto-remediation, email alerts on failure and recovery, and a pipeline monitoring dashboard, all running inside the Go ingest service.
The Go ingest service runs four independent health probes continuously. Each validates connectivity, response time, and data integrity for its target service.
ClickHouse: SELECT 1 plus row count validation. Auto-restarts clickhouse-server after 3 consecutive failures.
DragonflyDB: PING plus a test key write/read cycle. Auto-restarts the dragonfly service.
R2: object list plus write test. External service, so alert only, no auto-restart.
Supabase: REST health plus aggregate sync lag check. Alerts if sync lag exceeds 30 minutes.
When a probe detects a failure, the system follows a graduated response: retry, restart, verify, alert. This handles transient failures (memory pressure, connection pool exhaustion) without human intervention.
// Auto-remediation sequence
Probe fails → increment failure counter (1/3)
Probe fails → increment failure counter (2/3)
Probe fails → failure counter reaches 3
  ↓
Auto-restart → systemctl restart {service}
  ↓
Re-check → wait 10s, probe again
  ↓
Service UP? → send recovery notification (includes downtime duration)
Still DOWN? → send critical alert email, keep retrying every 60s
Alerts are sent by email through Cloudflare Workers and Mailgun. The Go service calls the Cloudflare Worker alert endpoint, which formats the email and delivers it to the ops team.
// Alert endpoint
POST /api/admin/system-alert

// Payload
{
  "service": "clickhouse",
  "status": "down" | "recovered",
  "reason": "Connection refused after 3 retries",
  "duration": "2m 15s", // only on recovery
  "timestamp": "2026-03-19T14:30:00Z"
}

// Alert lifecycle
Service fails → 3 retries (90s) → restart attempt → alert email sent
Service recovers → recovery notification with downtime duration
The admin dashboard includes a Data Pipeline page at /system/pipeline with comprehensive infrastructure visibility:
Data Freshness Banner
  Shows last event time, warns if data is stale (>15 min)
  Yellow/red indicators based on staleness threshold
Service Health Cards
  Per-service status: ClickHouse, DragonflyDB, R2, Supabase
  Green/yellow/red indicators with last check timestamp
  Response time metrics
WAL Status
  Current WAL size and file count
  Last WAL write timestamp
  Pending replay count
DLQ Recovery
  Pending events per DLQ (S2S, Pixel, Click)
  Last recovery attempt and result
  Total events recovered today
Failover Status
  Current active tier (Primary / Standby / DLQ)
  Last failover event and duration
  VPS standby connectivity
Aggregator Sync
  Last sync timestamp and lag
  Rows synced per table (partner_daily, geo, media, hourly)
  7-day lookback window status
// Health check (public)
GET https://ingest.relo.mx/health
// Returns: service status, ClickHouse row count, uptime

// Detailed system health (admin)
GET /api/system/health
// Proxied to Go service, returns all 4 probe results

// Data freshness (admin)
GET /api/system/data-freshness
// Returns: last event time, aggregation lag, pipeline status

// System alert (internal)
POST /api/admin/system-alert
// Sends email alert via Mailgun