Operations

System Monitoring & Alerts

Automated health checks run every 60 seconds, backed by self-healing auto-remediation, email alerts on failure and recovery, and a pipeline monitoring dashboard. All of it runs inside the Go ingest service.

1. Four Health Probes (every 60 seconds)

The Go ingest service runs 4 independent health probes continuously. Each validates connectivity, response time, and data integrity for its target service.

ClickHouse

SELECT 1 + row count validation. Auto-restarts clickhouse-server after 3 consecutive failures.

DragonflyDB

PING + test key write/read cycle. Auto-restarts the dragonfly service.

R2 Storage

Object list + write test. External service: alert only, no auto-restart.

Supabase

REST health + aggregate sync lag check. Alerts if sync lag exceeds 30 minutes.
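As a rough illustration of what a single probe might record, the sketch below times a check and applies the 30-minute sync lag limit described for Supabase. It is written in Python for brevity, although the actual service is Go, and the names `ProbeResult`, `run_probe`, and `check_sync_lag` are hypothetical, not taken from the real codebase.

```python
import time
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable

SYNC_LAG_LIMIT = timedelta(minutes=30)  # threshold stated for the Supabase probe


@dataclass
class ProbeResult:
    service: str
    ok: bool
    response_ms: float
    detail: str = ""


def run_probe(service: str, check: Callable[[], None]) -> ProbeResult:
    """Run one check and time it; any exception marks the probe as failed."""
    start = time.monotonic()
    try:
        check()
        ok, detail = True, ""
    except Exception as exc:  # a probe should never crash the monitor loop
        ok, detail = False, str(exc)
    elapsed_ms = (time.monotonic() - start) * 1000
    return ProbeResult(service, ok, elapsed_ms, detail)


def check_sync_lag(last_sync: datetime, now: datetime) -> None:
    """Raise if the aggregate sync is more than 30 minutes behind."""
    lag = now - last_sync
    if lag > SYNC_LAG_LIMIT:
        raise RuntimeError(f"sync lag {lag} exceeds {SYNC_LAG_LIMIT}")
```

A caller would wrap each target in a closure, for example `run_probe("supabase", lambda: check_sync_lag(last_sync, datetime.now()))`, and hand the result to whatever tracks consecutive failures.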

2. Auto-Remediation Flow

When a probe detects a failure, the system follows a graduated response: retry, restart, verify, alert. This handles transient failures (memory pressure, connection pool exhaustion) without human intervention.

// Auto-remediation sequence

Probe fails      → increment failure counter (1/3)
Probe fails      → increment failure counter (2/3)
Probe fails      → failure counter reaches 3 (3/3)
Auto-restart     → systemctl restart {service}
Re-check         → wait 10s, probe again
                    ↓
  Service UP?    → send recovery notification (includes downtime duration)
  Still DOWN?    → send critical alert email, keep retrying every 60s
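The sequence above could be modeled as a small state machine. The following is a minimal sketch in Python (the real service is Go), under some assumptions: the class `Remediator` and its fields are hypothetical, the critical alert is sent once per outage rather than on every cycle, and each later failing cycle retries the restart. The probe, restart, and wait functions are injected so the logic can be exercised without touching systemd.

```python
from dataclasses import dataclass, field
from typing import Callable, List

FAILURE_THRESHOLD = 3  # consecutive failures before a restart, per the sequence above


@dataclass
class Remediator:
    service: str
    probe: Callable[[], bool]      # returns True when the service is healthy
    restart: Callable[[], None]    # e.g. wraps `systemctl restart {service}`
    wait: Callable[[float], None]  # e.g. time.sleep; injectable for tests
    failures: int = 0
    down: bool = False
    events: List[str] = field(default_factory=list)

    def tick(self) -> None:
        """Run one 60-second cycle: probe, count failures, remediate at 3."""
        if self.probe():
            if self.down:
                self.events.append("recovered")
            self.failures, self.down = 0, False
            return
        self.failures += 1
        if self.failures < FAILURE_THRESHOLD:
            return
        self.restart()
        self.wait(10)  # re-check delay from the sequence above
        if self.probe():
            self.events.append("recovered")
            self.failures, self.down = 0, False
        elif not self.down:
            self.events.append("critical")  # assumption: alert once per outage
            self.down = True
```

Calling `tick()` once per probe interval drives the whole flow; the `events` list stands in for the recovery and critical notifications.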

3. Alert Notifications

Alerts are sent via email using Cloudflare Workers + Mailgun. The Go service calls the CF Worker alert endpoint, which formats and delivers the email to the ops team.

// Alert endpoint
POST /api/admin/system-alert

// Payload
{
  "service":   "clickhouse",
  "status":    "down" | "recovered",
  "reason":    "Connection refused after 3 retries",
  "duration":  "2m 15s",   // only on recovery
  "timestamp": "2026-03-19T14:30:00Z"
}

// Alert lifecycle
Service fails    → 3 retries (90s) → restart attempt → alert email sent
Service recovers → recovery notification with downtime duration
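To make the payload shape concrete, here is a small sketch of how a sender might assemble it. It is written in Python for illustration only (the actual caller is the Go service), and `build_alert` and `format_duration` are hypothetical helpers. The `"2m 15s"` duration format is inferred from the sample payload, so durations over an hour would render as minutes under this assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def format_duration(delta: timedelta) -> str:
    """Render a downtime as e.g. '2m 15s' (format inferred from the sample payload)."""
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes}m {seconds}s"


def build_alert(service: str, status: str, reason: str, when: datetime,
                downtime: Optional[timedelta] = None) -> dict:
    """Build the JSON body for POST /api/admin/system-alert."""
    if status not in ("down", "recovered"):
        raise ValueError(f"unknown status: {status!r}")
    payload = {
        "service": service,
        "status": status,
        "reason": reason,
        # `when` is expected to be timezone-aware; it is rendered in UTC
        "timestamp": when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    # "duration" is only included on recovery, as noted in the payload above
    if status == "recovered" and downtime is not None:
        payload["duration"] = format_duration(downtime)
    return payload
```

The resulting dictionary can be serialized with `json.dumps` and sent as the request body.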

4. Pipeline Monitoring Dashboard

The admin dashboard includes a Data Pipeline page at /system/pipeline that gives visibility into each part of the infrastructure:

Pipeline Dashboard Features

Data Freshness Banner
  Shows last event time, warns if data is stale (>15 min)
  Yellow/red indicators based on staleness threshold

Service Health Cards
  Per-service status: ClickHouse, DragonflyDB, R2, Supabase
  Green/yellow/red indicators with last check timestamp
  Response time metrics

WAL Status
  Current WAL size and file count
  Last WAL write timestamp
  Pending replay count

DLQ Recovery
  Pending events per DLQ (S2S, Pixel, Click)
  Last recovery attempt and result
  Total events recovered today

Failover Status
  Current active tier (Primary / Standby / DLQ)
  Last failover event and duration
  VPS standby connectivity

Aggregator Sync
  Last sync timestamp and lag
  Rows synced per table (partner_daily, geo, media, hourly)
  7-day lookback window status
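The Data Freshness Banner logic can be pictured as a simple threshold check. The sketch below is in Python for illustration and makes an assumption worth flagging: only the 15-minute staleness threshold appears in this document, so the 60-minute cutoff for the red indicator (`CRITICAL_AFTER`) is a placeholder value, and the function name `freshness_level` is hypothetical.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(minutes=15)     # from the banner description
CRITICAL_AFTER = timedelta(minutes=60)  # assumption: not specified in this document


def freshness_level(last_event: datetime, now: datetime) -> str:
    """Classify data age as 'ok', 'yellow' (stale), or 'red' (very stale)."""
    age = now - last_event
    if age > CRITICAL_AFTER:
        return "red"
    if age > STALE_AFTER:
        return "yellow"
    return "ok"
```

For example, an event last seen 20 minutes ago would be classified as `"yellow"` under these thresholds.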

5. Monitoring API Endpoints

// Health check (public)
GET  https://ingest.relo.mx/health
// Returns: service status, ClickHouse row count, uptime

// Detailed system health (admin)
GET  /api/system/health
// Proxied to Go service, returns all 4 probe results

// Data freshness (admin)
GET  /api/system/data-freshness
// Returns: last event time, aggregation lag, pipeline status

// System alert (internal)
POST /api/admin/system-alert
// Sends email alert via Mailgun