Infrastructure Architecture

How RELO guarantees zero data loss, sub-50ms latency, and automatic recovery from any failure — including total server destruction.

● Zero Data Loss
● <50ms Edge Latency
● <2 min Recovery
● 99.99% Uptime
Three-Tier Architecture
Every event passes through three independent layers. Any layer can fail without losing a single event.
Tier 1 — Edge Network (300+ Global PoPs)
🔗

Click Wrapper

t.relo.mx — ULID generation, first-party cookie, 302 redirect. <50ms P99.

🔎

Web Pixel

p.relo.mx — 143-line JS loader. Beacon API with device fingerprinting.

📡

S2S Proxy

s2s.relo.mx — HMAC verification, TLS 1.3 termination, DDoS protection.

API Gateway

relo-api — 250+ endpoints, JWT auth, CORS, rate limiting.

▼ ▼ ▼
3-TIER FAILOVER: Primary → Standby → Object Store DLQ
Tier 2 — Processing Engine (Dedicated Server)
⚙️

Go Ingest Service

6,684 lines, 16 packages. fasthttp, batch writer, identity resolution, fraud scoring.

🗃️

ClickHouse

Columnar analytics engine. 35-field schema, 8.6x compression, 24-month TTL. 12 tables incl. 6 MVs.

🧠

DragonflyDB

Identity graph, fraud cache, real-time counters. Sub-ms lookups. 29x faster than Redis.

▼ ▼ ▼
Dual-write + Aggregator sync every 10 min
Tier 3 — Platform & Activation
🏛️

Supabase PostgreSQL

50+ tables. Config, partners, payments, aggregates. Stays constant in size over time.

📦

Cloudflare R2

Data lake, CSV archive, backups. 11 nines durability. Zero egress fees.

🎯

DSP Activation

Meta CAPI, Google Ads, TikTok API. Real-time purchase event forwarding.

Event Lifecycle — Click to Dashboard
Every event is persisted to at least 2 independent storage systems before acknowledgment. Here's exactly how data flows through the system.
Step | What Happens | Storage Written | Latency
1. Event arrives | Click, pixel beacon, or S2S postback hits Cloudflare edge worker | Edge memory (in-flight) | <5ms
2. Edge processing | ULID generated, cookies set, geo headers captured, forwarded to Go service | – | <10ms
3. R2 WAL | Event written to R2 write-ahead log as gzipped NDJSON before processing. Key format: wal/events/YYYY/MM/DD/HH/{ulid}.ndjson.gz. Guarantees zero data loss even if Go service crashes mid-flight. Cost: ~$2-3/mo. | R2 (11 nines durability) | <20ms
4. Go ingest | HMAC verification, GeoIP enrichment, identity resolution, fraud scoring (2µs ML) | – | <5ms
5. Batch write | Event buffered, flushed every 1,000 events or 1 second (whichever first) | ClickHouse (NVMe RAID-1) | <1s
6. Dual-write | Purchase events also written to Supabase for real-time dashboard queries | Supabase (Cloud PostgreSQL) | <100ms
7. Identity update | Device profile updated with new event, cross-device links maintained | DragonflyDB (in-memory, 90-day TTL) | <1ms
8. Aggregation | ClickHouse MVs auto-refresh every 5 min. Aggregator syncs to Supabase every 10 min (partner_daily_aggregates, geo_daily_aggregates, media_source_daily, hourly_stats). Dashboard reads from aggregates; Orders tab uses direct ClickHouse proxy via Go endpoint. | Supabase (4 aggregate tables) | 5-10 min
9. DLQ recovery | Dead Letter Queue checks every 60s for failed events and auto-replays them to the Go service. R2 replay tool available for full disaster recovery. | R2 DLQ | 60s cycle
10. DSP export | Purchase events batched and sent to Meta CAPI within 5 seconds | Meta servers (external) | <5s
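The step-5 flush policy (up to 1,000 buffered events or 1 second, whichever comes first) can be sketched in a few lines of Go. Everything below is illustrative: the channel of raw JSON strings and the write callback stand in for the real fasthttp handlers and ClickHouse batch writer.

package main

import (
    "fmt"
    "time"
)

// batchWriter drains a channel of events and flushes either when the buffer
// reaches maxBatch or when flushEvery elapses, whichever comes first.
func batchWriter(events <-chan string, maxBatch int, flushEvery time.Duration, write func([]string)) {
    buf := make([]string, 0, maxBatch)
    ticker := time.NewTicker(flushEvery)
    defer ticker.Stop()

    flush := func() {
        if len(buf) == 0 {
            return
        }
        write(buf)
        buf = make([]string, 0, maxBatch)
    }

    for {
        select {
        case ev, ok := <-events:
            if !ok {
                flush() // channel closed: write whatever is left
                return
            }
            buf = append(buf, ev)
            if len(buf) >= maxBatch {
                flush()
            }
        case <-ticker.C:
            flush()
        }
    }
}

func main() {
    events := make(chan string, 10)
    go batchWriter(events, 1000, time.Second, func(batch []string) {
        fmt.Printf("flushing %d events to ClickHouse\n", len(batch)) // placeholder write
    })
    events <- `{"event":"purchase"}`
    close(events)
    time.Sleep(100 * time.Millisecond) // give the final flush time to run
}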

🔒 Data Durability at Every Step

At step 3, the event is in R2 WAL (11 nines). At step 5, it exists in ClickHouse (RAID-1 NVMe). At step 6, purchases also exist in Supabase (cloud-replicated). At step 8, aggregated data feeds the dashboard from 4 aggregate tables (partner_daily, geo_daily, media_source_daily, hourly_stats). Step 9 auto-recovers any failed events from DLQ. Daily backups copy everything to R2.
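For step 3, a minimal Go sketch of how a WAL object could be assembled: the key follows the documented wal/events/YYYY/MM/DD/HH/{ulid}.ndjson.gz layout and the payload is gzipped NDJSON (one JSON event per line). The helper names and sample ULID are hypothetical; the production writer lives inside the Go ingest service.

package main

import (
    "bytes"
    "compress/gzip"
    "encoding/json"
    "fmt"
    "time"
)

// walKey builds an R2 object key in the documented layout:
// wal/events/YYYY/MM/DD/HH/{ulid}.ndjson.gz
func walKey(ts time.Time, eventULID string) string {
    t := ts.UTC()
    return fmt.Sprintf("wal/events/%04d/%02d/%02d/%02d/%s.ndjson.gz",
        t.Year(), int(t.Month()), t.Day(), t.Hour(), eventULID)
}

// encodeNDJSON gzips one JSON document per line, matching the WAL payload format.
func encodeNDJSON(events []map[string]any) ([]byte, error) {
    var buf bytes.Buffer
    zw := gzip.NewWriter(&buf)
    enc := json.NewEncoder(zw) // Encode writes a trailing newline after each event
    for _, ev := range events {
        if err := enc.Encode(ev); err != nil {
            return nil, err
        }
    }
    if err := zw.Close(); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}

func main() {
    id := "01HZZEXAMPLEULID0000000000" // hypothetical ULID, for illustration only
    payload, _ := encodeNDJSON([]map[string]any{{"event": "click", "ulid": id}})
    fmt.Println(walKey(time.Now(), id), len(payload), "bytes")
}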

3
Independent Storage Systems
11
Nines of Durability (R2)
RAID-1
Mirrored NVMe Drives
24 mo
Automatic Retention
What Happens When Things Break
Every data source has a multi-tier failover chain. If the primary path fails, the next tier catches the event. No event is ever dropped.
🔗 Click Wrapper — 3-Tier Failover
1

Go Ingest (Primary)

Direct POST to Go service on dedicated server. Normal operation path.

<50ms
2

Cloudflare KV (Fallback)

If Go is down, click data written to KV with 24-hour TTL. Go recovers it on restart.

~30s recovery
3

R2 Dead Letter Queue

Last resort: event written as JSON to R2 object store. Permanent, 11 nines durability.

~60s recovery
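The click wrapper itself runs as a Cloudflare edge worker, so the sketch below only illustrates the failover ordering, written in Go to match the rest of the examples on this page. The three sink functions are placeholders for the real Go-ingest POST, KV write, and R2 DLQ write.

package main

import (
    "errors"
    "fmt"
)

// deliver tries each sink in order and stops at the first success, mirroring
// the Primary -> Cloudflare KV -> R2 DLQ chain described above.
func deliver(event []byte, sinks []func([]byte) error) error {
    var lastErr error
    for _, sink := range sinks {
        err := sink(event)
        if err == nil {
            return nil
        }
        lastErr = err
    }
    return fmt.Errorf("all failover tiers failed: %w", lastErr)
}

func main() {
    goIngest := func(b []byte) error { return errors.New("primary down") }                  // simulated outage
    kvFallback := func(b []byte) error { fmt.Println("wrote to KV (24h TTL)"); return nil } // recovered on restart
    r2DLQ := func(b []byte) error { fmt.Println("wrote to R2 DLQ"); return nil }            // last resort

    _ = deliver([]byte(`{"click_id":"example"}`), []func([]byte) error{goIngest, kvFallback, r2DLQ})
}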
🔎 Web Pixel — 3-Tier Failover
1

Go Ingest (Primary)

Beacon events forwarded to Go service for processing and ClickHouse write.

<100ms
2

VPS Standby

If Go primary is down, pixel events forwarded to VPS warm standby.

<2 min
3

R2 Dead Letter Queue

Last resort: pixel events buffered to R2 DLQ. Auto-replay every 60s.

~60s recovery
📡 S2S Postbacks — 3-Tier Failover
1

Go Ingest (Primary)

S2S postbacks from AppsFlyer/Branch. HMAC verified, enriched, written to ClickHouse.

<200ms
2

Warm Standby VPS

Always-on receiver on separate server. Buffers events to disk as JSON files. Replayed after recovery.

<2 min failover
3

R2 Dead Letter Queue

If both primary and standby are down, events buffered to R2 DLQ. Auto-replay every 60s.

~60s recovery
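All three chains share the same recovery pattern: a 60-second cycle that lists pending DLQ objects and replays them to the ingest service. A minimal Go sketch, with listDLQ, replay, and remove as hypothetical placeholders for the R2 listing and Go-ingest POST:

package main

import (
    "fmt"
    "time"
)

// replayCycle lists pending DLQ objects and replays each one to the ingest
// service, removing an object only after its replay succeeds.
func replayCycle(listDLQ func() []string, replay func(string) error, remove func(string)) {
    for _, key := range listDLQ() {
        if err := replay(key); err != nil {
            fmt.Printf("replay of %s failed, will retry next cycle: %v\n", key, err)
            continue // keep the object so the next cycle retries it
        }
        remove(key)
    }
}

// replayLoop runs one cycle on a fixed interval (60s in the setup described above).
func replayLoop(interval time.Duration, listDLQ func() []string, replay func(string) error, remove func(string)) {
    for range time.Tick(interval) {
        replayCycle(listDLQ, replay, remove)
    }
}

func main() {
    pending := []string{"dlq/clicks/example-event.json"} // hypothetical object key
    replayCycle(
        func() []string { return pending },
        func(key string) error { fmt.Println("replaying", key); return nil },
        func(key string) { pending = nil },
    )
}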
🔁 AppsFlyer Pull API — Hourly Batch

Hourly Pull (Primary)

Systemd timer pulls all events from AppsFlyer API every 60 minutes. Replaces S2S for most data.

60 min interval
💾

AppsFlyer Data Retention

AppsFlyer retains all raw data for 90 days. If our pull fails, we can always re-pull historical data.

90-day window
Every Worst-Case Scenario — Covered
We've planned for total server destruction, datacenter fires, provider outages, and more. Here's exactly what happens in each scenario.
🔥 Server catches fire
Dedicated server (ClickHouse, Go, DragonflyDB) completely destroyed.
  1. Edge workers detect health check failure (3 consecutive, 30s apart)
  2. Click wrapper falls back to KV → R2 DLQ chain
  3. Pixel falls back to R2 DLQ
  4. S2S switches to warm standby (buffers to disk)
  5. Spin up new server, restore from R2 daily backup
  6. Replay standby buffer + R2 DLQ events into new ClickHouse
  7. DragonflyDB rebuilt from ClickHouse data
✓ Zero data loss — RTO: 2-4 hours
🔌 Cloudflare goes down
Edge workers, KV, R2, and Pipelines all unavailable.
  1. No clicks, pixels, or API requests reach the system
  2. Go service still running, but no new events arriving
  3. AppsFlyer Pull API continues (direct to Hetzner via tunnel)
  4. When CF recovers, all queued events process normally
  5. Any events during outage still in AppsFlyer (90-day retention)
✓ Zero data loss — data recovered from AppsFlyer
💣 Supabase goes down
Platform DB (config, payments, aggregates) unavailable.
  1. Dashboard and portal temporarily unavailable
  2. ClickHouse continues receiving and storing ALL events normally
  3. DragonflyDB identity graph continues working
  4. Events keep flowing — nothing lost
  5. When Supabase recovers, aggregator syncs 7-day lookback window
✓ Zero data loss — events buffered in ClickHouse
⚡ Go service crashes
Ingest service stops accepting events.
  1. Systemd auto-restarts service within 5 seconds
  2. During restart: clicks → KV fallback → R2 DLQ
  3. During restart: pixels → R2 DLQ
  4. Health check runs every 60 seconds, auto-remediation restarts service, alerts on failure
  5. On restart, Go recovers KV and R2 DLQ events automatically
✓ Zero data loss — RTO: <10 seconds
💾 ClickHouse disk failure
One NVMe drive dies.
  1. RAID-1 continues operating on the surviving drive
  2. No data loss, no downtime — transparent failover
  3. Health check detects degraded RAID, sends alert
  4. Replace failed drive, RAID rebuilds automatically
✓ Zero data loss, zero downtime
🔐 DragonflyDB crashes
Identity graph and real-time counters lost from memory.
  1. Systemd restarts DragonflyDB immediately
  2. Identity graph rebuilt from ClickHouse events (scripted)
  3. Events continue flowing to ClickHouse unaffected
  4. Fraud scores temporarily unavailable (events still stored)
  5. Full rebuild takes ~30 minutes for 90 days of data
⚠ Temporary degradation — zero data loss
Automated Daily + Weekly Backups to R2
ClickHouse data is automatically exported and uploaded to Cloudflare R2 (11 nines durability) on a scheduled basis. No manual intervention required.
Every 60s
💚

Health Check + Auto-Remediation

4 probes (ClickHouse, DragonflyDB, R2, Supabase). Auto-restarts failed services after 3 consecutive failures. Alerts via email.

Every 10 min
🔄

Aggregator Sync

ClickHouse aggregated data synced to Supabase partner_daily_aggregates. 7-day lookback window.

Daily 4:00 AM
💾

Incremental Backup

Last 7 days of events + all aggregate tables exported and uploaded to R2 via rclone. Typical size: ~20KB compressed.

Weekly (Sun)
🗃️

Full Backup

Complete export of ALL tables: events, clicks, partner_daily_aggregates, audience_members, audience_segments, user_features, campaign_costs. Uploaded to R2.

Every 60 min
🔁

AppsFlyer Pull

Full event data pulled from AppsFlyer API. Serves as independent external backup — 90-day retention on AppsFlyer's side.

📋 Backup Verification (actual log output)

$ crontab -l
# Health check now runs inside Go service every 60s (self-healing monitor)
0 4 * * *   /opt/relo-ingest/backup-clickhouse.sh --daily
0 4 * * 0   /opt/relo-ingest/backup-clickhouse.sh --full

$ tail -5 /var/log/ch-backup.log
[2026-03-12 04:00:01] Starting daily backup...
[2026-03-12 04:00:02] Exported events (last 7d): 1,247 rows
[2026-03-12 04:00:02] Exported partner_daily_aggregates: 892 rows
[2026-03-12 04:00:03] Uploading to R2: relo-backups/daily/2026-03-12/
[2026-03-12 04:00:04] ✓ Backup complete. Size: 17.6KB
Automated Health Checks & Remediation
The Go ingest service runs a self-healing monitor that probes 4 dependencies every 60 seconds. After 3 consecutive failures, it restarts the failed service and sends an alert email. Recovery notifications include downtime duration.
ClickHouse
💚

Query Probe

SELECT 1 + row count validation. Auto-restarts clickhouse-server after 3 failures.

DragonflyDB
💚

Ping + Read/Write

PING command + test key write/read. Auto-restarts dragonfly service.

R2 Storage
💚

Object Probe

Object list + write test. No auto-restart (external service) — alert only.

Supabase
💚

REST Health + Sync Check

REST API health check + aggregate sync lag verification. Alert if lag exceeds 30 minutes.
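A minimal Go sketch of the probe loop described above (60-second interval, restart and alert after 3 consecutive failures). The probe, restart, and alert functions are placeholders, not the monitor's actual internals:

package main

import (
    "fmt"
    "time"
)

type dependency struct {
    name     string
    probe    func() error // e.g. SELECT 1, PING, R2 object list, REST health
    restart  func() error // nil for external services (alert only)
    failures int
}

// probeAll runs every probe once and remediates any dependency that has just
// reached the consecutive-failure threshold.
func probeAll(deps []*dependency, threshold int, alert func(string)) {
    for _, d := range deps {
        if err := d.probe(); err == nil {
            d.failures = 0
            continue
        }
        d.failures++
        if d.failures < threshold {
            continue
        }
        if d.restart != nil {
            _ = d.restart() // e.g. exec "systemctl restart clickhouse-server"
        }
        alert(fmt.Sprintf("%s failed %d consecutive probes", d.name, d.failures))
        d.failures = 0
    }
}

// monitor runs probeAll on a fixed interval (60s in production).
func monitor(deps []*dependency, interval time.Duration, threshold int, alert func(string)) {
    for range time.Tick(interval) {
        probeAll(deps, threshold, alert)
    }
}

func main() {
    deps := []*dependency{{
        name:    "clickhouse",
        probe:   func() error { return fmt.Errorf("connection refused") }, // simulated outage
        restart: func() error { fmt.Println("restarting clickhouse-server"); return nil },
    }}
    for i := 0; i < 3; i++ { // the third consecutive failure triggers restart + alert
        probeAll(deps, 3, func(msg string) { fmt.Println("ALERT:", msg) })
    }
}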

Six New Reliability Layers (2026-03-19)
Deployed March 19, 2026. These improvements add OS-level process supervision, independent external monitoring, data integrity verification, early-warning DLQ metrics, database backup coverage, and enhanced disaster recovery testing.

🕒 Systemd Watchdog

Go ingest service uses Type=notify with WatchdogSec=120. The service sends heartbeat notifications to systemd every 30 seconds (4x safety margin). If the process hangs (deadlock, infinite loop, memory stall), systemd automatically kills and restarts it within 120 seconds — even if the process is still technically "running." Package: backbone/internal/watchdog/.
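A minimal sketch of such a heartbeat loop using the go-systemd daemon package; the real implementation lives in backbone/internal/watchdog/ and may differ in detail:

package main

import (
    "time"

    "github.com/coreos/go-systemd/v22/daemon"
)

func main() {
    // Tell systemd (Type=notify) that startup has completed.
    _, _ = daemon.SdNotify(false, daemon.SdNotifyReady)

    // WatchdogSec=120 in the unit file; a heartbeat every 30s gives a 4x margin.
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        // If this loop ever stalls, the heartbeats stop and systemd kills and
        // restarts the service within WatchdogSec.
        _, _ = daemon.SdNotify(false, daemon.SdNotifyWatchdog)
    }
}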

🌐 External Health Monitor

A Cloudflare Worker cron job pings the backbone /health endpoint every hour — completely independent from the Go process itself. If both the primary backbone and VPS standby are unreachable, an alert email is sent. This catches scenarios where the Go self-healing monitor can't report its own failure.

🔎 WAL Integrity Verification

GET /wal/verify samples recent R2 WAL files, decompresses them from gzip, and validates JSON structure of every line. Catches silent corruption (bit rot, truncated writes, compression errors) before it matters. If any file fails validation, the health endpoint reflects degraded status.
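A minimal Go sketch of the per-file check (gunzip, then validate every NDJSON line); fetching the object from R2 is omitted:

package main

import (
    "bufio"
    "bytes"
    "compress/gzip"
    "encoding/json"
    "fmt"
)

// verifyWALObject gunzips one WAL object and confirms every line is valid JSON.
func verifyWALObject(compressed []byte) error {
    zr, err := gzip.NewReader(bytes.NewReader(compressed))
    if err != nil {
        return fmt.Errorf("gzip corrupt: %w", err)
    }
    defer zr.Close()

    scanner := bufio.NewScanner(zr)
    scanner.Buffer(make([]byte, 0, 1<<20), 1<<20) // allow long event lines
    line := 0
    for scanner.Scan() {
        line++
        if !json.Valid(scanner.Bytes()) {
            return fmt.Errorf("invalid JSON on line %d", line)
        }
    }
    return scanner.Err() // surfaces truncated writes mid-stream
}

func main() {
    fmt.Println(verifyWALObject([]byte("not gzip"))) // demo: reports gzip corruption
}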

📈 DLQ Depth Monitoring

The /health endpoint now exposes real-time Dead Letter Queue depth counters for all three event sources: clicks, S2S postbacks, and pixel events. A growing DLQ depth is an early warning that the primary path is failing and events are accumulating in the fallback queue. Enables proactive intervention before data delivery lag becomes visible in dashboards.
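A minimal sketch of how per-source depth counters could be surfaced on a health handler. The field names and the plain net/http handler are assumptions; the actual /health response shape is not shown here:

package main

import (
    "encoding/json"
    "net/http"
    "sync/atomic"
)

// Per-source counters for events currently parked in the fallback queues.
var (
    dlqClicks atomic.Int64
    dlqS2S    atomic.Int64
    dlqPixel  atomic.Int64
)

// healthHandler reports overall status plus DLQ depth per event source.
// A depth that keeps growing means the primary path is degraded.
func healthHandler(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    _ = json.NewEncoder(w).Encode(map[string]any{
        "status": "ok",
        "dlq_depth": map[string]int64{
            "clicks": dlqClicks.Load(),
            "s2s":    dlqS2S.Load(),
            "pixel":  dlqPixel.Load(),
        },
    })
}

func main() {
    dlqPixel.Add(1) // example: one pixel event waiting for the 60s replay cycle
    http.HandleFunc("/health", healthHandler)
    _ = http.ListenAndServe(":8080", nil)
}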

💾 Supabase Daily Backup + PITR

Supabase built-in daily backups plus Point-in-Time Recovery (PITR) protect all configuration tables: clients, partners, commissions, budgets, payments, user profiles, and AI knowledge base. PITR enables recovery to any second within the retention window. This complements the ClickHouse/R2 backup strategy by covering the platform database layer.

🔧 Enhanced DR Drills

The disaster recovery test script drp-test.sh now validates WAL integrity (decompresses + checks JSON) and verifies DLQ depth counters are at zero during automated DR tests. This ensures that both the write-ahead log and dead letter queues are functioning correctly — not just that services respond to health checks.

🛠 HA Stack Summary (as of March 2026)

Layer | Mechanism | Catches
Process hang | Systemd watchdog (120s) | Deadlocks, infinite loops, memory stalls
Process crash | Systemd auto-restart (5s) | Panics, OOM kills, segfaults
Dependency failure | Self-healing monitor (60s probes) | ClickHouse, DragonflyDB, R2, Supabase down
Total backbone down | External CF Worker cron (hourly) | Network outage, datacenter fire, full server loss
Silent data corruption | WAL integrity verification | Bit rot, truncated writes, gzip corruption
Failover queue buildup | DLQ depth counters on /health | Degraded primary path before visible impact
Config DB loss | Supabase PITR + daily backups | Accidental deletes, schema corruption
Recovery validation | drp-test.sh (WAL + DLQ checks) | Backup integrity failures, stale DLQ events
Four Layers of Authentication
Every request to RELO passes through multiple authentication checks. All authentication follows a fail-closed pattern — if in doubt, reject.

🔒 Layer 1: HMAC-SHA256 Webhook Signatures

All S2S postbacks from AppsFlyer are verified using HMAC-SHA256 signatures. The Go middleware computes HMAC(body, secret) and compares against the X-AF-Signature header.

Fail-closed: If the secret is not configured, ALL requests are rejected with 503. No bypass possible.
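A minimal Go sketch of that verification step. The X-AF-Signature header name and HMAC-SHA256 over the raw body follow the description above; the hex encoding and the example secret are assumptions for illustration:

package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// verifySignature recomputes HMAC-SHA256(body, secret) and compares it to the
// hex signature taken from the X-AF-Signature header, using a constant-time check.
func verifySignature(body []byte, secret, signature string) bool {
    if secret == "" {
        return false // fail-closed: unconfigured secret rejects everything (503 upstream)
    }
    mac := hmac.New(sha256.New, []byte(secret))
    mac.Write(body)
    expected := hex.EncodeToString(mac.Sum(nil))
    return hmac.Equal([]byte(expected), []byte(signature))
}

func main() {
    body := []byte(`{"event":"install"}`)
    secret := "example-secret" // illustrative only
    mac := hmac.New(sha256.New, []byte(secret))
    mac.Write(body)
    sig := hex.EncodeToString(mac.Sum(nil))
    fmt.Println(verifySignature(body, secret, sig)) // true
}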

🔑 Layer 2: Bearer Token Authentication

Admin API endpoints protected by Bearer tokens verified against environment secrets. Authorization: Bearer <token> header required.

Fail-closed: Empty token = all requests rejected. No development bypass.

👤 Layer 3: JWT + Role-Based Access Control

Frontend and API use Supabase Auth JWTs with 4-tier RBAC: admin, client_manager, partner, viewer. Each role has strict data isolation.

Partners never see revenue, only their commission amounts. Multi-tenant isolation enforced at query level.

🛡️ Layer 4: Row-Level Security (PostgreSQL)

Supabase RLS policies enforce data isolation at the database level. Even if application code has a bug, the database rejects unauthorized access.

Audit logs track every admin action with user_id, action, resource, ip_address.

5-Signal Fraud Scoring Engine
Every event is scored for fraud risk using 5 independent signals. Scored in 2 microseconds using compiled XGBoost models (Timber → C99 → Go via cgo).

CTIT Analysis

Click-to-Install Time analysis. Our own click timestamp (from click wrapper) vs conversion time. Catches click injection and click flooding.

🌐 Geo Mismatch

Click geo vs install geo comparison. Flags VPN-based fraud where clicks come from different countries than actual installs.

📱 Device Fingerprint

Cross-references device properties, OS version, screen resolution. Detects emulators, device farms, and spoofed device IDs.

📈 Velocity Checks

Clicks and installs per device per hour/day. Abnormal burst patterns indicate bot activity or click flooding attacks.

🧠 ML Ensemble

XGBoost model trained on historical fraud data. Combines all signals into final 0-255 score. >200 = flagged, >240 = blocked.

Fraud Score Scale

0-200: Normal traffic, processed normally
201-240: FLAGGED for manual review
241-255: BLOCKED, excluded from commissions
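A minimal Go sketch that applies the documented bands to a score; the function name and returned labels are illustrative:

package main

import "fmt"

// classify maps the ensemble's 0-255 fraud score onto the documented bands:
// 0-200 normal, 201-240 flagged for manual review, 241-255 blocked.
func classify(score uint8) string {
    switch {
    case score > 240:
        return "blocked" // excluded from commissions
    case score > 200:
        return "flagged" // queued for manual review
    default:
        return "normal"
    }
}

func main() {
    for _, s := range []uint8{37, 215, 250} {
        fmt.Printf("score %d -> %s\n", s, classify(s))
    }
}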

🛡️ Zero Data Loss Guarantee

Every event is persisted to at least 2 independent systems before acknowledgment. Even total server destruction results in zero data loss with full recovery in under 4 hours.

<50ms
Edge Latency P99
0
Data Loss (any scenario)
8.6x
Data Compression
99.99%
Uptime Target
<2 min
Failover Time
3
Redundant Storage Systems