Infrastructure Architecture

How RELO guarantees zero data loss, sub-50ms latency, and automatic recovery from any failure — including total server destruction.

● Zero Data Loss
● <50ms Edge Latency
● <2 min Recovery
● 99.99% Uptime
Three-Tier Architecture
Every event passes through three independent layers. Each layer can fail independently without losing a single event.
Tier 1 — Edge Network (300+ Global PoPs)
🔗 Click Wrapper — t.relo.mx: ULID generation, first-party cookie, 302 redirect. <50ms P99.
🔎 Web Pixel — p.relo.mx: 143-line JS loader. Beacon API with device fingerprinting.
📡 S2S Proxy — s2s.relo.mx: HMAC verification, TLS 1.3 termination, DDoS protection.
API Gateway — relo-api: 250+ endpoints, JWT auth, CORS, rate limiting.
3-TIER FAILOVER: Primary → Standby → Object Store DLQ
Tier 2 — Processing Engine (Dedicated Server)
⚙️ Go Ingest Service — 6,684 lines, 16 packages. fasthttp, batch writer, identity resolution, fraud scoring.
🗃️ ClickHouse — Columnar analytics engine. 35-field schema, 8x compression, 24-month TTL.
🧠 DragonflyDB — Identity graph, fraud cache, real-time counters. Sub-ms lookups. 29x faster than Redis.

Dual-write + Aggregator sync every 10 min
Tier 3 — Platform & Activation
🏛️ Supabase PostgreSQL — 50+ tables. Config, partners, payments, aggregates. Constant size forever.
📦 Cloudflare R2 — Data lake, CSV archive, backups. 11 nines durability. Zero egress fees.
🎯 DSP Activation — Meta CAPI, Google Ads, TikTok API. Real-time purchase event forwarding.

Event Lifecycle — Click to Dashboard
Every event is persisted to at least 2 independent storage systems before acknowledgment. Here's exactly how data flows through the system.
Step | What Happens | Storage Written | Latency
1. Event arrives | Click, pixel beacon, or S2S postback hits a Cloudflare edge worker | Edge memory (in-flight) | <5ms
2. Edge processing | ULID generated, cookies set, geo headers captured, forwarded to Go service | — | <10ms
3. Go ingest | HMAC verification, GeoIP enrichment, identity resolution, fraud scoring (2µs ML) | — | <5ms
4. Batch write | Event buffered, flushed every 1,000 events or 1 second (whichever comes first) | ClickHouse (NVMe RAID-1) | <1s
5. Dual-write | Purchase events also written to Supabase for real-time dashboard queries | Supabase (cloud PostgreSQL) | <100ms
6. Identity update | Device profile updated with new event, cross-device links maintained | DragonflyDB (in-memory, 90-day TTL) | <1ms
7. Aggregation | ClickHouse materialized views auto-refresh, sync to Supabase every 10 min | Supabase (aggregates table) | 5-10 min
8. DSP export | Purchase events batched and sent to Meta CAPI within 5 seconds | Meta servers (external) | <5s
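
The batch write in step 4 is the main throughput lever, so the flush policy is worth seeing concretely. Below is a minimal Go sketch of a size-or-timeout batcher; Event and the flush stub are simplified placeholders, not the production types.

package main

import (
	"log"
	"time"
)

// Event stands in for the real 35-field ClickHouse row.
type Event struct {
	ULID string
	Name string
}

// batchWriter drains the ingest channel and flushes when either
// 1,000 events are buffered or 1 second has passed, whichever
// comes first (the policy described in step 4).
func batchWriter(in <-chan Event, flush func([]Event) error) {
	const maxBatch = 1000
	buf := make([]Event, 0, maxBatch)
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	doFlush := func() {
		if len(buf) == 0 {
			return
		}
		if err := flush(buf); err != nil {
			// The production service would hand a failed batch to the
			// failover chain; this sketch only logs.
			log.Printf("flush failed: %v", err)
			return
		}
		buf = buf[:0]
	}

	for {
		select {
		case ev, ok := <-in:
			if !ok {
				doFlush() // drain remaining events on shutdown
				return
			}
			buf = append(buf, ev)
			if len(buf) >= maxBatch {
				doFlush()
			}
		case <-ticker.C:
			doFlush()
		}
	}
}

func main() {
	events := make(chan Event)
	go func() {
		events <- Event{ULID: "01HEXAMPLEULID", Name: "click"}
		close(events)
	}()
	batchWriter(events, func(batch []Event) error {
		log.Printf("flushing %d events to ClickHouse (stub)", len(batch))
		return nil
	})
}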

🔒 Data Durability at Every Step

At step 4, the event exists in ClickHouse (RAID-1 NVMe). At step 5, purchase events also exist in Supabase (cloud-replicated). At step 7, aggregated data is in Supabase. Daily backups copy everything to R2 (11 nines).

● 3 independent storage systems
● 11 nines durability (R2)
● RAID-1 mirrored NVMe drives
● 24-month automatic retention
What Happens When Things Break
Every data source has a multi-tier failover chain. If the primary path fails, the next tier catches the event. No event is ever dropped.
🔗 Click Wrapper — 3-Tier Failover
1. Go Ingest (Primary) — Direct POST to the Go service on the dedicated server. Normal operation path. (<50ms)
2. Cloudflare KV (Fallback) — If Go is down, click data is written to KV with a 24-hour TTL. Go recovers it on restart. (~30s recovery)
3. R2 Dead Letter Queue — Last resort: the event is written as JSON to the R2 object store. Permanent, 11 nines durability. (~60s recovery)
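
The click wrapper itself runs as a Cloudflare edge worker, so the production fallback logic is JavaScript; the Go sketch below only illustrates the try-next-tier pattern, with Sink as a hypothetical interface.

package failover

import (
	"errors"
	"fmt"
)

// Sink is a hypothetical event destination: the Go ingest service,
// Cloudflare KV, or the R2 dead letter queue.
type Sink interface {
	Name() string
	Write(event []byte) error
}

// WriteWithFailover walks the tiers in priority order and stops at
// the first success. The chain only fails if every tier fails; with
// R2 as the last tier, that is the durability guarantee.
func WriteWithFailover(event []byte, tiers []Sink) error {
	var errs []error
	for _, tier := range tiers {
		if err := tier.Write(event); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", tier.Name(), err))
			continue // try the next tier
		}
		return nil // persisted at this tier; done
	}
	return errors.Join(errs...)
}
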
🔎 Web Pixel — 2-Tier Failover
1. Go Ingest (Primary) — Beacon events forwarded to the Go service for processing and ClickHouse write. (<100ms)
2. R2 Dead Letter Queue — If Go is unreachable, pixel events are buffered to R2. Recovered on service restore. (~60s recovery)
📡 S2S Postbacks — Primary + Warm Standby
1. Go Ingest (Primary) — S2S postbacks from AppsFlyer/Branch. HMAC verified, enriched, written to ClickHouse. (<200ms)
2. Warm Standby VPS — Always-on receiver on a separate server. Buffers events to disk as JSON files, replayed after recovery. (<2 min failover)
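
The standby's job is deliberately simple: accept, persist, acknowledge. A minimal Go sketch of such a receiver follows; the listen address and buffer path are illustrative, not the production values.

package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"sync"
)

// The warm standby does no enrichment: it only appends each postback
// body to an on-disk JSON-lines file for later replay.
func main() {
	f, err := os.OpenFile("/var/lib/relo-standby/events.jsonl",
		os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		log.Fatal(err)
	}
	var mu sync.Mutex

	http.HandleFunc("/postback", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "unreadable body", http.StatusBadRequest)
			return
		}
		mu.Lock()
		defer mu.Unlock()
		// fsync-per-event is slow but safe; the standby only sees
		// traffic while the primary is down.
		_, werr := f.Write(append(body, '\n'))
		if werr == nil {
			werr = f.Sync()
		}
		if werr != nil {
			http.Error(w, "buffer write failed", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
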
🔁 AppsFlyer Pull API — Hourly Batch
Hourly Pull (Primary) — A systemd timer pulls all events from the AppsFlyer API every 60 minutes. Replaces S2S for most data. (60 min interval)
💾 AppsFlyer Data Retention — AppsFlyer retains all raw data for 90 days. If our pull fails, we can always re-pull historical data. (90-day window)
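
As a sketch, the hourly schedule could be wired with a systemd service/timer pair like the one below; the unit names and script path are illustrative, not copied from production.

# /etc/systemd/system/appsflyer-pull.service (illustrative name)
[Unit]
Description=Pull raw events from the AppsFlyer API

[Service]
Type=oneshot
ExecStart=/opt/relo-ingest/appsflyer-pull.sh

# /etc/systemd/system/appsflyer-pull.timer (illustrative name)
[Unit]
Description=Run the AppsFlyer pull every hour

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target

Persistent=true makes a missed run fire as soon as the machine is back up, which matches the recovery posture described in the scenarios below.
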
Every Worst-Case Scenario — Covered
We've planned for total server destruction, datacenter fires, provider outages, and more. Here's exactly what happens in each scenario.
🔥 Server catches fire
Dedicated server (ClickHouse, Go, DragonflyDB) completely destroyed.
  1. Edge workers detect health check failure (3 consecutive, 30s apart)
  2. Click wrapper falls back to KV → R2 DLQ chain
  3. Pixel falls back to R2 DLQ
  4. S2S switches to warm standby (buffers to disk)
  5. Spin up new server, restore from R2 daily backup
  6. Replay standby buffer + R2 DLQ events into new ClickHouse
  7. DragonflyDB rebuilt from ClickHouse data
✓ Zero data loss — RTO: 2-4 hours
🔌 Cloudflare goes down
Edge workers, KV, R2, and Pipelines all unavailable.
  1. No clicks, pixels, or API requests reach the system
  2. Go service still running, but no new events arriving
  3. AppsFlyer Pull API continues (direct to Hetzner via tunnel)
  4. When CF recovers, all queued events process normally
  5. Any events during outage still in AppsFlyer (90-day retention)
✓ Zero data loss — data recovered from AppsFlyer
💣 Supabase goes down
Platform DB (config, payments, aggregates) unavailable.
  1. Dashboard and portal temporarily unavailable
  2. ClickHouse continues receiving and storing ALL events normally
  3. DragonflyDB identity graph continues working
  4. Events keep flowing — nothing lost
  5. When Supabase recovers, aggregator syncs 7-day lookback window
✓ Zero data loss — events buffered in ClickHouse
⚡ Go service crashes
Ingest service stops accepting events.
  1. Systemd auto-restarts service within 5 seconds
  2. During restart: clicks → KV fallback → R2 DLQ
  3. During restart: pixels → R2 DLQ
  4. Health check runs every 5 minutes, alerts on failure
  5. On restart, Go recovers KV and R2 DLQ events automatically
✓ Zero data loss — RTO: <10 seconds
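
Step 5 of that list, recovering buffered events on restart, amounts to replaying stored JSON through the normal write path. A hedged Go sketch, with the R2 listing, delete, and ingest calls passed in as hypothetical function values:

package recovery

import (
	"encoding/json"
	"log"
)

// Event mirrors the JSON shape buffered in the R2 dead letter queue.
type Event struct {
	ULID string `json:"ulid"`
	Name string `json:"event_name"`
}

// DrainDLQ replays every buffered object through the normal ingest
// path. An object is deleted only after its event is ingested, so a
// crash mid-replay re-processes at most, and never drops, an event.
func DrainDLQ(
	list func() (keys []string, bodies [][]byte, err error), // hypothetical R2 listing
	ingest func(Event) error, // the normal write path
	remove func(key string) error, // hypothetical R2 delete
) error {
	keys, bodies, err := list()
	if err != nil {
		return err
	}
	for i, raw := range bodies {
		var ev Event
		if err := json.Unmarshal(raw, &ev); err != nil {
			log.Printf("skipping malformed DLQ object %s: %v", keys[i], err)
			continue
		}
		if err := ingest(ev); err != nil {
			return err // stop; remaining objects stay in the DLQ
		}
		if err := remove(keys[i]); err != nil {
			return err
		}
	}
	return nil
}
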
💾 ClickHouse disk failure
One NVMe drive dies.
  1. RAID-1 continues operating on the surviving drive
  2. No data loss, no downtime — transparent failover
  3. Health check detects degraded RAID, sends alert
  4. Replace failed drive, RAID rebuilds automatically
✓ Zero data loss, zero downtime
🔐 DragonflyDB crashes
Identity graph and real-time counters lost from memory.
  1. Systemd restarts DragonflyDB immediately
  2. Identity graph rebuilt from ClickHouse events (scripted)
  3. Events continue flowing to ClickHouse unaffected
  4. Fraud scores temporarily unavailable (events still stored)
  5. Full rebuild takes ~30 minutes for 90 days of data
⚠ Temporary degradation — zero data loss
Automated Daily + Weekly Backups to R2
ClickHouse data is automatically exported and uploaded to Cloudflare R2 (11 nines durability) on a scheduled basis. No manual intervention required.
Every 5 min — 💚 Health Check: Verifies that the Go service, ClickHouse, and DragonflyDB are responsive. Alerts on failure via webhook.
Every 10 min — 🔄 Aggregator Sync: ClickHouse aggregated data synced to Supabase partner_daily_aggregates. 7-day lookback window.
Daily 4:00 AM — 💾 Incremental Backup: Last 7 days of events + all aggregate tables exported and uploaded to R2 via rclone. Typical size: ~20KB compressed.
Weekly (Sun) — 🗃️ Full Backup: Complete export of ALL tables: events, clicks, partner_daily_aggregates, audience_members, audience_segments, user_features, campaign_costs. Uploaded to R2.
Every 60 min — 🔁 AppsFlyer Pull: Full event data pulled from the AppsFlyer API. Serves as an independent external backup with 90-day retention on AppsFlyer's side.
📋 Backup Verification (actual log output)

$ crontab -l
*/5 * * * * /opt/relo-ingest/health-check.sh
0 4 * * *   /opt/relo-ingest/backup-clickhouse.sh --daily
0 4 * * 0   /opt/relo-ingest/backup-clickhouse.sh --full

$ tail -5 /var/log/ch-backup.log
[2026-03-12 04:00:01] Starting daily backup...
[2026-03-12 04:00:02] Exported events (last 7d): 1,247 rows
[2026-03-12 04:00:02] Exported partner_daily_aggregates: 892 rows
[2026-03-12 04:00:03] Uploading to R2: relo-backups/daily/2026-03-12/
[2026-03-12 04:00:04] ✓ Backup complete. Size: 17.6KB
Four Layers of Authentication
Every request to RELO passes through multiple authentication checks. All authentication follows a fail-closed pattern — if in doubt, reject.

🔒 Layer 1: HMAC-SHA256 Webhook Signatures

All S2S postbacks from AppsFlyer are verified using HMAC-SHA256 signatures. The Go middleware computes HMAC(body, secret) and compares it against the X-AF-Signature header.

Fail-closed: if the secret is not configured, ALL requests are rejected with 503. No bypass possible.
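
A minimal sketch of that check in Go. The production middleware runs on fasthttp and the signature encoding is not specified above, so this net/http version assumes a hex-encoded signature:

package auth

import (
	"bytes"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"io"
	"net/http"
)

// VerifyHMAC wraps a handler with HMAC-SHA256 verification of the
// request body against the X-AF-Signature header.
func VerifyHMAC(secret []byte, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Fail-closed: no configured secret means nothing gets through.
		if len(secret) == 0 {
			http.Error(w, "signature secret not configured", http.StatusServiceUnavailable)
			return
		}
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "unreadable body", http.StatusBadRequest)
			return
		}
		// Restore the body so the next handler can read it too.
		r.Body = io.NopCloser(bytes.NewReader(body))

		mac := hmac.New(sha256.New, secret)
		mac.Write(body)
		want := hex.EncodeToString(mac.Sum(nil))

		// hmac.Equal is constant-time, avoiding a timing side channel.
		if !hmac.Equal([]byte(want), []byte(r.Header.Get("X-AF-Signature"))) {
			http.Error(w, "invalid signature", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}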

🔑 Layer 2: Bearer Token Authentication

Admin API endpoints are protected by Bearer tokens verified against environment secrets. An Authorization: Bearer <token> header is required.

Fail-closed: an empty token means all requests are rejected. No development bypass.

👤 Layer 3: JWT + Role-Based Access Control

Frontend and API use Supabase Auth JWTs with 4-tier RBAC: admin, client_manager, partner, viewer. Each role has strict data isolation.

Partners never see revenue, only their commission amounts. Multi-tenant isolation is enforced at query level.

🛡️ Layer 4: Row-Level Security (PostgreSQL)

Supabase RLS policies enforce data isolation at the database level. Even if application code has a bug, the database rejects unauthorized access.

Audit logs track every admin action with user_id, action, resource, ip_address.

5-Signal Fraud Scoring Engine
Every event is scored for fraud risk using 5 independent signals. Each event is scored in 2 microseconds using compiled XGBoost models (Timber → C99 → Go via cgo).

CTIT Analysis — Click-to-Install Time analysis: our own click timestamp (from the click wrapper) vs conversion time. Catches click injection and click flooding.
🌐 Geo Mismatch — Click geo vs install geo comparison. Flags VPN-based fraud where clicks come from different countries than the actual installs.
📱 Device Fingerprint — Cross-references device properties, OS version, screen resolution. Detects emulators, device farms, and spoofed device IDs.
📈 Velocity Checks — Clicks and installs per device per hour/day. Abnormal burst patterns indicate bot activity or click flooding attacks.
🧠 ML Ensemble — XGBoost model trained on historical fraud data. Combines all signals into a final 0-255 score. >200 = flagged, >240 = blocked.

Fraud Score Scale
0-200: Normal traffic, processed normally
201-240: Flagged for manual review
241-255: Blocked, excluded from commissions
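
Applying those cutoffs is the last, trivial step once the ensemble emits its score; a minimal Go sketch:

package fraud

// Disposition maps the 0-255 ensemble score onto the actions in the
// scale above.
type Disposition int

const (
	Allow Disposition = iota // 0-200: processed normally
	Flag                     // 201-240: held for manual review
	Block                    // 241-255: excluded from commissions
)

// Classify applies the documented cutoffs: >200 flagged, >240 blocked.
func Classify(score uint8) Disposition {
	switch {
	case score > 240:
		return Block
	case score > 200:
		return Flag
	default:
		return Allow
	}
}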

🛡️ Zero Data Loss Guarantee

Every event is persisted to at least 2 independent systems before acknowledgment. Even total server destruction results in zero data loss, with full recovery in under 4 hours.

● <50ms Edge Latency (P99)
● 0 Data Loss (any scenario)
● 8-10x Data Compression
● 99.99% Uptime Target
● <2 min Failover Time
● 3 Redundant Storage Systems