How RELO guarantees zero data loss, sub-50ms latency, and automatic recovery from any failure — including total server destruction.
t.relo.mx — ULID generation, first-party cookie, 302 redirect. <50ms P99.
p.relo.mx — 143-line JS loader. Beacon API with device fingerprinting.
s2s.relo.mx — HMAC verification, TLS 1.3 termination, DDoS protection.
relo-api — 250+ endpoints, JWT auth, CORS, rate limiting.
6,684 lines, 16 packages. fasthttp, batch writer, identity resolution, fraud scoring.
Columnar analytics engine. 35-field schema, 8.6x compression, 24-month TTL. 12 tables incl. 6 MVs.
Identity graph, fraud cache, real-time counters. Sub-ms lookups. 29x faster than Redis.
50+ tables. Config, partners, payments, aggregates. Constant-size forever.
Data lake, CSV archive, backups. 11 nines durability. Zero egress fees.
Meta CAPI, Google Ads, TikTok API. Real-time purchase event forwarding.
| Step | What Happens | Storage Written | Latency |
|---|---|---|---|
| 1. Event arrives | Click, pixel beacon, or S2S postback hits Cloudflare edge worker | Edge memory (in-flight) | <5ms |
| 2. Edge processing | ULID generated, cookies set, geo headers captured, forwarded to Go service | — | <10ms |
| 3. R2 WAL | Event written to R2 write-ahead log as gzipped NDJSON before processing. Key format: wal/events/YYYY/MM/DD/HH/{ulid}.ndjson.gz. Guarantees zero data loss even if Go service crashes mid-flight. Cost: ~$2-3/mo. | R2 (11 nines durability) | <20ms |
| 4. Go ingest | HMAC verification, GeoIP enrichment, identity resolution, fraud scoring (2µs ML) | — | <5ms |
| 5. Batch write | Event buffered, flushed every 1,000 events or 1 second (whichever first) | ClickHouse (NVMe RAID-1) | <1s |
| 6. Dual-write | Purchase events also written to Supabase for real-time dashboard queries | Supabase (Cloud PostgreSQL) | <100ms |
| 7. Identity update | Device profile updated with new event, cross-device links maintained | DragonflyDB (in-memory, 90-day TTL) | <1ms |
| 8. Aggregation | ClickHouse MVs auto-refresh every 5 min. Aggregator syncs to Supabase every 10 min (partner_daily_aggregates, geo_daily_aggregates, media_source_daily, hourly_stats). Dashboard reads from aggregates; Orders tab uses direct ClickHouse proxy via Go endpoint. | Supabase (4 aggregate tables) | 5-10 min |
| 9. DLQ recovery | Dead Letter Queue checks every 60s for failed events and auto-replays them to the Go service. R2 replay tool available for full disaster recovery. | R2 DLQ | 60s cycle |
| 10. DSP export | Purchase events batched and sent to Meta CAPI within 5 seconds | Meta servers (external) | <5s |
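To make step 3 concrete, here is a minimal Go sketch of the write-ahead-log path, assuming an S3-compatible R2 client hidden behind a hypothetical walStore interface; the Event fields are illustrative placeholders, and only the gzipped-NDJSON format and the key layout wal/events/YYYY/MM/DD/HH/{ulid}.ndjson.gz come from the table above.

```go
// Illustrative step-3 WAL write: serialize the event as NDJSON, gzip it, and
// persist it under the documented key layout before any further processing.
// walStore and the Event fields are hypothetical placeholders.
package wal

import (
	"bytes"
	"compress/gzip"
	"context"
	"encoding/json"
	"fmt"
	"time"
)

// Event is a trimmed-down, illustrative view of an ingest event.
type Event struct {
	ULID      string    `json:"ulid"`
	Type      string    `json:"type"`
	Timestamp time.Time `json:"timestamp"`
}

// walStore abstracts the R2 bucket (any S3-compatible client would do).
type walStore interface {
	Put(ctx context.Context, key string, body []byte) error
}

// WriteWAL persists the event as gzipped NDJSON using the documented key
// format wal/events/YYYY/MM/DD/HH/{ulid}.ndjson.gz.
func WriteWAL(ctx context.Context, store walStore, ev Event) error {
	line, err := json.Marshal(ev)
	if err != nil {
		return fmt.Errorf("marshal event: %w", err)
	}

	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	if _, err := gz.Write(append(line, '\n')); err != nil { // NDJSON: one object per line
		return err
	}
	if err := gz.Close(); err != nil {
		return err
	}

	key := fmt.Sprintf("wal/events/%s/%s.ndjson.gz",
		ev.Timestamp.UTC().Format("2006/01/02/15"), ev.ULID)
	return store.Put(ctx, key, buf.Bytes())
}
```

Because this write happens before processing, a crash later in the pipeline can always be repaired by replaying the WAL objects for the affected hour.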
At step 3, the event is in the R2 WAL (11 nines). At step 5, it exists in ClickHouse (RAID-1 NVMe). At step 6, purchases also exist in Supabase (cloud-replicated). At step 8, aggregated data feeds the dashboard from 4 aggregate tables (partner_daily_aggregates, geo_daily_aggregates, media_source_daily, hourly_stats). Step 9 auto-recovers any failed events from the DLQ. Daily backups copy everything to R2.
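Step 5's buffering rule (flush every 1,000 events or every 1 second, whichever comes first) is a standard batch-writer pattern. The sketch below is illustrative only: the flush callback stands in for the real ClickHouse bulk insert, and the Event type is a placeholder.

```go
// Illustrative step-5 batch writer: buffer incoming events and flush when the
// buffer reaches 1,000 events or 1 second has passed, whichever comes first.
// The flush callback stands in for the real ClickHouse bulk insert.
package ingest

import (
	"context"
	"time"
)

// Event is an illustrative placeholder for the ingest payload.
type Event map[string]any

type BatchWriter struct {
	in    chan Event
	flush func(batch []Event) // e.g. a bulk INSERT into the ClickHouse events table
}

func NewBatchWriter(flush func([]Event)) *BatchWriter {
	return &BatchWriter{in: make(chan Event, 4096), flush: flush}
}

// Add enqueues an event for the next flush.
func (w *BatchWriter) Add(ev Event) { w.in <- ev }

// Run drains the channel into a buffer, flushing on size or on the 1s ticker.
func (w *BatchWriter) Run(ctx context.Context) {
	const maxBatch = 1000
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	buf := make([]Event, 0, maxBatch)
	flushNow := func() {
		if len(buf) > 0 {
			w.flush(buf)
			buf = make([]Event, 0, maxBatch)
		}
	}

	for {
		select {
		case ev := <-w.in:
			buf = append(buf, ev)
			if len(buf) >= maxBatch {
				flushNow()
			}
		case <-ticker.C:
			flushNow()
		case <-ctx.Done():
			flushNow() // drain whatever is buffered before shutting down
			return
		}
	}
}
```

Flushing once more on shutdown means a graceful restart does not drop the tail of the buffer; an ungraceful crash is covered by the WAL from step 3.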
Direct POST to Go service on dedicated server. Normal operation path.
If Go is down, click data written to KV with 24-hour TTL. Go recovers it on restart.
Last resort: event written as JSON to R2 object store. Permanent, 11 nines durability.
Beacon events forwarded to Go service for processing and ClickHouse write.
If Go primary is down, pixel events forwarded to VPS warm standby.
Last resort: pixel events buffered to R2 DLQ. Auto-replay every 60s.
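The 60-second auto-replay cycle used by these last-resort paths (and by step 9 in the flow table) can be pictured as a simple loop: list the buffered objects, re-POST each one to the ingest endpoint, and delete only on success. The dlqStore interface, the dlq/pixel/ prefix, and the success criterion below are assumptions for illustration.

```go
// Hedged sketch of the 60-second DLQ replay cycle: list buffered objects,
// re-POST each to the ingest endpoint, and delete an object only after a
// successful replay. dlqStore, the prefix, and ingestURL are illustrative.
package dlq

import (
	"bytes"
	"context"
	"net/http"
	"time"
)

type dlqStore interface {
	List(ctx context.Context, prefix string) ([]string, error)
	Get(ctx context.Context, key string) ([]byte, error)
	Delete(ctx context.Context, key string) error
}

// ReplayLoop retries failed events every 60 seconds until the context ends.
func ReplayLoop(ctx context.Context, store dlqStore, ingestURL string) {
	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			keys, err := store.List(ctx, "dlq/pixel/") // hypothetical prefix
			if err != nil {
				continue // try again on the next cycle
			}
			for _, key := range keys {
				body, err := store.Get(ctx, key)
				if err != nil {
					continue
				}
				resp, err := http.Post(ingestURL, "application/json", bytes.NewReader(body))
				if err != nil {
					continue // leave the event in the DLQ for the next cycle
				}
				resp.Body.Close()
				if resp.StatusCode < 300 {
					_ = store.Delete(ctx, key) // replayed successfully
				}
			}
		}
	}
}
```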
S2S postbacks from AppsFlyer/Branch. HMAC verified, enriched, written to ClickHouse.
Always-on receiver on separate server. Buffers events to disk as JSON files. Replayed after recovery.
If both primary and standby are down, events buffered to R2 DLQ. Auto-replay every 60s.
Systemd timer pulls all events from AppsFlyer API every 60 minutes. Replaces S2S for most data.
AppsFlyer retains all raw data for 90 days. If our pull fails, we can always re-pull historical data.
4 probes (ClickHouse, DragonflyDB, R2, Supabase). Auto-restarts failed services after 3 consecutive failures. Alerts via email.
ClickHouse aggregated data synced to Supabase partner_daily_aggregates. 7-day lookback window.
Last 7 days of events + all aggregate tables exported and uploaded to R2 via rclone. Typical size: ~20KB compressed.
Complete export of ALL tables: events, clicks, partner_daily_aggregates, audience_members, audience_segments, user_features, campaign_costs. Uploaded to R2.
Full event data pulled from AppsFlyer API. Serves as independent external backup — 90-day retention on AppsFlyer's side.
$ crontab -l
# Health check now runs inside Go service every 60s (self-healing monitor)
0 4 * * * /opt/relo-ingest/backup-clickhouse.sh --daily
0 4 * * 0 /opt/relo-ingest/backup-clickhouse.sh --full
$ tail -5 /var/log/ch-backup.log
[2026-03-12 04:00:01] Starting daily backup...
[2026-03-12 04:00:02] Exported events (last 7d): 1,247 rows
[2026-03-12 04:00:02] Exported partner_daily_aggregates: 892 rows
[2026-03-12 04:00:03] Uploading to R2: relo-backups/daily/2026-03-12/
[2026-03-12 04:00:04] ✓ Backup complete. Size: 17.6KB
SELECT 1 + row count validation. Auto-restarts clickhouse-server after 3 failures.
PING command + test key write/read. Auto-restarts dragonfly service.
Object list + write test. No auto-restart (external service) — alert only.
REST API health check + aggregate sync lag verification. Alert if lag exceeds 30 minutes.
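Taken together, the four probes follow one pattern: check every 60 seconds, restart (or alert only, for external services) after 3 consecutive failures, then reset the counter. A hedged sketch, with the check and restart actions supplied as callbacks rather than the real probe code:

```go
// Illustrative self-healing probe loop: run a check every 60 seconds and
// trigger a restart (e.g. "systemctl restart clickhouse-server") only after
// 3 consecutive failures. Check and Restart are caller-supplied callbacks.
package monitor

import (
	"context"
	"log"
	"time"
)

type Probe struct {
	Name    string
	Check   func(ctx context.Context) error // e.g. SELECT 1 + row count validation
	Restart func() error                    // nil for external services (alert only)
}

func RunProbe(ctx context.Context, p Probe) {
	const failureThreshold = 3
	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()

	failures := 0
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := p.Check(ctx); err != nil {
				failures++
				log.Printf("%s probe failed (%d/%d): %v", p.Name, failures, failureThreshold, err)
				if failures >= failureThreshold {
					if p.Restart != nil {
						_ = p.Restart()
					}
					// An email alert would also be sent at this point.
					failures = 0
				}
				continue
			}
			failures = 0 // healthy check resets the streak
		}
	}
}
```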
Go ingest service uses Type=notify with WatchdogSec=120. The service sends heartbeat notifications to systemd every 30 seconds (4x safety margin). If the process hangs (deadlock, infinite loop, memory stall), systemd automatically kills and restarts it within 120 seconds — even if the process is still technically "running." Package: backbone/internal/watchdog/.
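A minimal sketch of that heartbeat, assuming the github.com/coreos/go-systemd dependency; the actual backbone/internal/watchdog/ package may be structured differently.

```go
// Minimal systemd watchdog heartbeat sketch: with Type=notify and
// WatchdogSec=120, the service must ping systemd regularly or be killed and
// restarted. Pinging at a quarter of the watchdog window (30s for 120s)
// matches the 4x safety margin described above.
package watchdog

import (
	"context"
	"time"

	"github.com/coreos/go-systemd/v22/daemon"
)

func Heartbeat(ctx context.Context) error {
	window, err := daemon.SdWatchdogEnabled(false) // reads WATCHDOG_USEC set by systemd
	if err != nil || window == 0 {
		return err // watchdog not enabled for this unit
	}

	ticker := time.NewTicker(window / 4) // 30s when WatchdogSec=120
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			// Tell systemd the process is alive; if these notifications stop,
			// systemd kills and restarts the service within WatchdogSec.
			_, _ = daemon.SdNotify(false, daemon.SdNotifyWatchdog)
		}
	}
}
```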
A Cloudflare Worker cron job pings the backbone /health endpoint every hour — completely independent from the Go process itself. If both the primary backbone and VPS standby are unreachable, an alert email is sent. This catches scenarios where the Go self-healing monitor can't report its own failure.
GET /wal/verify samples recent R2 WAL files, decompresses them from gzip, and validates JSON structure of every line. Catches silent corruption (bit rot, truncated writes, compression errors) before it matters. If any file fails validation, the health endpoint reflects degraded status.
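The core of such a check is small: gunzip the sampled object and confirm that every NDJSON line parses as JSON. The sketch below shows that validation step only; sampling, R2 access, and health-status reporting are omitted.

```go
// Illustrative core of a /wal/verify check: decompress a sampled WAL object
// and validate that every NDJSON line is well-formed JSON. Any failure marks
// the file as corrupt (bit rot, truncated write, bad compression).
package walverify

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
)

// VerifyWALObject returns an error if the gzipped NDJSON payload is not
// fully readable and valid, line by line.
func VerifyWALObject(compressed []byte) error {
	gz, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return fmt.Errorf("gzip open: %w", err) // corrupt or truncated header
	}
	defer gz.Close()

	scanner := bufio.NewScanner(gz)
	lineNo := 0
	for scanner.Scan() {
		lineNo++
		if !json.Valid(scanner.Bytes()) {
			return fmt.Errorf("line %d: invalid JSON", lineNo)
		}
	}
	if err := scanner.Err(); err != nil {
		return fmt.Errorf("read: %w", err) // truncated or corrupt gzip stream
	}
	return nil
}
```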
The /health endpoint now exposes real-time Dead Letter Queue depth counters for all three event sources: clicks, S2S postbacks, and pixel events. A growing DLQ depth is an early warning that the primary path is failing and events are accumulating in the fallback queue. Enables proactive intervention before data delivery lag becomes visible in dashboards.
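A sketch of what such a /health payload could look like, with illustrative field names and in-process atomic counters standing in for the real DLQ depth sources:

```go
// Sketch of a /health handler exposing DLQ depth counters for the three
// event sources. Field names and the in-process counters are illustrative;
// the point is that a growing depth signals a failing primary path early.
package health

import (
	"encoding/json"
	"net/http"
	"sync/atomic"
)

// DLQDepth holds live counters, incremented on DLQ writes and decremented on
// successful replays elsewhere in the service.
type DLQDepth struct {
	Clicks atomic.Int64
	S2S    atomic.Int64
	Pixel  atomic.Int64
}

func (d *DLQDepth) Handler() http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]any{
			"status": "ok",
			"dlq_depth": map[string]int64{
				"clicks": d.Clicks.Load(),
				"s2s":    d.S2S.Load(),
				"pixel":  d.Pixel.Load(),
			},
		})
	}
}
```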
Supabase built-in daily backups plus Point-in-Time Recovery (PITR) protect all configuration tables: clients, partners, commissions, budgets, payments, user profiles, and AI knowledge base. PITR enables recovery to any second within the retention window. This complements the ClickHouse/R2 backup strategy by covering the platform database layer.
The disaster recovery test script drp-test.sh now validates WAL integrity (decompresses + checks JSON) and verifies DLQ depth counters are at zero during automated DR tests. This ensures that both the write-ahead log and dead letter queues are functioning correctly — not just that services respond to health checks.
| Layer | Mechanism | Catches |
|---|---|---|
| Process hang | Systemd watchdog (120s) | Deadlocks, infinite loops, memory stalls |
| Process crash | Systemd auto-restart (5s) | Panics, OOM kills, segfaults |
| Dependency failure | Self-healing monitor (60s probes) | ClickHouse, DragonflyDB, R2, Supabase down |
| Total backbone down | External CF Worker cron (hourly) | Network outage, datacenter fire, full server loss |
| Silent data corruption | WAL integrity verification | Bit rot, truncated writes, gzip corruption |
| Failover queue buildup | DLQ depth counters on /health | Degraded primary path before visible impact |
| Config DB loss | Supabase PITR + daily backups | Accidental deletes, schema corruption |
| Recovery validation | drp-test.sh (WAL + DLQ checks) | Backup integrity failures, stale DLQ events |
All S2S postbacks from AppsFlyer are verified using HMAC-SHA256 signatures. The Go middleware computes HMAC(body, secret) and compares against X-AF-Signature header.
Fail-closed: If the secret is not configured, ALL requests are rejected with 503. No bypass possible.
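A hedged sketch of that fail-closed check, shown in plain net/http style for brevity (the backbone itself uses fasthttp); the hex encoding of the signature is an assumption, while the X-AF-Signature header, HMAC-SHA256, and the 503 fail-closed behavior come from the text above.

```go
// Illustrative fail-closed HMAC middleware: compute HMAC-SHA256 over the raw
// body with the shared secret and compare it in constant time against the
// X-AF-Signature header. An unset secret rejects everything with 503.
package authmw

import (
	"bytes"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"io"
	"net/http"
)

func VerifyHMAC(secret string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if secret == "" {
			// Fail-closed: no configured secret means no request is accepted.
			http.Error(w, "signature verification unavailable", http.StatusServiceUnavailable)
			return
		}

		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		r.Body = io.NopCloser(bytes.NewReader(body)) // restore the body for downstream handlers

		mac := hmac.New(sha256.New, []byte(secret))
		mac.Write(body)
		expected := hex.EncodeToString(mac.Sum(nil)) // assumed encoding

		// hmac.Equal performs a constant-time comparison.
		if !hmac.Equal([]byte(expected), []byte(r.Header.Get("X-AF-Signature"))) {
			http.Error(w, "invalid signature", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```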
Admin API endpoints protected by Bearer tokens verified against environment secrets. Authorization: Bearer <token> header required.
Fail-closed: Empty token = all requests rejected. No development bypass.
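A companion sketch for the Bearer check, again fail-closed; the ADMIN_API_TOKEN variable name is hypothetical, and the constant-time comparison is an assumption about the implementation.

```go
// Illustrative admin Bearer token check: read the expected token from an
// environment secret, compare in constant time, and fail closed when the
// secret is empty. The environment variable name is a placeholder.
package authmw

import (
	"crypto/subtle"
	"net/http"
	"os"
	"strings"
)

func RequireBearer(next http.Handler) http.Handler {
	expected := os.Getenv("ADMIN_API_TOKEN") // hypothetical variable name
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if expected == "" {
			// Fail-closed: an empty token rejects every request, no dev bypass.
			http.Error(w, "admin auth not configured", http.StatusServiceUnavailable)
			return
		}
		got := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		if subtle.ConstantTimeCompare([]byte(got), []byte(expected)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```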
Frontend and API use Supabase Auth JWTs with 4-tier RBAC: admin, client_manager, partner, viewer. Each role has strict data isolation.
Partners never see revenue — only their commission amounts. Multi-tenant isolation enforced at query level.
Supabase RLS policies enforce data isolation at the database level. Even if application code has a bug, the database rejects unauthorized access.
Audit logs track every admin action with user_id, action, resource, ip_address.
Click-to-Install Time analysis. Our own click timestamp (from click wrapper) vs conversion time. Catches click injection and click flooding.
Click geo vs install geo comparison. Flags VPN-based fraud where clicks come from different countries than actual installs.
Cross-references device properties, OS version, screen resolution. Detects emulators, device farms, and spoofed device IDs.
Clicks and installs per device per hour/day. Abnormal burst patterns indicate bot activity or click flooding attacks.
XGBoost model trained on historical fraud data. Combines all signals into final 0-255 score. >200 = flagged, >240 = blocked.
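The scoring thresholds above translate into simple decision logic. The sketch below shows that mapping plus one input signal (the click-to-install time check); the 10-second floor is an assumed example value rather than the production threshold, and the XGBoost model itself is of course not reproduced here.

```go
// Illustrative decision logic around the fraud scores described above: the
// model output is a 0-255 score, with >200 flagged for review and >240
// blocked. SuspiciousCTIT shows one contributing signal.
package fraud

import "time"

type Verdict string

const (
	VerdictAllow Verdict = "allow"
	VerdictFlag  Verdict = "flag"  // score > 200
	VerdictBlock Verdict = "block" // score > 240
)

// SuspiciousCTIT flags click-to-install times that are implausibly short,
// a common marker of click injection. The 10s floor is an assumed example.
func SuspiciousCTIT(clickAt, installAt time.Time) bool {
	return installAt.Sub(clickAt) < 10*time.Second
}

// Decide maps the final 0-255 fraud score to an action.
func Decide(score uint8) Verdict {
	switch {
	case score > 240:
		return VerdictBlock
	case score > 200:
		return VerdictFlag
	default:
		return VerdictAllow
	}
}
```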
Every event is persisted to at least 2 independent systems before acknowledgment. Even total server destruction results in zero data loss with full recovery in under 4 hours.