Infrastructure

Data Pipeline & High Availability

Zero-data-loss architecture with an R2 write-ahead log (WAL), 3-tier failover, DLQ auto-recovery, and self-healing infrastructure. Every event is persisted to at least 2 independent systems before acknowledgment.

0
Data Loss (any scenario)
11
Nines Durability (R2)
<2 min
Auto-Failover RTO
$2-3
WAL Cost/month
5
Durability Layers
8.6x
Compression Ratio

1. R2 Write-Ahead Log (WAL)


How the WAL Works

Every event is written to Cloudflare R2 (11 nines of durability) before being sent to ClickHouse. Events are batched into gzipped NDJSON files with a time-based key structure. If ClickHouse or the Go service is down, events are safely buffered in R2 and replayed on recovery.

// WAL key format
wal/events/YYYY/MM/DD/HH/{ulid}.ndjson.gz

// Example
wal/events/2026/03/19/14/01JQABCDEF123456789.ndjson.gz

// Each file contains ~100-1000 events as newline-delimited JSON
// Compressed with gzip for ~5x size reduction

Event Durability Chain

// Every event passes through this chain:
Event arrives → R2 WAL (persisted, 11 nines) → ClickHouse (NVMe RAID-1)
                                                    → Supabase (dual-write, purchases)
                                                    → DragonflyDB (identity cache)

// Even if Go crashes mid-flight, the event is already in R2
// R2 replay tool reconstructs any missing events

R2 Replay Tool

For disaster recovery, the R2 replay tool can reconstruct the entire ClickHouse database from WAL data. It reads gzipped NDJSON files from R2, decompresses them, and re-ingests them into ClickHouse with deduplication.

// Replay events from a specific date range
replay --from 2026-03-18 --to 2026-03-19

// Replay all events (full disaster recovery)
replay --from 2026-01-01 --to 2026-03-19

// Deduplication via event_id ensures no double-counting
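The decompress-and-dedupe core of a replay pass can be sketched in a few lines of Go. This is a simplified model under stated assumptions — `gunzip`, `replayBatch`, and the single-field `event` struct are illustrative; the real tool tracks many more fields and writes to ClickHouse rather than returning a slice:

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
)

// event carries only the dedup key; real WAL events have many more fields.
type event struct {
	EventID string `json:"event_id"`
}

// gunzip decompresses one WAL object body before replay.
func gunzip(gz []byte) ([]byte, error) {
	r, err := gzip.NewReader(bytes.NewReader(gz))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

// replayBatch parses NDJSON and returns only events not seen before,
// updating the seen set — the event_id guard that prevents double-counting.
func replayBatch(ndjson []byte, seen map[string]bool) []event {
	var fresh []event
	sc := bufio.NewScanner(bytes.NewReader(ndjson))
	for sc.Scan() {
		var e event
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil || seen[e.EventID] {
			continue // skip malformed lines and already-ingested events
		}
		seen[e.EventID] = true
		fresh = append(fresh, e)
	}
	return fresh
}

func main() {
	// Two WAL files overlap on event "b" — the second pass ingests only "c".
	seen := map[string]bool{}
	f1 := []byte(`{"event_id":"a"}` + "\n" + `{"event_id":"b"}`)
	f2 := []byte(`{"event_id":"b"}` + "\n" + `{"event_id":"c"}`)
	fmt.Println(len(replayBatch(f1, seen)), len(replayBatch(f2, seen))) // 2 1
}
```

Because the dedup key travels with every event, overlapping date ranges in the `--from`/`--to` flags are harmless.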

WAL Cost Estimate

At 25M events/month, the WAL generates ~2-3 GB of compressed data per month. R2 pricing: $0.015/GB-month for storage plus $4.50 per million Class A writes. Total: $2-3/month for a complete zero-data-loss guarantee.
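A back-of-envelope version of that estimate, under explicit assumptions: ~500 events per WAL file, roughly a year of retained WAL (storage is cumulative until files are pruned), and R2's published list prices at the time of writing — $0.015/GB-month storage and $4.50 per million Class A operations, which is what PUTs bill as. Check current pricing before relying on the constants:

```go
package main

import "fmt"

// walMonthlyCost estimates the monthly WAL bill.
// storedGB is cumulative retained WAL data; putsMillions is Class A PUTs
// per month in millions. Rates are assumptions, not guaranteed current.
func walMonthlyCost(storedGB, putsMillions float64) float64 {
	const storagePerGBMonth = 0.015 // $/GB-month, R2 storage
	const classAPerMillion = 4.50   // $/million Class A ops (PUT)
	return storedGB*storagePerGBMonth + putsMillions*classAPerMillion
}

func main() {
	// 25M events/month at ~500 events/file ≈ 0.05M PUTs;
	// a year of retained WAL at ~2.5 GB/month ≈ 30 GB stored.
	fmt.Printf("≈ $%.2f/month\n", walMonthlyCost(30, 0.05))
}
```

Even with a full year of retained WAL files, the estimate lands comfortably inside the ~$2-3/month budget quoted above.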

2. 3-Tier Failover Architecture

All 3 edge workers (Click Wrapper, Web Pixel, S2S Proxy) implement the same 3-tier failover pattern. If the primary Go ingest service is down, events are never lost.

// 3-Tier Failover — all edge workers

Tier 1: Hetzner Go Ingest (primary)
  ingest.relo.mx via Cloudflare Tunnel
  CPX41: 16GB RAM, 8 vCPU
  ClickHouse + DragonflyDB co-located
  ↓ health check fails (3x in 90s)

Tier 2: VPS Warm Standby (DigitalOcean)
  Always-on Go receiver on separate server
  Buffers events to disk as JSON files
  Bidirectional HA with Hetzner primary
  ↓ standby also unreachable

Tier 3: R2 Dead Letter Queue (DLQ)
  Events written as JSON to R2 object store
  11 nines durability, permanent storage
  Auto-replay loops recover events

Per-Worker Failover Details

Each edge worker follows the same 3-tier pattern with slightly different recovery intervals:

Click Wrapper (t.relo.mx/c/:code)
  Tier 1: Go Ingest (<50ms)
  Tier 2: KV fallback (24h TTL, ~30s recovery)
  Tier 3: R2 DLQ (auto-replay every 30s)

Web Pixel (p.relo.mx)
  Tier 1: Go Ingest (<100ms)
  Tier 2: VPS Standby
  Tier 3: R2 DLQ (auto-replay every 60s)

S2S Proxy (s2s.relo.mx)
  Tier 1: Go Ingest (<200ms)
  Tier 2: VPS Warm Standby
  Tier 3: R2 DLQ (auto-replay every 60s)

3. DLQ Auto-Recovery Loops

Each DLQ has its own recovery loop that periodically checks for pending events and replays them to the Go ingest service. Recovery is fully automatic, with no manual intervention.

// DLQ Recovery Loop Intervals

S2S DLQ:    every 60s — checks for failed S2S postbacks
Pixel DLQ:  every 60s — checks for failed pixel events
Click DLQ:  every 30s — checks for failed click events
R2 WAL:     every 60s — checks for un-processed WAL files

// Recovery flow
DLQ check → list pending objects → POST to Go ingest → delete on success
                                    → retry on failure (next cycle)

4. Five Layers of Data Durability

Layer 1: R2 WAL          — Write-ahead log, 11 nines durability, gzipped NDJSON
Layer 2: ClickHouse      — NVMe RAID-1, 24-month TTL, ZSTD 8.6x compression
                          12 tables including 6 materialized views
Layer 3: Supabase        — Cloud PostgreSQL, dual-write for purchases
                          4 aggregate tables synced every 10 min
Layer 4: VPS Standby     — Bidirectional HA with Hetzner primary
Layer 5: Daily Backups   — Incremental to R2, full weekly export

RPO: ~0 (real-time WAL)  |  RTO: <2 min (auto-failover)

5. Dashboard Aggregate Pipeline

ClickHouse materialized views auto-refresh every 5 minutes. The Go aggregator syncs them to 4 Supabase tables every 10 minutes with a 7-day lookback window. The admin dashboard reads exclusively from these aggregate tables, never from raw product_sales.

// 4 Aggregate Tables (ClickHouse → Supabase every 10 min)

1. partner_daily_aggregates  — per partner/segment/day: units, orders, revenue, commission
2. geo_daily_aggregates      — per city/region/day: units, orders, revenue
3. media_source_daily        — per media source/day: units, orders, revenue
4. hourly_stats              — per hour-of-day: units, orders (for heatmaps)

// Orders tab uses direct ClickHouse proxy (Go endpoint)
// All other dashboard tabs read from Supabase aggregates

6. Current System Stats (March 2026)

157K+
Events in ClickHouse
12
ClickHouse Tables
6
Materialized Views
8.6x
Compression Ratio
10 min
Aggregator Sync Cycle
$2-3
WAL Monthly Cost