Open Data Pipeline
A transparent, open data pipeline that processes African data with privacy at its core.
Open Data Manifesto
“African data should serve African communities first. Every data point collected is anonymized and made available as open data, powering research, policy, and innovation across the continent.”
Pipeline Stages
Data flows through three stages, each with a clear responsibility and technology choice.
Redpanda
Kafka-compatible event streaming platform written in C++. Every meaningful event (weather observations, campsite views, payments, posts) flows through it as an event stream. Supabase and CouchDB publish their change feeds into topics, and downstream systems consume them independently. This decouples the architecture: operational databases do not need to know about the analytics layer.
License: BSL / Apache-2.0 (independent)
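As a rough illustration of the event stream described above, the following sketch wraps an event in a JSON envelope and picks a topic for it. The topic names, envelope fields, and the `build_record` helper are all assumptions made for this example, not the project's actual schema, and no broker connection is made.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical topic names, one per event type mentioned in the text.
TOPICS = {
    "weather_observation": "events.weather",
    "campsite_view": "events.campsite",
    "payment": "events.payment",
    "post": "events.post",
}


def build_record(event_type, payload, source):
    """Return (topic, key, value) ready to hand to a Kafka-compatible producer."""
    topic = TOPICS[event_type]  # unknown event types raise KeyError on purpose
    envelope = {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,
        "source": source,  # e.g. the change feed the event came from
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    key = event_type.encode("utf-8")
    value = json.dumps(envelope, separators=(",", ":")).encode("utf-8")
    return topic, key, value


topic, key, value = build_record("weather_observation", {"temp_c": 24.5}, "couchdb")
```

Because Redpanda speaks the Kafka protocol, the resulting topic, key, and value could be passed to any Kafka client's produce call. Consumers only need to agree on the envelope, which is what keeps the operational databases unaware of the analytics layer.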
Apache Flink
Consumes the raw event stream and applies transformations in real time before data reaches the open data layer. The most important transformation is the privacy filter, which strips all personal identifiers (user IDs, IP addresses, device fingerprints), aggregates individual events into anonymized summaries, enriches records with geographic and temporal metadata, and outputs clean, safe, open data. Because personal information never enters the open data layer, that layer is structurally incapable of leaking it.
License: Apache-2.0 (apache-foundation)
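The logic of the privacy filter can be sketched in plain Python. This is not Flink code, and the field names (`user_id`, `ip_address`, `device_fingerprint`, `district`, `occurred_at`) and the hourly bucketing are assumptions chosen for illustration; a real job would express the same steps as stream operators.

```python
from collections import Counter
from datetime import datetime

# Assumed names of the fields that carry personal identifiers.
PII_FIELDS = {"user_id", "ip_address", "device_fingerprint"}


def strip_pii(event):
    """Return a copy of the event without any personal identifier fields."""
    return {k: v for k, v in event.items() if k not in PII_FIELDS}


def enrich(event):
    """Replace the exact timestamp with an hourly bucket (temporal metadata)."""
    ts = datetime.fromisoformat(event["occurred_at"])
    out = {k: v for k, v in event.items() if k != "occurred_at"}
    out["hour_bucket"] = ts.strftime("%Y-%m-%dT%H:00")
    return out


def summarise(events):
    """Aggregate individual events into counts per (type, district, hour)."""
    counts = Counter()
    for raw in events:
        e = enrich(strip_pii(raw))
        counts[(e["event_type"], e["district"], e["hour_bucket"])] += 1
    return [
        {"event_type": t, "district": d, "hour_bucket": h, "event_count": n}
        for (t, d, h), n in sorted(counts.items())
    ]


raw_events = [
    {"event_type": "campsite_view", "user_id": "u1", "ip_address": "203.0.113.7",
     "district": "district-a", "occurred_at": "2024-05-01T09:15:00"},
    {"event_type": "campsite_view", "user_id": "u2", "device_fingerprint": "abc",
     "district": "district-a", "occurred_at": "2024-05-01T09:40:00"},
]
summary = summarise(raw_events)
```

Only the aggregated rows in `summary` would be written downstream; the two individual views collapse into a single count with no identifier attached.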
Apache Doris
Real-time analytical database originally built by Baidu and later donated to the Apache Software Foundation. Designed for super-app data scale: billions of rows, complex analytical queries, and real-time ingestion. Its columnar storage format means analytical queries read only the columns they need. It ingests the processed, anonymized event stream from Flink in real time, and exposes a MySQL-compatible SQL interface for researchers and a public REST API for developers.
License: Apache-2.0 (apache-foundation)
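To give a feel for the kind of question a researcher might ask over the SQL interface, here is a small sketch. It uses Python's built-in `sqlite3` purely as a local stand-in so the example runs without a server; the `event_summary` table and its columns are hypothetical, and against the real system the same SELECT would be sent through a MySQL-compatible client instead.

```python
import sqlite3

# In-memory SQLite database standing in for the analytical layer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_summary ("
    " event_type TEXT, district TEXT, hour_bucket TEXT, event_count INTEGER)"
)
conn.executemany(
    "INSERT INTO event_summary VALUES (?, ?, ?, ?)",
    [
        ("campsite_view", "district-a", "2024-05-01T09:00", 12),
        ("campsite_view", "district-a", "2024-05-01T10:00", 8),
        ("campsite_view", "district-b", "2024-05-01T09:00", 5),
        ("payment", "district-a", "2024-05-01T09:00", 3),
    ],
)

# Total campsite views per district, busiest first. The query touches only
# three columns, which is the access pattern columnar storage favours.
QUERY = """
SELECT district, SUM(event_count) AS total_views
FROM event_summary
WHERE event_type = 'campsite_view'
GROUP BY district
ORDER BY total_views DESC
"""
rows = conn.execute(QUERY).fetchall()
```

Every row in the table is already an anonymized aggregate, so a query like this cannot return anything about an individual user.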
Privacy Boundary
All personally identifiable information (PII) is stripped at the Apache Flink stage before data enters the analytical layer.
Before Boundary
Raw events with user context, device IDs, and precise location. Processed in memory only and never persisted with PII.
After Boundary
Anonymized, aggregated data points. Geographic precision reduced to district level. No individual can be identified.
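Reducing geographic precision to district level could look roughly like the sketch below. The rectangular district boundaries, the `District` class, and the `lat`/`lon` field names are illustrative assumptions; real district boundaries are polygons and would need a proper geospatial lookup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class District:
    name: str
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat, lon):
        return (self.min_lat <= lat < self.max_lat
                and self.min_lon <= lon < self.max_lon)


# Hypothetical districts modelled as simple bounding boxes.
DISTRICTS = [
    District("district-a", -18.0, -17.5, 30.5, 31.5),
    District("district-b", -17.5, -17.0, 30.5, 31.5),
]


def to_district(lat, lon) -> Optional[str]:
    """Return the name of the district containing the point, if any."""
    for d in DISTRICTS:
        if d.contains(lat, lon):
            return d.name
    return None


def generalise(event):
    """Replace exact coordinates with a district name; drop unmappable events."""
    district = to_district(event["lat"], event["lon"])
    if district is None:
        return None
    out = {k: v for k, v in event.items() if k not in ("lat", "lon")}
    out["district"] = district
    return out


safe = generalise({"event_type": "weather_observation", "lat": -17.83, "lon": 31.05})
```

Events that cannot be mapped to a district are dropped here rather than passed through with raw coordinates, so exact positions never cross the boundary.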
Pipeline Flow
Events (apps, sensors, APIs)
       │
       ▼
┌──────────────┐
│   Redpanda   │  Event streaming (Kafka-compatible)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Apache Flink │  Stream processing + PII stripping
└──────┬───────┘
       │  ← Privacy boundary (PII removed here)
       ▼
┌──────────────┐
│ Apache Doris │  Analytical database (open data)
└──────────────┘