"Apoorve didn't just fix our data infrastructure — he rebuilt how our entire organization thinks about data. The platform he designed became our competitive advantage."
The Challenge
A rapidly growing fintech company processing over $500M in monthly transactions had accumulated four years of technical debt in their data infrastructure. Their analytics stack consisted of 200+ fragile ETL pipelines written as Python scripts scattered across S3 buckets, with no governance, documentation, or ownership.
The consequences were severe:
- Data scientists spent 70% of their time on data cleaning, not modeling
- Finance and product teams had conflicting metrics on the same KPIs
- A critical fraud model was running on 48-hour-old data, creating material risk
- Infrastructure costs had grown 3x year-over-year with no corresponding business value
The Approach
Phase 1: Assessment and Architecture Design (Weeks 1–4)
Conducted a comprehensive data audit across all systems: source databases, existing pipelines, downstream consumers, and ML models in production. Mapped data flows, identified single points of failure, and quantified the cost of each bottleneck.
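A dependency audit like this can be sketched in plain Python: model the data flows as a producer-to-consumer map, then flag producers that are the sole input to some consumer. The pipeline names and edges below are hypothetical, and "sole feeder" is a deliberately simplified notion of a single point of failure.

```python
from collections import defaultdict

# Hypothetical data-flow map: each producer lists the consumers it feeds.
flows = {
    "stripe_db": ["payments_etl"],
    "payments_etl": ["revenue_report", "fraud_features"],
    "crm_etl": ["revenue_report"],
    "fraud_features": ["fraud_model"],
}

def sole_feeders(flows):
    """Flag producers that are the only input to some consumer:
    if such a producer fails, that consumer receives no data at all."""
    in_edges = defaultdict(list)
    for producer, consumers in flows.items():
        for consumer in consumers:
            in_edges[consumer].append(producer)
    return sorted({producers[0]
                   for producers in in_edges.values()
                   if len(producers) == 1})

print(sole_feeders(flows))  # ['fraud_features', 'payments_etl', 'stripe_db']
```

In this toy graph only `revenue_report` has redundant inputs; everything feeding the fraud model is a sole feeder, which mirrors why the fraud pipeline was the highest-risk bottleneck.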
Delivered an architecture proposal centered on a Databricks lakehouse with Delta Lake, dbt for transformation, and Dagster for orchestration — replacing the bespoke pipeline maze.
Phase 2: Foundation Build (Weeks 5–12)
Established the core platform:
- Migrated source data ingestion to Fivetran (managed connectors for Stripe, Salesforce, PostgreSQL)
- Deployed Delta Lake on S3 with medallion architecture (Bronze/Silver/Gold layers)
- Implemented dbt project structure with 150+ models, tests, and documentation
- Built Dagster DAGs to orchestrate all pipelines with alerting and SLA monitoring
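The Bronze/Silver/Gold layering can be illustrated with a minimal, dependency-free Python sketch; the records and field names are hypothetical, and a real implementation would run as Delta Lake tables transformed by dbt models rather than in-memory lists.

```python
from datetime import datetime

# Hypothetical raw events as they might land, untouched, in the Bronze layer.
bronze = [
    {"txn_id": "t1", "amount": "120.50", "ts": "2024-03-01T10:00:00"},
    {"txn_id": "t2", "amount": "bad",    "ts": "2024-03-01T11:00:00"},
    {"txn_id": "t1", "amount": "120.50", "ts": "2024-03-01T10:00:00"},  # duplicate
]

def to_silver(rows):
    """Silver: enforce types and deduplicate on txn_id, dropping bad rows."""
    seen, out = set(), []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # a real pipeline would quarantine malformed rows
        if r["txn_id"] in seen:
            continue
        seen.add(r["txn_id"])
        out.append({"txn_id": r["txn_id"], "amount": amount,
                    "ts": datetime.fromisoformat(r["ts"])})
    return out

def to_gold(rows):
    """Gold: a business-level aggregate, here daily transaction volume."""
    totals = {}
    for r in rows:
        day = r["ts"].date().isoformat()
        totals[day] = totals.get(day, 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
print(to_gold(silver))  # {'2024-03-01': 120.5}
```

The point of the layering is that each hop has one job: Bronze preserves raw history, Silver owns cleaning rules, Gold owns business definitions, so conflicting KPI logic has exactly one place to live.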
Phase 3: Domain Migration (Weeks 13–20)
Migrated business domains in order of impact: Finance first (revenue, P&L), then Growth (funnel analytics), then Risk (fraud signals). Each migration included knowledge transfer to domain owners and analyst training on self-service access.
Phase 4: ML Platform and Real-Time (Weeks 21–28)
Built MLflow-backed model registry and feature store. Rebuilt the fraud detection pipeline on Kafka + Flink, cutting the lag between a transaction and its fraud score from 48 hours of batch delay to under 200ms. Deployed automated model monitoring with drift detection.
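One common drift-detection technique is the Population Stability Index (PSI) over binned score distributions; the source doesn't specify which method was used, so the sketch below, with hypothetical bin proportions, is illustrative only.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions,
    each given as a list of bin proportions summing to 1."""
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

baseline = [0.25, 0.25, 0.25, 0.25]  # score distribution at training time
live     = [0.40, 0.30, 0.20, 0.10]  # hypothetical live-traffic distribution

score = psi(baseline, live)
# Common rule of thumb: PSI > 0.2 signals significant distribution drift.
print(round(score, 3), "drift" if score > 0.2 else "stable")
```

Monitoring would compute this on a rolling window of live scores and page the owning team when the threshold is crossed, rather than waiting for fraud losses to surface the problem.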
Results
| Metric | Before | After |
|---|---|---|
| Pipeline failure rate | ~40% weekly | <5% weekly |
| Time from data to insight | 10–14 days | 2–4 hours |
| Data infrastructure cost | $2.1M/yr | $900K/yr |
| Fraud detection latency | 48 hours | <200ms |
| Self-service analytics users | 12 | 38 |
Key Lessons
Governance before tooling: The biggest leverage wasn’t the new stack — it was establishing data ownership. Each domain now has a designated data owner responsible for quality and documentation.
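Ownership can be made enforceable rather than aspirational with a manifest checked in CI. The datasets, team names, and schema below are hypothetical; in practice such metadata might live in dbt model `meta` config or a version-controlled YAML file.

```python
# Hypothetical ownership manifest, one entry per published dataset.
OWNERS = {
    "gold.revenue_daily": {
        "owner": "finance-data",
        "description": "Daily recognized revenue by product line",
    },
    "gold.funnel_conversion": {
        "owner": "growth-data",
        "description": "",  # missing docs -> should fail the check
    },
}

def unowned_or_undocumented(manifest):
    """Return datasets missing an owner or a description, for CI to reject."""
    return sorted(name for name, meta in manifest.items()
                  if not meta.get("owner") or not meta.get("description"))

print(unowned_or_undocumented(OWNERS))  # ['gold.funnel_conversion']
```

Gating merges on a check like this is what turns "each domain has a data owner" from a slide into a property every new dataset must satisfy.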
Migration sequencing matters: Starting with Finance built trust and executive support for the broader program. Early wins with high-visibility stakeholders accelerated organizational adoption.
Real-time is a product decision, not a data engineering default: The real-time fraud pipeline was justified by clear ROI (fraud loss reduction). Most other pipelines remained batch — the right call for their actual latency requirements.