Production-shaped data engineering portfolio
Ingest to Insight
A full platform case study of how I design trusted data systems: source ownership, CDC, governed batch and streaming, warehouse publishing, dbt Mesh, observability, and safe AI access.
Why this matters to senior readers
It demonstrates platform judgment, not just tool familiarity.
The repo shows how data reliability, governance, developer ergonomics, and consumer trust fit together in a system that a team could realistically operate.
Idempotent, auditable data movement
Atomic object layouts, manifest state transitions, source-target reconciliation, commit markers, and repair paths protect the batch lane from duplicates and partial publishes.
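The commit-marker idea can be illustrated with a minimal sketch. It is not the repo's implementation: the local filesystem stands in for object storage (where rename is generally not atomic), and `publish_batch`, `is_committed`, the `_COMMITTED` marker name, and the `manifest.json` layout are all hypothetical names chosen for illustration.

```python
import hashlib
import json
import os
import shutil
import tempfile
from pathlib import Path
from typing import Dict

COMMIT_MARKER = "_COMMITTED"  # hypothetical marker name


def publish_batch(root: Path, batch_id: str, files: Dict[str, bytes]) -> Path:
    """Stage files, write a checksum manifest, then commit with a marker."""
    final_dir = root / batch_id
    if (final_dir / COMMIT_MARKER).exists():
        # Already committed by an earlier run: rerunning is a no-op.
        return final_dir

    # Stage next to the target so the rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=f".{batch_id}-", dir=root))
    manifest = {"batch_id": batch_id, "files": {}}
    for name, payload in sorted(files.items()):
        (staging / name).write_bytes(payload)
        manifest["files"][name] = hashlib.sha256(payload).hexdigest()
    (staging / "manifest.json").write_text(json.dumps(manifest, sort_keys=True))

    if final_dir.exists():
        # Repair path: a directory without a marker is a partial publish.
        shutil.rmtree(final_dir)
    os.rename(staging, final_dir)
    # The marker is written last; readers ignore directories without it.
    (final_dir / COMMIT_MARKER).write_text("committed")
    return final_dir


def is_committed(root: Path, batch_id: str) -> bool:
    """Readers should only trust a batch directory that carries the marker."""
    return (root / batch_id / COMMIT_MARKER).exists()
```

Because the marker is the last write, a crash at any earlier step leaves a directory that readers skip and that the next run replaces, which is what makes a retry safe.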
Controls where risky operations happen
Dataset pause switches, approvals, maintenance windows, break-glass overrides, lineage, contracts, and audit tables turn governance into runtime behavior.
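One way to picture governance as runtime behavior is a gate that every risky operation calls before it runs. The sketch below is illustrative only: `DatasetControl`, `Decision`, and `evaluate` are hypothetical names, the precedence order is an assumption, and the maintenance window is assumed not to cross midnight.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Optional


@dataclass
class DatasetControl:
    paused: bool = False
    requires_approval: bool = False
    approved_by: Optional[str] = None
    maintenance_start: Optional[time] = None
    maintenance_end: Optional[time] = None


@dataclass
class Decision:
    allowed: bool
    reason: str  # intended to be written to an audit table by the caller


def evaluate(control: DatasetControl, now: datetime,
             break_glass_by: Optional[str] = None) -> Decision:
    """Decide whether an operation on a dataset may run right now."""
    if break_glass_by:
        # Assumed precedence: the emergency override wins, but it is named.
        return Decision(True, f"break-glass override by {break_glass_by}")
    if control.paused:
        return Decision(False, "dataset is paused")
    start, end = control.maintenance_start, control.maintenance_end
    if start is not None and end is not None and start <= now.time() < end:
        return Decision(False, "inside maintenance window")
    if control.requires_approval and not control.approved_by:
        return Decision(False, "approval required")
    return Decision(True, "all controls passed")
```

Returning a reason with every decision, including the allowed ones, is what lets the same check feed an audit trail.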
Metadata-driven platform boundaries
Registry-backed discovery and orchestrator adapters let new pipelines and data products be added through metadata instead of hardcoded API changes.
Trusted analytics and AI access
Postgres-first dbt Mesh marts, public model contracts, certified products, semantic metadata, and a read-only SQL assistant show the path from engineering controls to useful decisions.
Architecture
From operational systems to governed data products.
Retail, banking, and commerce sources feed CDC, batch ingestion, object storage, Spark processing, warehouse serving schemas, dbt Mesh marts, BI, governance APIs, and an AI assistant.
Clear ownership boundaries
Source systems own writes, Debezium captures changes, Airflow and Spark materialize data, dbt owns public analytical contracts, and the control plane exposes metadata without duplicating orchestration logic.
Portable deployment thinking
Compose, Helm, Kustomize, and Terraform assets show how the same platform concerns move from local development into cloud or Kubernetes environments.
Component map
Every layer has a job, contract, and audience.
Source applications
Retail, bank, and commerce services model application-owned writes, protected business endpoints, metrics, and outbox events.
sources/, oltp/mysql/
CDC and streaming
Debezium, Kafka, Schema Registry, and Spark streaming separate raw row capture from business events and downstream readiness.
platform/schema-registry/, pipelines/spark/
Batch runtime
Airflow DAGs and runtime plugins handle watermarks, late data, backfills, contracts, manifests, and atomic publish repair.
pipelines/airflow/
Object lake and warehouse
MinIO layouts, Postgres serving schemas, and Snowflake-ready DDL demonstrate the storage and serving split.
warehouse/, docker-compose.yaml
Contracts and registry
JSON contracts and Git-versioned registry files define schema, SLA, owners, sensitivity, certification, and discovery.
contracts/v1/, registry/
.NET control plane
A metadata-backed API exposes data products, pipelines, health, lineage, alerts, approvals, and workflow delegation.
apps/api/
dbt Mesh
Domain-owned marts, public model access, source freshness, semantic assets, exposures, and certification evidence sit after the governed serving layer.
warehouse/dbt/
AI warehouse assistant
pgvector metadata retrieval, deterministic SQL templates, query validation, read-only execution, JWT roles, audit logging, and feedback loops keep assistant access inside warehouse governance.
apps/ai-assistant/
Production signals
What the implementation makes visible.
The most valuable signal is not the list of tools. It is the way failure modes, review paths, deployment boundaries, and user trust are designed into the platform.
Reliability loop
- Contracts validate rows and split accepted, warning, and rejected records.
- Atomic object layouts publish with manifests, checksums, and commit markers.
- Readiness sensors gate downstream warehouse and dbt work.
- Reconciliation compares source and target row counts, hashes, and sums.
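The reconciliation step in the list above can be sketched as a comparison of three order-independent summaries. This is a minimal illustration, not the repo's code: `summarize`, `reconcile`, and the `amount` column are hypothetical, and it assumes both sides serialize values the same way.

```python
import hashlib
import json
from decimal import Decimal
from typing import Dict, Iterable, List


def summarize(rows: Iterable[Dict], amount_field: str) -> Dict:
    """Return row count, exact sum, and an order-independent content hash."""
    count, total, digest = 0, Decimal("0"), 0
    for row in rows:
        count += 1
        total += Decimal(str(row[amount_field]))
        canonical = json.dumps(row, sort_keys=True, default=str).encode()
        row_hash = hashlib.sha256(canonical).digest()
        # Adding per-row hashes modulo 2**64 ignores row order but still
        # changes when a row is duplicated, dropped, or altered.
        digest = (digest + int.from_bytes(row_hash[:8], "big")) % 2**64
    return {"rows": count, "sum": total, "hash": digest}


def reconcile(source: Iterable[Dict], target: Iterable[Dict],
              amount_field: str = "amount") -> List[str]:
    """Return the names of the metrics that disagree (empty means matched)."""
    src, tgt = summarize(source, amount_field), summarize(target, amount_field)
    return [metric for metric in ("rows", "sum", "hash") if src[metric] != tgt[metric]]
```

A non-empty result would typically block the publish or open a repair path rather than just log a warning.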
Operations loop
- Freshness SLAs and SLO burn-rate rules expose data product health.
- Prometheus, Alertmanager, Grafana, Jaeger, and ELK cover metrics, alerts, traces, and logs.
- Backfill previews estimate rows, runtime, cost, and overwrite risk before execution.
- Restore drills rehearse Postgres and MinIO recovery, Kafka replay, and dbt rebuild validation.
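The burn-rate idea behind the freshness SLOs above can be written as a short calculation. This is a sketch under assumptions: a "bad" event is taken to be a freshness check that missed its SLA, `burn_rate` and `should_page` are hypothetical helpers, and the 14.4 threshold is the commonly cited fast-burn value for a 30-day window, not a figure taken from this repo.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. a 99% target leaves a 1% budget
    return (bad / total) / error_budget


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Multi-window rule: alert only if both windows are burning fast.

    The short window confirms the problem is still happening, which avoids
    paging for an incident that has already recovered.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold
```

For example, 2 missed freshness checks out of 20 against a 99% target gives a burn rate of roughly 10, which is fast but below the assumed paging threshold.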
Governance loop
- Dataset controls handle pause, approval, maintenance, concurrency, and emergency override.
- Lineage payloads publish source, object, target, contract, quality, and reconciliation context.
- Certified dbt products expose owners, SLAs, sensitivity, and reviewer metadata.
- AI-assisted access stays inside approved marts with transparent generated SQL.
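Keeping generated SQL inside approved marts can be sketched as a validation step that runs before execution. The example is deliberately conservative and regex-based. A real validator would more likely use a SQL parser and a read-only database role. `validate_sql` and the mart names in `APPROVED_MARTS` are hypothetical.

```python
import re
from typing import Tuple

# Hypothetical allow-list; in practice this would come from registry metadata.
APPROVED_MARTS = {"marts.daily_sales", "marts.customer_balances"}

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant|revoke|copy|call)\b",
    re.IGNORECASE,
)
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def validate_sql(sql: str) -> Tuple[bool, str]:
    """Accept a single plain SELECT that reads only approved marts."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        return False, "multiple statements are not allowed"
    if not re.match(r"select\b", statement, re.IGNORECASE):
        return False, "only SELECT statements are allowed"
    if FORBIDDEN.search(statement):
        return False, "statement contains a write or DDL keyword"
    tables = {name.lower() for name in TABLE_REF.findall(statement)}
    if not tables:
        return False, "no table reference found"
    unapproved = sorted(tables - APPROVED_MARTS)
    if unapproved:
        return False, f"unapproved relations: {', '.join(unapproved)}"
    return True, "ok"
```

The check errs toward rejection: CTEs and expressions such as `EXTRACT(... FROM ...)` fail validation in this sketch, which is the safer direction for generated queries. The returned reason can be shown next to the generated SQL so the user sees why a query was refused.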
Engineering signal
How this work translates across data, platform, and AI teams.
This project is built to make engineering judgment visible: deciding where contracts live, how runtime controls should be enforced, how observability should map to ownership, and how AI can be introduced without bypassing warehouse governance.
Lead production-minded delivery
Translate ambiguous data platform goals into maintainable services, runtime contracts, validation gates, and deployable assets.
Bridge engineering and analytics
Connect ingestion, CDC, warehouse design, dbt Mesh ownership, BI, and AI-assisted discovery into one operating model.
Design for trust
Make freshness, lineage, approvals, reconciliation, audit, and recovery visible before stakeholders have to ask for them.