Airbyte + dbt + Airflow: Building a Modern, Reliable ELT Stack from Scratch

Build a modern ELT stack with Airbyte, dbt & Airflow: architecture, data flow, best practices, and pitfalls for reliable, scalable analytics.


By Laura Chicovis

IR by training, curious by nature. World and technology enthusiast.

Modern analytics isn’t just about having data; it’s about having trustworthy data that arrives on time, stays consistent, and can evolve as the business evolves. That’s exactly why the “modern ELT stack” has become a go-to blueprint for data teams: extract and load data quickly, then transform it in a transparent, testable way inside the warehouse.

A practical, proven combination for this approach is:

  • Airbyte for extraction + loading (connectors, incremental syncs, CDC where available)
  • dbt for transformations (SQL-based modeling, testing, documentation)
  • Apache Airflow for orchestration (scheduling, dependencies, retries, observability)

This article walks through how to design and implement this stack from scratch, covering architecture, data flow, best practices, and common pitfalls, so you can build an ELT foundation that’s resilient, scalable, and easy to maintain.


What Is a Modern ELT Stack?

ELT vs. ETL (and why ELT wins for analytics)

In ETL, transformations happen before data lands in the warehouse. In ELT, you:

  1. Extract from sources (apps, databases, APIs)
  2. Load raw data into a warehouse/lakehouse quickly
  3. Transform inside the warehouse using scalable compute and SQL-based logic

ELT is especially effective when using cloud data warehouses (Snowflake, BigQuery, Redshift, Databricks SQL, Postgres for smaller setups) because:

  • loading raw data fast preserves fidelity
  • transformations are version-controlled and reproducible
  • compute can scale as needs grow

Why Airbyte + dbt + Airflow Works So Well Together

Airbyte: standardize ingestion without reinventing connectors

Airbyte is designed to simplify ingestion across many sources. Typical benefits include:

  • a large connector catalog (databases, SaaS tools, files)
  • incremental sync patterns for efficiency
  • schema evolution handling (depending on destination + setup)
  • normalization options (though many teams prefer dbt for downstream modeling)

dbt: transform with software engineering discipline

dbt makes transformations:

  • modular (models build on each other)
  • testable (unique, not null, accepted values, relationships, custom tests)
  • documented (auto-generated lineage graphs + docs)
  • deployable (CI checks, environments, versioning)

Airflow: production-grade orchestration

Airflow is ideal when you need:

  • complex dependencies (multiple sources feeding shared models)
  • retries and failure handling
  • SLAs and alerting hooks
  • a single “control plane” for pipelines beyond ingestion and dbt (ML jobs, exports, reverse ETL triggers, etc.)

Reference Architecture: How the Stack Fits Together

A clean design is usually a three-layer warehouse structure:

1) Raw layer (Bronze)

  • Airbyte loads data “as-is” into a raw schema
  • tables are typically source-aligned (e.g., raw_shopify.orders, raw_salesforce.leads)
  • minimal assumptions; prioritize completeness and traceability

2) Staging layer (Silver)

  • dbt creates staging models that:
      • standardize column names/types
      • handle deduplication
      • apply basic cleaning
      • define stable primary keys
  • these models create consistent building blocks across sources

3) Marts layer (Gold)

  • dbt creates business-facing models:
      • facts and dimensions
      • KPI tables
      • curated datasets for BI tools
  • logic is business-aligned, documented, and tested

Orchestration flow (Airflow)

A common DAG pattern is:

  1. Trigger or run Airbyte sync(s)
  2. Validate load completion (and optionally row counts / freshness checks)
  3. Run dbt (staging → marts)
  4. Run dbt tests
  5. Publish artifacts (docs, exposures) and notify stakeholders on failures
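
As a concrete illustration, here is a minimal Airflow DAG sketch of that pattern. It assumes the Airbyte provider package is installed and that a dbt project lives at /opt/dbt/project; the Airbyte connection ID and paths are placeholders, not a prescribed layout:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="elt_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    # 1–2. Trigger the Airbyte connection that loads the raw layer and wait for completion
    airbyte_sync = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync_raw",
        airbyte_conn_id="airbyte_default",           # Airflow connection to the Airbyte API
        connection_id="REPLACE-WITH-CONNECTION-ID",  # placeholder Airbyte connection ID
    )

    # 3. Run dbt models (staging -> marts), then 4. run dbt tests
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/project && dbt run",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt/project && dbt test",
    )

    airbyte_sync >> dbt_run >> dbt_test
```

The validation and publish steps from the list above would slot in as additional tasks around the dbt commands.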

Step-by-Step: Building the Stack from Scratch

1) Define sources, destinations, and data contracts

Before tools, define the “shape” of the platform:

  • Sources: production DBs (Postgres/MySQL), SaaS tools (Stripe, HubSpot), event streams, files
  • Destination: Snowflake/BigQuery/Redshift/Databricks/Postgres
  • Contracts (lightweight but powerful; see the sketch below):
      • expected freshness (e.g., hourly, daily)
      • key fields and uniqueness expectations
      • PII handling rules (masking, access control, retention)
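
One lightweight way to make these contracts concrete is to capture them as versioned code that lives next to the pipeline. The sketch below is purely illustrative; the field names and the Stripe example are assumptions, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class DataContract:
    source: str                  # e.g. "stripe"
    expected_freshness: str      # e.g. "hourly", "daily"
    primary_keys: list[str]      # fields expected to be unique and not null
    pii_fields: list[str] = field(default_factory=list)  # columns requiring masking or restricted access

# Example contract for a billing source (values are illustrative)
stripe_invoices = DataContract(
    source="stripe",
    expected_freshness="hourly",
    primary_keys=["invoice_id"],
    pii_fields=["customer_email"],
)
```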

Practical tip: start with the top 2–3 business-critical sources and prove the pattern before scaling connector count.


2) Set up Airbyte for extraction + loading

Choose a sync strategy

Airbyte supports different approaches depending on connector/source:

  • Full refresh: simplest, but can be expensive at scale
  • Incremental: based on cursor fields like updated_at (common for SaaS APIs)
  • CDC (Change Data Capture): best for databases when supported, enabling low-latency incremental replication

Organize raw schemas intentionally

A useful convention:

  • raw_ schemas or datasets
  • tables named predictably: raw_salesforce.account, raw_postgres.public_users

This makes lineage and debugging dramatically easier.

Make schema evolution predictable

Schema drift is inevitable, especially with SaaS tools. To reduce breakage:

  • treat the raw layer as append-only when possible
  • keep transformations tolerant (select explicit columns in dbt staging, and deliberately add new fields)
  • monitor for new/removed columns to avoid silent failures downstream
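
A simple way to monitor for drift is to compare the warehouse’s information_schema against the columns your staging models expect. The sketch below assumes a Postgres-compatible warehouse reachable via psycopg2; the table, schema, and column names are illustrative:

```python
import psycopg2

# Columns the staging model selects explicitly; anything else is drift to review.
EXPECTED_COLUMNS = {"id", "name", "created_date", "last_modified_date"}

def check_column_drift(conn, schema: str, table: str) -> None:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT column_name
            FROM information_schema.columns
            WHERE table_schema = %s AND table_name = %s
            """,
            (schema, table),
        )
        actual = {row[0] for row in cur.fetchall()}

    added = actual - EXPECTED_COLUMNS    # new fields: add to staging deliberately
    removed = EXPECTED_COLUMNS - actual  # removed fields: will break staging models
    if added or removed:
        raise RuntimeError(f"Schema drift in {schema}.{table}: added={added}, removed={removed}")

conn = psycopg2.connect("dbname=warehouse user=analytics")  # placeholder connection
check_column_drift(conn, "raw_salesforce", "account")
```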

3) Build dbt foundations: staging, marts, tests, docs

Recommended dbt project structure

A clear structure pays off quickly:

  • models/staging/<source>/...
  • models/intermediate/... (optional, for shared transformations)
  • models/marts/<domain>/... (finance, product, growth, etc.)

Start with staging models that “normalize reality”

Staging models typically:

  • rename columns to consistent conventions (snake_case)
  • cast types explicitly
  • deduplicate on keys
  • define surrogate_key when needed

Example staging goals:

  • stg_salesforce_lead has consistent identifiers and timestamps
  • stg_stripe_invoice has stable monetary fields and currency handling

Add tests early (before your warehouse becomes a swamp)

High-value tests include:

  • unique and not_null on primary keys
  • relationship tests between facts and dimensions
  • accepted values tests for enum-like fields
  • freshness checks (via dbt packages or your orchestration layer)
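
If you implement freshness in the orchestration layer rather than through a dbt package, it can be as small as comparing the newest cursor value against the agreed freshness window. A minimal sketch, assuming a Postgres-compatible warehouse and a timezone-aware updated_at cursor column (names and connection details are placeholders):

```python
from datetime import datetime, timedelta, timezone

import psycopg2

def assert_freshness(conn, table: str, cursor_column: str, max_lag: timedelta) -> None:
    # Fail the pipeline if the newest record is older than the agreed freshness window.
    with conn.cursor() as cur:
        cur.execute(f"SELECT MAX({cursor_column}) FROM {table}")
        latest = cur.fetchone()[0]
    if latest is None or datetime.now(timezone.utc) - latest > max_lag:
        raise RuntimeError(f"{table} is stale: latest {cursor_column} = {latest}")

conn = psycopg2.connect("dbname=warehouse user=analytics")  # placeholder connection
assert_freshness(conn, "raw_stripe.invoice", "updated_at", timedelta(hours=2))
```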

Document as you build

dbt docs aren’t busywork: they’re how teams scale understanding.

  • add model descriptions
  • annotate critical columns (IDs, timestamps, currency)
  • define exposures for BI dashboards where useful

4) Orchestrate with Airflow: dependable runs, observable pipelines

A practical DAG design

A common production-friendly DAG flow:

  • Task group: Ingestion
      • run Airbyte sync for each source (parallelizable)
  • Task group: Transform
      • dbt run for staging + intermediate + marts
  • Task group: Quality
      • dbt test
      • optional: custom validations (row counts, anomaly checks)
  • Task group: Publish
      • refresh semantic/metrics layers, notify BI, push docs artifacts

Key orchestration best practices

  • Idempotency: rerunning should not corrupt results
  • Retries with backoff: especially for API sources and transient warehouse errors
  • Alerts that route correctly: ingestion failures to data engineering; data test failures to analytics engineering; SLA failures to owners
  • Separate environments: dev vs. prod credentials, isolated schemas, controlled deployment
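
Retries with backoff and owner-specific alerting map directly onto Airflow’s default_args and per-task failure callbacks. A sketch, with the notification functions left as placeholders for whatever Slack, PagerDuty, or email hooks your team actually uses:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Placeholder alert hooks: in practice these would post to Slack, PagerDuty, email, etc.
def notify_data_eng(context):
    print("Ingestion failure:", context["task_instance"].task_id)

def notify_analytics_eng(context):
    print("Data test failure:", context["task_instance"].task_id)

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,   # grow the delay on transient API/warehouse errors
    "max_retry_delay": timedelta(minutes=30),
}

with DAG(
    dag_id="elt_daily_hardened",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = BashOperator(
        task_id="airbyte_sync",
        bash_command="echo 'trigger Airbyte sync here'",  # placeholder for the real sync task
        on_failure_callback=notify_data_eng,               # ingestion failures -> data engineering
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt/project && dbt test",
        on_failure_callback=notify_analytics_eng,          # test failures -> analytics engineering
    )
    ingest >> dbt_test
```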

Production Hardening: What Usually Breaks (and How to Prevent It)

Handling late-arriving and changed data

Even “incremental” sources can deliver late updates. To handle this:

  • use windowed incremental logic (e.g., reload last 3–7 days in staging)
  • prefer CDC where possible for databases
  • build marts that can recompute impacted partitions/dates
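
One way to implement the windowed reload is to compute the lookback date in the orchestration layer and hand it to dbt as a variable that your incremental staging models filter on. The variable name reload_since and the project path below are assumptions about your dbt project, not a convention dbt imposes:

```python
import subprocess
from datetime import date, timedelta

LOOKBACK_DAYS = 7  # reload the last 7 days to catch late-arriving updates
reload_since = (date.today() - timedelta(days=LOOKBACK_DAYS)).isoformat()

# Pass the window to dbt; incremental models would filter on var('reload_since').
subprocess.run(
    [
        "dbt", "run",
        "--select", "staging",
        "--vars", f'{{"reload_since": "{reload_since}"}}',
    ],
    cwd="/opt/dbt/project",  # placeholder project path
    check=True,
)
```

Inside the incremental staging models, var('reload_since') would bound which rows get reprocessed on each run.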

Avoiding silent data quality failures

Pipelines fail loudly, until they don’t. Silent failures (bad joins, duplicated facts, partial loads) are worse than downtime.

Mitigations:

  • dbt tests on keys and relationships
  • volume checks (sudden drop/spike)
  • freshness tracking (last successful ingestion time)
  • lineage-aware alerting (only notify owners of impacted downstream models)
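
A volume check can be equally small: compare today’s row count against a trailing average and fail on a sudden drop or spike. The threshold, table, and date column below are illustrative, and the query assumes a Postgres-compatible warehouse:

```python
import psycopg2

def check_volume(conn, table: str, date_column: str, tolerance: float = 0.5) -> None:
    # Compare today's row count to the trailing 7-day daily average.
    with conn.cursor() as cur:
        cur.execute(
            f"""
            SELECT
                COUNT(*) FILTER (WHERE {date_column} = CURRENT_DATE)        AS today,
                COUNT(*) FILTER (WHERE {date_column} >= CURRENT_DATE - 7
                                   AND {date_column} <  CURRENT_DATE) / 7.0 AS trailing_avg
            FROM {table}
            """
        )
        today, trailing_avg = cur.fetchone()
    if trailing_avg and abs(today - trailing_avg) / trailing_avg > tolerance:
        raise RuntimeError(f"{table}: {today} rows today vs ~{trailing_avg:.0f}/day last week")

conn = psycopg2.connect("dbname=warehouse user=analytics")  # placeholder connection
check_volume(conn, "raw_stripe.invoice", "created_at::date")
```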

Managing cost and performance

Warehouses can get expensive quickly. Common optimizations:

  • incremental dbt models for large facts
  • clustering/partitioning where supported
  • limit raw retention if compliance allows
  • schedule heavy transformations off-peak
  • pre-aggregate where BI queries are repetitive

Example Use Case: SaaS Revenue Analytics End-to-End

Here’s how the stack might support a revenue analytics pipeline:

Sources

  • Stripe (billing)
  • Salesforce (CRM)
  • Postgres (product + subscriptions)
  • Marketing platform (leads/campaigns)

Raw (Airbyte)

  • load each source into its own raw dataset
  • schedule Stripe hourly, Salesforce hourly, Postgres CDC near-real-time, marketing daily

Transform (dbt)

  • staging: align customer identifiers and timestamps
  • intermediate: build a unified customer mapping table
  • marts:
      • fct_revenue_daily
      • dim_customer
      • fct_pipeline
      • fct_churn

Orchestration (Airflow)

  • ingest in parallel
  • run transformations with clear dependencies
  • run tests and alert on failures
  • publish curated datasets to BI and downstream consumers

The result: a reliable set of revenue metrics with traceable lineage back to raw sources.


Quick Answers (FAQ)

What does Airbyte do in a modern ELT stack?

Airbyte extracts data from databases, SaaS tools, and APIs and loads it into a destination (like a data warehouse), typically into a raw layer that preserves source fidelity and supports incremental updates.

What does dbt do in a modern ELT stack?

dbt transforms raw loaded data inside the warehouse using SQL models, enabling modular transformations, testing, documentation, and version control so analytics datasets are reliable and maintainable.

What does Airflow do in a modern ELT stack?

Airflow orchestrates the end-to-end pipeline: scheduling ingestion and transformations, enforcing dependencies, retrying failed tasks, and triggering tests and alerts to ensure data arrives on time and meets quality expectations.

In what order should you run Airbyte, dbt, and Airflow?

Airflow typically orchestrates everything: it triggers Airbyte syncs first (extract + load), then runs dbt transformations, then runs dbt tests and publishing steps.


Common Mistakes to Avoid

  • Transforming in the raw layer: keep raw immutable; transform downstream in dbt
  • Skipping tests until “later”: retrofitting data quality is painful and slow
  • Letting schema drift break everything: explicitly select columns in staging and monitor changes
  • Over-orchestrating too early: keep DAGs simple at first; add complexity only when justified
  • No ownership model: define who owns sources, models, and SLAs

Final Thoughts: A Stack That Scales with Your Team

Airbyte, dbt, and Airflow complement each other because they draw clean boundaries: ingestion, transformation, and orchestration. When designed with layered modeling, strong testing, and pragmatic orchestration, this modern ELT stack can support everything from early analytics to enterprise-grade data products, without turning your warehouse into an unmanageable mess.

With the right conventions and discipline, teams get faster time-to-insight, higher confidence in metrics, and a platform that can evolve as data sources, business logic, and reporting needs change.
