Airbyte + dbt + Airflow: Building a Modern, Reliable ELT Stack from Scratch

Build a modern ELT stack with Airbyte, dbt & Airflow: architecture, data flow, best practices, and pitfalls for reliable, scalable analytics.


By Laura Chicovis

IR by training, curious by nature. World and technology enthusiast.

Modern analytics isn’t just about having data; it’s about having trustworthy data that arrives on time, stays consistent, and can evolve as the business evolves. That’s exactly why the “modern ELT stack” has become a go-to blueprint for data teams: extract and load data quickly, then transform it in a transparent, testable way inside the warehouse.

A practical, proven combination for this approach is:

  • Airbyte for extraction + loading (connectors, incremental syncs, CDC where available)
  • dbt for transformations (SQL-based modeling, testing, documentation)
  • Apache Airflow for orchestration (scheduling, dependencies, retries, observability)

This article walks through how to design and implement this stack from scratch, covering architecture, data flow, best practices, and common pitfalls, so you can build an ELT foundation that’s resilient, scalable, and easy to maintain.


What Is a Modern ELT Stack?

ELT vs. ETL (and why ELT wins for analytics)

In ETL, transformations happen before data lands in the warehouse. In ELT, you:

  1. Extract from sources (apps, databases, APIs)
  2. Load raw data into a warehouse/lakehouse quickly
  3. Transform inside the warehouse using scalable compute and SQL-based logic

ELT is especially effective when using cloud data warehouses (Snowflake, BigQuery, Redshift, Databricks SQL, Postgres for smaller setups) because:

  • loading raw data fast preserves fidelity
  • transformations are version-controlled and reproducible
  • compute can scale as needs grow

Why Airbyte + dbt + Airflow Works So Well Together

Airbyte: standardize ingestion without reinventing connectors

Airbyte is designed to simplify ingestion across many sources. Typical benefits include:

  • a large connector catalog (databases, SaaS tools, files)
  • incremental sync patterns for efficiency
  • schema evolution handling (depending on destination + setup)
  • normalization options (though many teams prefer dbt for downstream modeling)

dbt: transform with software engineering discipline

dbt makes transformations:

  • modular (models build on each other)
  • testable (unique, not null, accepted values, relationships, custom tests)
  • documented (auto-generated lineage graphs + docs)
  • deployable (CI checks, environments, versioning)

Airflow: production-grade orchestration

Airflow is ideal when you need:

  • complex dependencies (multiple sources feeding shared models)
  • retries and failure handling
  • SLAs and alerting hooks
  • a single “control plane” for pipelines beyond ingestion and dbt (ML jobs, exports, reverse ETL triggers, etc.)

Reference Architecture: How the Stack Fits Together

A clean design is usually a three-layer warehouse structure:

1) Raw layer (Bronze)

  • Airbyte loads data “as-is” into a raw schema
  • tables are typically source-aligned (e.g., raw_shopify.orders, raw_salesforce.leads)
  • minimal assumptions; prioritize completeness and traceability

2) Staging layer (Silver)

  • dbt creates staging models that:
      • standardize column names/types
      • handle deduplication
      • apply basic cleaning
      • define stable primary keys
  • these models create consistent building blocks across sources

3) Marts layer (Gold)

  • dbt creates business-facing models:
      • facts and dimensions
      • KPI tables
      • curated datasets for BI tools
  • logic is business-aligned, documented, and tested

Orchestration flow (Airflow)

A common DAG pattern is:

  1. Trigger or run Airbyte sync(s)
  2. Validate load completion (and optionally row counts / freshness checks)
  3. Run dbt (staging → marts)
  4. Run dbt tests
  5. Publish artifacts (docs, exposures) and notify stakeholders on failures
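
As a concrete illustration, here is a minimal Airflow DAG sketch of that pattern. It assumes the Airbyte provider package is installed and that a dbt project lives at /opt/dbt/project; the Airbyte connection ID and paths are placeholders, not a prescribed layout:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="elt_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    # 1–2. Trigger the Airbyte connection that loads the raw layer and wait for completion
    airbyte_sync = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync_raw",
        airbyte_conn_id="airbyte_default",           # Airflow connection to the Airbyte API
        connection_id="REPLACE-WITH-CONNECTION-ID",  # placeholder Airbyte connection ID
    )

    # 3. Run dbt models (staging -> marts), then 4. run dbt tests
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/project && dbt run",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt/project && dbt test",
    )

    airbyte_sync >> dbt_run >> dbt_test
```

The validation and publish steps from the list above would slot in as additional tasks around the dbt commands.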

Step-by-Step: Building the Stack from Scratch

1) Define sources, destinations, and data contracts

Before tools, define the “shape” of the platform:

  • Sources: production DBs (Postgres/MySQL), SaaS tools (Stripe, HubSpot), event streams, files
  • Destination: Snowflake/BigQuery/Redshift/Databricks/Postgres
  • Contracts (lightweight but powerful; see the sketch below):
      • expected freshness (e.g., hourly, daily)
      • key fields and uniqueness expectations
      • PII handling rules (masking, access control, retention)
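
One lightweight way to make these contracts concrete is to capture them as versioned code that lives next to the pipeline. The sketch below is purely illustrative; the field names and the Stripe example are assumptions, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class DataContract:
    source: str                  # e.g. "stripe"
    expected_freshness: str      # e.g. "hourly", "daily"
    primary_keys: list[str]      # fields expected to be unique and not null
    pii_fields: list[str] = field(default_factory=list)  # columns requiring masking or restricted access

# Example contract for a billing source (values are illustrative)
stripe_invoices = DataContract(
    source="stripe",
    expected_freshness="hourly",
    primary_keys=["invoice_id"],
    pii_fields=["customer_email"],
)
```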

Practical tip: start with the top 2–3 business-critical sources and prove the pattern before scaling connector count.


2) Set up Airbyte for extraction + loading

Choose a sync strategy

Airbyte supports different approaches depending on connector/source:

  • Full refresh: simplest, but can be expensive at scale
  • Incremental: based on cursor fields like updated_at (common for SaaS APIs)
  • CDC (Change Data Capture): best for databases when supported, enabling low-latency incremental replication

Organize raw schemas intentionally

A useful convention:

  • raw_ schemas or datasets
  • tables named predictably: raw_salesforce.account, raw_postgres.public_users

This makes lineage and debugging dramatically easier.

Make schema evolution predictable

Schema drift is inevitable, especially with SaaS tools. To reduce breakage:

  • treat the raw layer as append-only when possible
  • keep transformations tolerant (select explicit columns in dbt staging, and deliberately add new fields)
  • monitor for new/removed columns to avoid silent failures downstream
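
A simple way to monitor for drift is to compare the warehouse’s information_schema against the columns your staging models expect. The sketch below assumes a Postgres-compatible warehouse reachable via psycopg2; the table, schema, and column names are illustrative:

```python
import psycopg2

# Columns the staging model selects explicitly; anything else is drift to review.
EXPECTED_COLUMNS = {"id", "name", "created_date", "last_modified_date"}

def check_column_drift(conn, schema: str, table: str) -> None:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT column_name
            FROM information_schema.columns
            WHERE table_schema = %s AND table_name = %s
            """,
            (schema, table),
        )
        actual = {row[0] for row in cur.fetchall()}

    added = actual - EXPECTED_COLUMNS    # new fields: add to staging deliberately
    removed = EXPECTED_COLUMNS - actual  # removed fields: will break staging models
    if added or removed:
        raise RuntimeError(f"Schema drift in {schema}.{table}: added={added}, removed={removed}")

conn = psycopg2.connect("dbname=warehouse user=analytics")  # placeholder connection
check_column_drift(conn, "raw_salesforce", "account")
```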

3) Build dbt foundations: staging, marts, tests, docs

Recommended dbt project structure

A clear structure pays off quickly:

  • models/staging/<source>/...
  • models/intermediate/... (optional, for shared transformations)
  • models/marts/<domain>/... (finance, product, growth, etc.)

Start with staging models that “normalize reality”

Staging models typically:

  • rename columns to consistent conventions (snake_case)
  • cast types explicitly
  • deduplicate on keys
  • define surrogate_key when needed

Example staging goals:

  • stg_salesforce_lead has consistent identifiers and timestamps
  • stg_stripe_invoice has stable monetary fields and currency handling

Add tests early (before your warehouse becomes a swamp)

High-value tests include:

  • unique and not_null on primary keys
  • relationship tests between facts and dimensions
  • accepted values tests for enum-like fields
  • freshness checks (via dbt packages or your orchestration layer)
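
If you implement freshness in the orchestration layer rather than through a dbt package, it can be as small as comparing the newest cursor value against the agreed freshness window. A minimal sketch, assuming a Postgres-compatible warehouse and a timezone-aware updated_at cursor column (names and connection details are placeholders):

```python
from datetime import datetime, timedelta, timezone

import psycopg2

def assert_freshness(conn, table: str, cursor_column: str, max_lag: timedelta) -> None:
    # Fail the pipeline if the newest record is older than the agreed freshness window.
    with conn.cursor() as cur:
        cur.execute(f"SELECT MAX({cursor_column}) FROM {table}")
        latest = cur.fetchone()[0]
    if latest is None or datetime.now(timezone.utc) - latest > max_lag:
        raise RuntimeError(f"{table} is stale: latest {cursor_column} = {latest}")

conn = psycopg2.connect("dbname=warehouse user=analytics")  # placeholder connection
assert_freshness(conn, "raw_stripe.invoice", "updated_at", timedelta(hours=2))
```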

Document as you build

dbt docs aren’t busywork: they’re how teams scale understanding.

  • add model descriptions
  • annotate critical columns (IDs, timestamps, currency)
  • define exposures for BI dashboards where useful

4) Orchestrate with Airflow: dependable runs, observable pipelines

A practical DAG design

A common production-friendly DAG flow:

  • Task group: Ingestion
      • run Airbyte sync for each source (parallelizable)
  • Task group: Transform
      • dbt run for staging + intermediate + marts
  • Task group: Quality
      • dbt test
      • optional: custom validations (row counts, anomaly checks)
  • Task group: Publish
      • refresh semantic/metrics layers, notify BI, push docs artifacts

Key orchestration best practices

  • Idempotency: rerunning should not corrupt results
  • Retries with backoff: especially for API sources and transient warehouse errors
  • Alerts that route correctly: ingestion failures to data engineering; data test failures to analytics engineering; SLA failures to owners
  • Separate environments: dev vs. prod credentials, isolated schemas, controlled deployment
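
Retries with backoff and owner-specific alerting map directly onto Airflow’s default_args and per-task failure callbacks. A sketch, with the notification functions left as placeholders for whatever Slack, PagerDuty, or email hooks your team actually uses:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Placeholder alert hooks: in practice these would post to Slack, PagerDuty, email, etc.
def notify_data_eng(context):
    print("Ingestion failure:", context["task_instance"].task_id)

def notify_analytics_eng(context):
    print("Data test failure:", context["task_instance"].task_id)

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,   # grow the delay on transient API/warehouse errors
    "max_retry_delay": timedelta(minutes=30),
}

with DAG(
    dag_id="elt_daily_hardened",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = BashOperator(
        task_id="airbyte_sync",
        bash_command="echo 'trigger Airbyte sync here'",  # placeholder for the real sync task
        on_failure_callback=notify_data_eng,               # ingestion failures -> data engineering
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt/project && dbt test",
        on_failure_callback=notify_analytics_eng,          # test failures -> analytics engineering
    )
    ingest >> dbt_test
```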

Production Hardening: What Usually Breaks (and How to Prevent It)

Handling late-arriving and changed data

Even “incremental” sources can deliver late updates. To handle this:

  • use windowed incremental logic (e.g., reload last 3–7 days in staging)
  • prefer CDC where possible for databases
  • build marts that can recompute impacted partitions/dates
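
One way to implement the windowed reload is to compute the lookback date in the orchestration layer and hand it to dbt as a variable that your incremental staging models filter on. The variable name reload_since and the project path below are assumptions about your dbt project, not a convention dbt imposes:

```python
import subprocess
from datetime import date, timedelta

LOOKBACK_DAYS = 7  # reload the last 7 days to catch late-arriving updates
reload_since = (date.today() - timedelta(days=LOOKBACK_DAYS)).isoformat()

# Pass the window to dbt; incremental models would filter on var('reload_since').
subprocess.run(
    [
        "dbt", "run",
        "--select", "staging",
        "--vars", f'{{"reload_since": "{reload_since}"}}',
    ],
    cwd="/opt/dbt/project",  # placeholder project path
    check=True,
)
```

Inside the incremental staging models, var('reload_since') would bound which rows get reprocessed on each run.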

Avoiding silent data quality failures

Pipelines fail loudly, until they don’t. Silent failures (bad joins, duplicated facts, partial loads) are worse than downtime.

Mitigations:

  • dbt tests on keys and relationships
  • volume checks (sudden drop/spike)
  • freshness tracking (last successful ingestion time)
  • lineage-aware alerting (only notify owners of impacted downstream models)
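
A volume check can be equally small: compare today’s row count against a trailing average and fail on a sudden drop or spike. The threshold, table, and date column below are illustrative, and the query assumes a Postgres-compatible warehouse:

```python
import psycopg2

def check_volume(conn, table: str, date_column: str, tolerance: float = 0.5) -> None:
    # Compare today's row count to the trailing 7-day daily average.
    with conn.cursor() as cur:
        cur.execute(
            f"""
            SELECT
                COUNT(*) FILTER (WHERE {date_column} = CURRENT_DATE)        AS today,
                COUNT(*) FILTER (WHERE {date_column} >= CURRENT_DATE - 7
                                   AND {date_column} <  CURRENT_DATE) / 7.0 AS trailing_avg
            FROM {table}
            """
        )
        today, trailing_avg = cur.fetchone()
    if trailing_avg and abs(today - trailing_avg) / trailing_avg > tolerance:
        raise RuntimeError(f"{table}: {today} rows today vs ~{trailing_avg:.0f}/day last week")

conn = psycopg2.connect("dbname=warehouse user=analytics")  # placeholder connection
check_volume(conn, "raw_stripe.invoice", "created_at::date")
```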

Managing cost and performance

Warehouses can get expensive quickly. Common optimizations:

  • incremental dbt models for large facts
  • clustering/partitioning where supported
  • limit raw retention if compliance allows
  • schedule heavy transformations off-peak
  • pre-aggregate where BI queries are repetitive

Example Use Case: SaaS Revenue Analytics End-to-End

Here’s how the stack might support a revenue analytics pipeline:

Sources

  • Stripe (billing)
  • Salesforce (CRM)
  • Postgres (product + subscriptions)
  • Marketing platform (leads/campaigns)

Raw (Airbyte)

  • load each source into its own raw dataset
  • schedule Stripe hourly, Salesforce hourly, Postgres CDC near-real-time, marketing daily

Transform (dbt)

  • staging: align customer identifiers and timestamps
  • intermediate: build a unified customer mapping table
  • marts:
      • fct_revenue_daily
      • dim_customer
      • fct_pipeline
      • fct_churn

Orchestration (Airflow)

  • ingest in parallel
  • run transformations with clear dependencies
  • run tests and alert on failures
  • publish curated datasets to BI and downstream consumers

The result: a reliable set of revenue metrics with traceable lineage back to raw sources.


Quick Answers (FAQ)

What does Airbyte do in a modern ELT stack?

Airbyte extracts data from databases, SaaS tools, and APIs and loads it into a destination (like a data warehouse), typically into a raw layer that preserves source fidelity and supports incremental updates.

What does dbt do in a modern ELT stack?

dbt transforms raw loaded data inside the warehouse using SQL models, enabling modular transformations, testing, documentation, and version control so analytics datasets are reliable and maintainable.

What does Airflow do in a modern ELT stack?

Airflow orchestrates the end-to-end pipeline: scheduling ingestion and transformations, enforcing dependencies, retrying failed tasks, and triggering tests and alerts to ensure data arrives on time and meets quality expectations.

In what order should you run Airbyte, dbt, and Airflow?

Airflow typically orchestrates everything: it triggers Airbyte syncs first (extract + load), then runs dbt transformations, then runs dbt tests and publishing steps.


Common Mistakes to Avoid

  • Transforming in the raw layer: keep raw immutable; transform downstream in dbt
  • Skipping tests until “later”: retrofitting data quality is painful and slow
  • Letting schema drift break everything: explicitly select columns in staging and monitor changes
  • Over-orchestrating too early: keep DAGs simple at first; add complexity only when justified
  • No ownership model: define who owns sources, models, and SLAs

Final Thoughts: A Stack That Scales with Your Team

Airbyte, dbt, and Airflow complement each other because they draw clean boundaries: ingestion, transformation, and orchestration. When designed with layered modeling, strong testing, and pragmatic orchestration, this modern ELT stack can support everything from early analytics to enterprise-grade data products, without turning your warehouse into an unmanageable mess.

With the right conventions and discipline, teams get faster time-to-insight, higher confidence in metrics, and a platform that can evolve as data sources, business logic, and reporting needs change.
