
Data Quality in Production: Integrating Great Expectations, dbt Tests, and DataHub for Trustworthy Analytics

Integrate Great Expectations, dbt tests, and DataHub to ensure production data quality, reduce alert fatigue, and deliver trustworthy analytics.





By Laura Chicovis

IR by training, curious by nature. World and technology enthusiast.

Production data is only as valuable as it is reliable. When dashboards don’t match finance numbers, when a model silently starts drifting, or when a pipeline breaks because a “nullable” column suddenly isn’t, the real cost isn’t just a failed job; it’s lost confidence.

A modern, scalable approach to data quality in production combines three complementary layers:

  • dbt tests to enforce quality close to transformations
  • Great Expectations to run richer validations and profiling at critical points in your pipelines
  • DataHub to document, discover, and operationalize quality through metadata, ownership, and lineage

This article breaks down how these tools fit together, the architecture patterns that work in real production environments, and how to design tests that improve trust without creating an alert-fatigue machine.


Why Data Quality in Production Is Hard (and Why It Matters)

In development, data issues are often obvious: missing columns, failed joins, type errors. In production, the most expensive problems are subtle:

  • A metric is “technically correct” but semantically wrong due to upstream changes
  • A null-rate gradually increases until it breaks downstream logic
  • A dimension table gets duplicate keys after an upstream deploy
  • A source system introduces a new enum value that isn’t mapped

The goal of production-grade data quality isn’t perfection; it’s early detection, fast triage, and clear accountability.


The Three-Layer Quality Stack: dbt + Great Expectations + DataHub

1) dbt Tests: Fast, Close to the Transformations

dbt tests are ideal for enforcing baseline assumptions in your warehouse right where transformations happen:

  • Schema tests like unique, not_null, and relationships
  • Custom SQL tests for business rules (e.g., “orders must have a positive total”)
  • Generic tests to standardize validation patterns across models
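
For example, a minimal schema.yml covering the built-in schema tests might look like the sketch below. Model and column names are hypothetical, and recent dbt versions may prefer the data_tests: key over tests::

```yaml
# models/marts/schema.yml (hypothetical model and column names)
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```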

Best use cases:

  • Primary key uniqueness
  • Non-null constraints on required fields
  • Referential integrity checks between fact and dimension tables
  • Row-count sanity checks between staging and mart layers

Strength: Lightweight, native to analytics engineering workflows, easy to version control.

Limitation: While powerful, dbt tests are generally SQL-based and warehouse-centric; some teams want richer expectation libraries, profiling, or validations earlier/later in the pipeline.


2) Great Expectations: Rich Validations and Profiling Where It Counts

Great Expectations is a popular open-source framework designed specifically for data quality. It allows you to express “expectations” about your data (e.g., “this column should never be null” or “values must be within a range”), validate datasets, and generate human-readable data quality reports (its Data Docs).
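
As a small sketch of what that looks like in code, the snippet below uses the long-standing pandas shortcut API with an illustrative file path; newer GX releases (1.x) organize the same expectations around a data context and validator, so adapt to your installed version:

```python
import pandas as pd
import great_expectations as ge

# Wrap a DataFrame so expectation methods become available on it
orders = ge.from_pandas(pd.read_parquet("staging/orders.parquet"))

orders.expect_column_values_to_not_be_null("order_id")
orders.expect_column_values_to_be_between("order_total", min_value=0, max_value=100_000)

# Run all recorded expectations and inspect the overall outcome
result = orders.validate()
print(result.success)
```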

Best use cases:

  • Validations on ingestion and pre-warehouse stages (files, lakehouse, intermediate tables)
  • Column-level rules beyond basic constraints (ranges, regex, set membership)
  • Profiling new datasets to establish baseline expectations
  • Combining multiple checks into suites and tracking validation results across runs

Strength: Expressive, extensible, and useful for complex datasets and multi-step pipelines.

Limitation: If implemented without a strategy, it can become “yet another testing framework” with duplicated checks. The key is clear boundaries: use dbt for transformation-level basics; use Great Expectations for deeper or earlier/later validations.


3) DataHub: Making Quality Discoverable, Governed, and Actionable

DataHub is a metadata platform (often described as a modern data catalog) that helps teams document datasets, trace lineage, assign ownership, and improve discoverability across the organization. The real advantage in a production data quality program is that it helps answer:

  • What is this dataset?
  • Who owns it?
  • Where does it come from (lineage)?
  • Can I trust it today (quality signals)?

Best use cases:

  • Centralized visibility of data assets and lineage
  • Ownership, domains, and documentation
  • Surfacing quality and freshness indicators to downstream consumers
  • Making data quality a shared operational practice, not just an engineering detail

Strength: Bridges technical checks with organizational context (ownership + lineage + documentation).
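
As a rough sketch of how that context can be pushed programmatically, the snippet below uses the acryl-datahub Python emitter to attach a description and an owner to a dataset. The server URL, platform, dataset name, and owner are assumptions, and aspect class names can vary slightly between releases:

```python
from datahub.emitter.mce_builder import make_dataset_urn, make_user_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetPropertiesClass,
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)

emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")
urn = make_dataset_urn(platform="snowflake", name="marts.fct_orders", env="PROD")

# Human-readable description plus a custom property pointing at the quality checkpoint
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=urn,
    aspect=DatasetPropertiesClass(
        description="Curated orders fact table used by finance dashboards.",
        customProperties={"quality_checkpoint": "mart_validation"},
    ),
))

# Make ownership explicit so failures can be routed to the right team
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=urn,
    aspect=OwnershipClass(owners=[
        OwnerClass(owner=make_user_urn("data-platform-team"), type=OwnershipTypeClass.DATAOWNER),
    ]),
))
```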


A Practical Production Architecture (That Doesn’t Overcomplicate Things)

A proven pattern is to run validations at two key checkpoints:

Checkpoint A: Ingestion / Staging Validation (Great Expectations)

Validate raw or lightly processed data right after ingestion:

  • Required columns exist
  • Null rates within expected thresholds
  • Format checks (dates, emails, IDs)
  • Allowed values (country codes, statuses)
  • Basic anomaly detection (e.g., row count deviates drastically)

This prevents broken or malformed data from polluting downstream layers.
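
A sketch of those ingestion checks with Great Expectations might look like the following; the file name, thresholds, and allowed status values are assumptions, and the same pandas shortcut API caveat from above applies:

```python
import pandas as pd
import great_expectations as ge

raw = ge.from_pandas(pd.read_csv("landing/orders_2024_06_01.csv"))

# Required columns exist (exact_match=False tolerates extra columns)
raw.expect_table_columns_to_match_set(
    ["order_id", "customer_id", "order_total", "status", "country_code"],
    exact_match=False,
)
# Null rate stays within a threshold: mostly is the minimum passing fraction
raw.expect_column_values_to_not_be_null("customer_id", mostly=0.98)
# Allowed enum values
raw.expect_column_values_to_be_in_set("status", ["created", "paid", "shipped", "cancelled"])
# Crude volume check against expected daily bounds
raw.expect_table_row_count_to_be_between(min_value=50_000, max_value=500_000)

result = raw.validate()
if not result.success:
    raise ValueError("Ingestion validation failed; inspect the result object for details")
```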

Checkpoint B: Transformation / Mart Validation (dbt Tests + Select Great Expectations)

At the curated layer (marts/semantic tables), enforce:

  • Unique surrogate keys
  • Relationship integrity
  • Business rules tied to reporting
  • Metric sanity tests (e.g., refunds <= sales)

If you already use dbt heavily, keep the majority of these checks in dbt to avoid duplication.
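
For instance, the refunds <= sales rule above can stay in dbt as a generic test. The sketch below assumes the dbt_utils package is installed and uses a hypothetical model name:

```yaml
# models/marts/schema.yml (assumes dbt_utils; hypothetical model name)
models:
  - name: fct_daily_sales
    tests:
      - dbt_utils.expression_is_true:
          expression: "refunds <= sales"
```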

Metadata + Visibility Layer (DataHub)

Publish:

  • Dataset metadata (descriptions, tags, owners)
  • Lineage (upstream/downstream dependencies)
  • Quality results and run history (where integrated)
  • Links to run logs or documentation

This makes quality observable and discoverable, especially for analysts and stakeholders.
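
Lineage can also be emitted programmatically when it isn’t captured automatically by an ingestion source. A minimal sketch with the acryl-datahub emitter, using assumed platform and table names, looks like this:

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")

# Declare raw.orders as an upstream of the curated fct_orders table
upstream = UpstreamClass(
    dataset=make_dataset_urn(platform="snowflake", name="raw.orders", env="PROD"),
    type=DatasetLineageTypeClass.TRANSFORMED,
)

emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="snowflake", name="marts.fct_orders", env="PROD"),
    aspect=UpstreamLineageClass(upstreams=[upstream]),
))
```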


What to Test in Production: A High-ROI Data Quality Checklist

An effective data quality strategy focuses on a handful of categories that catch most production failures.

1) Schema & Contract Tests

  • Column existence
  • Data types
  • Required vs optional fields

Example: If a source silently renames customer_id to client_id, you want a hard failure early.

2) Uniqueness & Key Integrity

  • Primary key uniqueness
  • Duplicate detection
  • Surrogate key collisions

Example: Duplicate order_id values can double-count revenue.

3) Referential Integrity (Relationships)

  • Fact rows reference valid dimension keys
  • No orphaned foreign keys

Example: Orders reference product IDs that don’t exist in dim_products.

4) Valid Ranges & Distributions

  • Amounts are within realistic bounds
  • Dates aren’t in the future (unless expected)
  • Percentages remain between 0 and 1 (or 0 and 100)

Example: A pricing bug suddenly generates negative totals.

5) Null & Completeness Thresholds

  • Not just “not null,” but “null rate under X%”
  • Completeness by segment (e.g., region, product line)

Example: shipping_country nulls spike only for one warehouse integration.

6) Freshness & Volume Monitoring

  • Tables updated within expected time windows
  • Row counts within expected ranges

Example: A pipeline runs but produces 90% fewer rows due to an upstream filter change.
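
Freshness checks are a natural fit for dbt source definitions and are evaluated with the dbt source freshness command. The sketch below uses a hypothetical source, table, and thresholds:

```yaml
# models/staging/sources.yml (hypothetical source and thresholds)
version: 2

sources:
  - name: erp
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
```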


How to Avoid Alert Fatigue (The Silent Killer of Data Quality Programs)

A data quality system that constantly screams gets ignored. The fix is to treat data quality as a product with a thoughtful signal strategy.

Use Severity Levels

  • Critical: Pages the on-call immediately (core revenue metrics, regulatory data)
  • High: Requires same-day investigation
  • Medium/Low: Tracked and triaged during business hours

Prefer Thresholds Over Absolutes Where Appropriate

Not every null should fail a pipeline. Many production datasets are inherently noisy and imperfect.

  • Fail if null_rate(email) > 2%
  • Warn if row_count < 0.85 * 7_day_avg
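
A minimal sketch of such threshold checks, assuming a pandas DataFrame and a precomputed trailing average (function names and threshold values are illustrative):

```python
import pandas as pd

def check_null_rate(df: pd.DataFrame, column: str, max_null_rate: float = 0.02) -> str:
    """Return 'fail' when the null rate exceeds the threshold, otherwise 'ok'."""
    null_rate = df[column].isna().mean()
    return "fail" if null_rate > max_null_rate else "ok"

def check_row_count(today_count: int, seven_day_avg: float, warn_ratio: float = 0.85) -> str:
    """Return 'warn' when today's volume drops well below the trailing average."""
    return "warn" if today_count < warn_ratio * seven_day_avg else "ok"

# Example: a moderate volume dip triggers a warning, not a pipeline failure
print(check_row_count(today_count=41_000, seven_day_avg=50_000))  # -> "warn"
```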

Route Alerts to Owners (and Make Ownership Visible)

Quality issues shouldn’t bounce around Slack channels. Assign dataset ownership and route failures accordingly; this is where metadata platforms shine.


Integrating the Stack: What “Good” Looks Like

When these tools work together, the workflow becomes repeatable:

  1. Developers define transformations and baseline tests in dbt.
  2. Data engineers validate ingestion/staging datasets with Great Expectations suites.
  3. Quality results and dataset context are published so downstream users can see what’s trusted, what’s failing, and who owns what.
  4. Lineage makes impact analysis fast: if a source table fails validation, teams immediately see which dashboards and models are affected.

This creates a production environment where issues are caught early, triage is faster, and trust steadily increases.


Common Questions

What is the best way to ensure data quality in production?

The most effective approach is layered: use dbt tests for transformation-level checks, Great Expectations for richer validation (especially at ingestion and staging), and a metadata platform like DataHub to make quality results visible through ownership, documentation, and lineage. For a deeper look at how lineage supports compliance and rapid debugging, see data pipeline auditing and lineage.

Should I use dbt tests or Great Expectations?

Use dbt tests for fast, SQL-native checks close to your transformations (uniqueness, not-null, relationships). Use Great Expectations when you need richer rules, profiling, or validations outside dbt’s typical scope (ingestion files, complex column expectations, multi-step validation suites). If you want to go deeper on GX patterns, see automated data validation and testing with Great Expectations.

What should I test first for the highest ROI?

Start with: schema/contract checks, primary key uniqueness, referential integrity, null thresholds on critical fields, and freshness/row-count monitoring. These catch the majority of production incidents with minimal effort.

How do I prevent too many data quality alerts?

Use severity levels, thresholds instead of absolute rules, and route alerts by dataset ownership. Focus on a small number of high-signal checks rather than testing everything. If you’re designing alerting that stays actionable, consider building alerts and notifications with Grafana and Airflow.


Final Thoughts: Production Data Quality Is a System, Not a Script

Reliable analytics doesn’t come from one tool; it comes from a system that combines testing, validation, and visibility. dbt tests, Great Expectations, and DataHub each solve a different part of the problem. Together, they create a production-ready foundation where data issues are detected early, communicated clearly, and resolved faster.

The result is not just fewer pipeline failures; it’s stronger trust in metrics, better decision-making, and a data platform that scales with confidence.
