How to Build a Data Pipeline: Architecture, Tools & Best Practices (2026)
A data pipeline moves data from source systems to destinations where it can be analysed. The core components are: ingestion (extracting from sources), storage (landing in a data lake or warehouse), transformation (cleaning, modelling), and serving (BI tools, ML models). The modern default stack is Airbyte (ingestion) + Snowflake/BigQuery (warehouse) + dbt (transformation) + Airflow or Dagster (orchestration) + Metabase/Tableau (BI). Start with batch pipelines before building streaming — most analytical use cases do not need real-time data.
What Is a Data Pipeline?
A data pipeline is an automated system that moves data from where it is generated (source systems: databases, APIs, event streams, SaaS tools) to where it is analysed (data warehouse, data lake, BI tools, ML models).
The pipeline handles extraction, validation, transformation, loading, and scheduling — automatically, reliably, and with alerting when something breaks.
Batch vs Streaming Pipelines
Before designing your pipeline, decide which model you need:
Batch processing: Data moves in scheduled chunks (hourly, daily). Simpler to build, easier to debug, and sufficient for most analytical use cases — the large majority of business intelligence does not need real-time data.
Streaming: Data moves continuously as events occur (milliseconds to seconds latency). Required for: real-time fraud detection, live dashboards, event-driven ML models, operational analytics.
Recommendation: Start with batch. Add streaming only when you can clearly articulate a business decision that requires sub-minute data freshness.
Core Pipeline Components
1. Ingestion Layer
Extracts data from sources and lands it in your storage layer.
Managed connectors (recommended):
- Fivetran — 500+ pre-built connectors, fully managed, zero maintenance
- Airbyte — open-source, 350+ connectors, self-hosted or cloud
- Stitch — simple, affordable, good for small teams
Custom ingestion (when connectors don't exist):
- Python scripts using requests, SQLAlchemy, or vendor SDKs
- Apache Kafka or AWS Kinesis for streaming sources
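When no managed connector exists, a custom batch ingestion script usually boils down to: fetch records, stamp them with a load timestamp, and land them as raw files. A minimal sketch of the landing step (the `source_name` and record shape are illustrative, not from a specific source):

```python
import json
import pathlib
from datetime import datetime, timezone

def land_records(records, dest_dir, source_name):
    """Append a batch of records as newline-delimited JSON in dest_dir.

    Each record is stamped with a _loaded_at timestamp so downstream
    transformations can reason about data freshness.
    """
    loaded_at = datetime.now(timezone.utc).isoformat()
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # One file per source per day keeps the raw layer easy to reprocess.
    path = dest / f"{source_name}_{loaded_at[:10]}.jsonl"
    with path.open("a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps({**rec, "_loaded_at": loaded_at}) + "\n")
    return path
```

In a real pipeline the extraction half would call the vendor API (via requests or an SDK) with pagination and retries; landing to local disk here stands in for an object store such as S3.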
2. Storage Layer
Where raw data lands.
Data warehouse (structured, SQL queries, analytics):
- Snowflake — best overall, any cloud, pay-per-query
- BigQuery — best on GCP, serverless, excellent for large datasets
- Redshift — best on AWS, good if you are AWS-native
Data lake (raw files, semi-structured, large scale):
- AWS S3 + Glue/Athena
- Azure Data Lake Storage + Synapse
- Databricks Lakehouse — combines lake and warehouse
3. Transformation Layer
Cleans, joins, and models raw data into analytics-ready tables.
dbt (data build tool) is the standard for warehouse transformations:
- Write transformations as SELECT statements
- dbt handles materialisation, dependency ordering, and documentation
- Integrated testing (not-null, unique, referential integrity checks)
- Version control your transformation logic in Git
-- dbt model: orders_daily.sql
SELECT
    date_trunc('day', created_at) AS order_date,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS revenue,
    AVG(total_amount) AS avg_order_value
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY 1
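The integrated tests mentioned above live in a YAML file alongside the model. A minimal sketch for the example model (file path and column names assume the orders_daily model shown here; adjust to your project):

```yaml
# models/orders_daily.yml — minimal dbt test sketch
version: 2
models:
  - name: orders_daily
    columns:
      - name: order_date
        tests:
          - not_null
          - unique
      - name: total_orders
        tests:
          - not_null
```

Running `dbt test` will then assert these constraints on every build and fail the pipeline when they are violated.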
4. Orchestration Layer
Schedules pipeline runs and manages dependencies between steps.
- Apache Airflow — most mature, large ecosystem, requires operational overhead
- Prefect — Python-native, excellent developer experience, managed cloud option
- Dagster — asset-based, great for data teams, built-in data lineage
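At their core, all three orchestrators do the same thing: resolve a dependency graph and execute tasks in order. A toy sketch of that idea using Python's standard-library topological sorter (the task names are illustrative; real orchestrators add retries, scheduling, and state on top):

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps):
    """Execute tasks in dependency order.

    tasks: name -> zero-argument callable
    deps:  name -> set of upstream task names that must run first
    Returns the execution order used.
    """
    # static_order() yields each task only after all its dependencies.
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        tasks[name]()
    return order
```

An orchestrator also handles what this sketch omits: retrying failed tasks, backfilling historical runs, and alerting — which is why teams adopt Airflow, Prefect, or Dagster rather than cron plus scripts.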
5. Serving Layer
Delivers processed data to consumers.
- BI tools: Metabase (open-source), Looker, Tableau, Power BI
- ML features: Feature stores (Feast, Tecton)
- APIs: Expose aggregated data via REST or GraphQL for product features
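Exposing aggregated tables as an API can be as simple as serialising pre-computed rows as JSON. A minimal sketch using only the standard library (the endpoint path and sample rows are hypothetical; production services would query the analytics schema and add auth):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical pre-aggregated rows; in practice, read from the warehouse.
DAILY_ORDERS = [
    {"order_date": "2026-01-01", "total_orders": 120, "revenue": 5400.0},
]

def build_payload(rows):
    """Serialise aggregate rows into a JSON response body."""
    return json.dumps({"data": rows, "count": len(rows)}).encode("utf-8")

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics/orders_daily":
            body = build_payload(DAILY_ORDERS)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve: HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```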
Data Quality and Observability
A pipeline without monitoring is a liability. Implement:
- Schema validation: Assert expected columns and types on ingestion
- dbt tests: not_null, unique, accepted_values, referential integrity
- Row count checks: Alert when a daily load falls more than 30% below the historical average
- Freshness checks: Alert when data is older than expected
- Tools: Monte Carlo, Great Expectations, dbt Cloud observability
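The row count and freshness checks above are simple enough to sketch directly; a dedicated tool adds history tracking and alert routing on top. A minimal sketch (the 30% threshold and 24-hour window mirror the examples above):

```python
from datetime import datetime, timedelta, timezone
from statistics import mean

def row_count_ok(today_count, history, max_drop=0.30):
    """True if today's load is within max_drop of the historical average."""
    if not history:
        return True  # no baseline yet — nothing to compare against
    return today_count >= mean(history) * (1 - max_drop)

def is_fresh(last_loaded_at, max_age=timedelta(hours=24)):
    """True if the most recent load is within the expected freshness window."""
    return datetime.now(timezone.utc) - last_loaded_at <= max_age
```

Wiring these into the orchestrator as a post-load task — failing the run and paging the team when either returns False — covers the most common silent-failure modes: an upstream source quietly dropping rows, or a connector that stopped syncing.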
Pipeline Architecture Example
Postgres DB → Airbyte → Snowflake (raw schema)
Salesforce  → Airbyte → Snowflake (raw schema)
Stripe API  → Airbyte → Snowflake (raw schema)
                            ↓
                      dbt transforms
                            ↓
                Snowflake (analytics schema)
                            ↓
               Metabase / Tableau / Looker
Build your data infrastructure with Ortem. Talk to our data engineering team → or contact us to scope your data pipeline project.
About the Author
Technical Lead, Ortem Technologies
Ravi Jadhav is a Technical Lead at Ortem Technologies with 12 years of experience leading development teams and managing complex software projects. He brings a deep understanding of software engineering best practices, agile methodologies, and scalable system architecture. Ravi is passionate about building high-performing engineering teams and delivering technology solutions that drive measurable results for clients across industries.