How to Build a Data Pipeline: Architecture, Tools & Best Practices (2026)
A data pipeline moves data from source systems to destinations where it can be analysed. The core components are: ingestion (extracting from sources), storage (landing in a data lake or warehouse), transformation (cleaning, modelling), and serving (BI tools, ML models). The modern default stack is Airbyte (ingestion) + Snowflake/BigQuery (warehouse) + dbt (transformation) + Airflow or Dagster (orchestration) + Metabase/Tableau (BI). Start with batch pipelines before building streaming — most analytical use cases do not need real-time data.
What Is a Data Pipeline?
A data pipeline is an automated system that moves data from where it is generated (source systems: databases, APIs, event streams, SaaS tools) to where it is analysed (data warehouse, data lake, BI tools, ML models).
The pipeline handles extraction, validation, transformation, loading, and scheduling — automatically, reliably, and with alerting when something breaks.
Batch vs Streaming Pipelines
Before designing your pipeline, decide which model you need:
Batch processing: Data moves in scheduled chunks (hourly, daily). Simpler to build, easier to debug, and sufficient for most analytical use cases — the large majority of business intelligence does not need real-time data.
Streaming: Data moves continuously as events occur (milliseconds to seconds latency). Required for: real-time fraud detection, live dashboards, event-driven ML models, operational analytics.
Recommendation: Start with batch. Add streaming only when you can clearly articulate a business decision that requires sub-minute data freshness.
Core Pipeline Components
1. Ingestion Layer
Extracts data from sources and lands it in your storage layer.
Managed connectors (recommended):
- Fivetran — 500+ pre-built connectors, fully managed, zero maintenance
- Airbyte — open-source, 350+ connectors, self-hosted or cloud
- Stitch — simple, affordable, good for small teams
Custom ingestion (when connectors don't exist):
- Python scripts using requests, SQLAlchemy, or vendor SDKs
- Apache Kafka or AWS Kinesis for streaming sources
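When no managed connector exists, a custom batch ingestion script usually boils down to: fetch records, stamp them with a load timestamp, and land them as raw files. A minimal sketch of the landing step (the `source_name` and record shape are illustrative, not from a specific source):

```python
import json
import pathlib
from datetime import datetime, timezone

def land_records(records, dest_dir, source_name):
    """Append a batch of records as newline-delimited JSON in dest_dir.

    Each record is stamped with a _loaded_at timestamp so downstream
    transformations can reason about data freshness.
    """
    loaded_at = datetime.now(timezone.utc).isoformat()
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # One file per source per day keeps the raw layer easy to reprocess.
    path = dest / f"{source_name}_{loaded_at[:10]}.jsonl"
    with path.open("a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps({**rec, "_loaded_at": loaded_at}) + "\n")
    return path
```

In a real pipeline the extraction half would call the vendor API (via requests or an SDK) with pagination and retries; landing to local disk here stands in for an object store such as S3.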
2. Storage Layer
Where raw data lands.
Data warehouse (structured, SQL queries, analytics):
- Snowflake — best overall, any cloud, pay-per-query
- BigQuery — best on GCP, serverless, excellent for large datasets
- Redshift — best on AWS, good if you are AWS-native
Data lake (raw files, semi-structured, large scale):
- AWS S3 + Glue/Athena
- Azure Data Lake Storage + Synapse
- Databricks Lakehouse — combines lake and warehouse
3. Transformation Layer
Cleans, joins, and models raw data into analytics-ready tables.
dbt (data build tool) is the standard for warehouse transformations:
- Write transformations as SELECT statements
- dbt handles materialisation, dependency ordering, and documentation
- Integrated testing (not-null, unique, referential integrity checks)
- Version control your transformation logic in Git
-- dbt model: orders_daily.sql
SELECT
    date_trunc('day', created_at) AS order_date,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS revenue,
    AVG(total_amount) AS avg_order_value
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY 1
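The integrated tests mentioned above live in a YAML file alongside the model. A minimal sketch for the example model (file path and column names assume the orders_daily model shown here; adjust to your project):

```yaml
# models/orders_daily.yml — minimal dbt test sketch
version: 2
models:
  - name: orders_daily
    columns:
      - name: order_date
        tests:
          - not_null
          - unique
      - name: total_orders
        tests:
          - not_null
```

Running `dbt test` will then assert these constraints on every build and fail the pipeline when they are violated.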
4. Orchestration Layer
Schedules pipeline runs and manages dependencies between steps.
- Apache Airflow — most mature, large ecosystem, requires operational overhead
- Prefect — Python-native, excellent developer experience, managed cloud option
- Dagster — asset-based, great for data teams, built-in data lineage
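At their core, all three orchestrators do the same thing: resolve a dependency graph and execute tasks in order. A toy sketch of that idea using Python's standard-library topological sorter (the task names are illustrative; real orchestrators add retries, scheduling, and state on top):

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps):
    """Execute tasks in dependency order.

    tasks: name -> zero-argument callable
    deps:  name -> set of upstream task names that must run first
    Returns the execution order used.
    """
    # static_order() yields each task only after all its dependencies.
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        tasks[name]()
    return order
```

An orchestrator also handles what this sketch omits: retrying failed tasks, backfilling historical runs, and alerting — which is why teams adopt Airflow, Prefect, or Dagster rather than cron plus scripts.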
5. Serving Layer
Delivers processed data to consumers.
- BI tools: Metabase (open-source), Looker, Tableau, Power BI
- ML features: Feature stores (Feast, Tecton)
- APIs: Expose aggregated data via REST or GraphQL for product features
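Exposing aggregated tables as an API can be as simple as serialising pre-computed rows as JSON. A minimal sketch using only the standard library (the endpoint path and sample rows are hypothetical; production services would query the analytics schema and add auth):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical pre-aggregated rows; in practice, read from the warehouse.
DAILY_ORDERS = [
    {"order_date": "2026-01-01", "total_orders": 120, "revenue": 5400.0},
]

def build_payload(rows):
    """Serialise aggregate rows into a JSON response body."""
    return json.dumps({"data": rows, "count": len(rows)}).encode("utf-8")

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics/orders_daily":
            body = build_payload(DAILY_ORDERS)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve: HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```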
Data Quality and Observability
A pipeline without monitoring is a liability. Implement:
- Schema validation: Assert expected columns and types on ingestion
- dbt tests: not_null, unique, accepted_values, referential integrity
- Row count checks: Alert when a daily load falls more than 30% below the historical average
- Freshness checks: Alert when data is older than expected
- Tools: Monte Carlo, Great Expectations, dbt Cloud observability
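The row count and freshness checks above are simple enough to sketch directly; a dedicated tool adds history tracking and alert routing on top. A minimal sketch (the 30% threshold and 24-hour window mirror the examples above):

```python
from datetime import datetime, timedelta, timezone
from statistics import mean

def row_count_ok(today_count, history, max_drop=0.30):
    """True if today's load is within max_drop of the historical average."""
    if not history:
        return True  # no baseline yet — nothing to compare against
    return today_count >= mean(history) * (1 - max_drop)

def is_fresh(last_loaded_at, max_age=timedelta(hours=24)):
    """True if the most recent load is within the expected freshness window."""
    return datetime.now(timezone.utc) - last_loaded_at <= max_age
```

Wiring these into the orchestrator as a post-load task — failing the run and paging the team when either returns False — covers the most common silent-failure modes: an upstream source quietly dropping rows, or a connector that stopped syncing.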
Pipeline Architecture Example
Postgres DB → Airbyte → Snowflake (raw schema)
Salesforce  → Airbyte → Snowflake (raw schema)
Stripe API  → Airbyte → Snowflake (raw schema)
                            ↓
                      dbt transforms
                            ↓
                Snowflake (analytics schema)
                            ↓
               Metabase / Tableau / Looker
Build your data infrastructure with Ortem. Talk to our data engineering team → or contact us to scope your data pipeline project.
About the Author
Technical Lead, Ortem Technologies
Ravi Jadhav is a Technical Lead at Ortem Technologies with 12 years of experience leading development teams and managing complex software projects. He brings a deep understanding of software engineering best practices, agile methodologies, and scalable system architecture. Ravi is passionate about building high-performing engineering teams and delivering technology solutions that drive measurable results for clients across industries.