Ortem Technologies
    AI & Machine Learning

    How to Build a Data Pipeline: Architecture, Tools & Best Practices (2026)

    Ravi Jadhav · March 7, 2026 · 13 min read
    Quick Answer

    A data pipeline moves data from source systems to destinations where it can be analysed. The core components are: ingestion (extracting from sources), storage (landing in a data lake or warehouse), transformation (cleaning, modelling), and serving (BI tools, ML models). The modern default stack is Airbyte (ingestion) + Snowflake/BigQuery (warehouse) + dbt (transformation) + Airflow or Dagster (orchestration) + Metabase/Tableau (BI). Start with batch pipelines before building streaming — most analytical use cases do not need real-time data.


    What Is a Data Pipeline?

    A data pipeline is an automated system that moves data from where it is generated (source systems: databases, APIs, event streams, SaaS tools) to where it is analysed (data warehouse, data lake, BI tools, ML models).

    The pipeline handles extraction, validation, transformation, loading, and scheduling — automatically, reliably, and with alerting when something breaks.

    Batch vs Streaming Pipelines

    Before designing your pipeline, decide which model you need:

    Batch processing: Data moves in scheduled chunks (hourly, daily). Simpler to build, easier to debug, and sufficient for most analytical use cases; the large majority of business intelligence reporting does not need real-time data.

    Streaming: Data moves continuously as events occur (milliseconds to seconds latency). Required for: real-time fraud detection, live dashboards, event-driven ML models, operational analytics.

    Recommendation: Start with batch. Add streaming only when you can clearly articulate a business decision that requires sub-minute data freshness.

    Core Pipeline Components

    1. Ingestion Layer

    Extracts data from sources and lands it in your storage layer.

    Managed connectors (recommended):

    • Fivetran — 500+ pre-built connectors, fully managed, zero maintenance
    • Airbyte — open-source, 350+ connectors, self-hosted or cloud
    • Stitch — simple, affordable, good for small teams

    Custom ingestion (when connectors don't exist):

    • Python scripts using requests, SQLAlchemy, or vendor SDKs
    • Apache Kafka or AWS Kinesis for streaming sources
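When no connector exists, custom ingestion is usually a short script that pages through an API and lands raw records in the storage layer. A minimal sketch, assuming a hypothetical paginated endpoint and local NDJSON landing (in production the `fetch` callable would wrap `requests.get` and the output would go to S3 or the warehouse's raw schema):

```python
# Minimal custom-ingestion sketch: page through a (hypothetical) REST API
# and land the raw records as newline-delimited JSON, one file per run.
import json
from datetime import datetime, timezone

def ingest(fetch, out_path):
    """Pull all pages and write them as NDJSON to out_path.

    `fetch(page)` returns a list of record dicts, empty when exhausted.
    It is injected so the paging logic can be tested without a live API.
    Returns the total number of records landed.
    """
    page, total = 1, 0
    with open(out_path, "w") as f:
        while True:
            records = fetch(page)
            if not records:
                break
            for rec in records:
                # Stamp each record with extraction time; downstream
                # freshness checks and lineage rely on this.
                rec["_extracted_at"] = datetime.now(timezone.utc).isoformat()
                f.write(json.dumps(rec) + "\n")
            total += len(records)
            page += 1
    return total
```

Injecting `fetch` keeps the loop testable; retries, rate limiting, and incremental cursors (only pulling records changed since the last run) are the pieces a managed connector adds on top.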

    2. Storage Layer

    Where raw data lands.

    Data warehouse (structured, SQL queries, analytics):

    • Snowflake — strongest all-round choice, runs on AWS/Azure/GCP, per-second compute billing
    • BigQuery — best on GCP, serverless, pay-per-query, excellent for large datasets
    • Redshift — good if you are AWS-native and want tight AWS-ecosystem integration

    Data lake (raw files, semi-structured, large scale):

    • AWS S3 + Glue/Athena
    • Azure Data Lake Storage + Synapse
    • Databricks Lakehouse — combines lake and warehouse

    3. Transformation Layer

    Cleans, joins, and models raw data into analytics-ready tables.

    dbt (data build tool) is the standard for warehouse transformations:

    • Write transformations as SELECT statements
    • dbt handles materialisation, dependency ordering, and documentation
    • Integrated testing (not-null, unique, referential integrity checks)
    • Version control your transformation logic in Git
    -- dbt model: orders_daily.sql
    SELECT
        date_trunc('day', created_at) AS order_date,
        COUNT(*) AS total_orders,
        SUM(total_amount) AS revenue,
        AVG(total_amount) AS avg_order_value
    FROM {{ ref('stg_orders') }}
    WHERE status = 'completed'
    GROUP BY 1
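
The integrated tests mentioned above are declared alongside the model in a schema file. A minimal sketch, with column names following the `orders_daily` example:

```yaml
# models/schema.yml — dbt tests for the orders_daily model
version: 2
models:
  - name: orders_daily
    columns:
      - name: order_date
        tests:
          - not_null
          - unique
      - name: revenue
        tests:
          - not_null
```

Running `dbt test` then fails the build if any completed order day is missing, duplicated, or has a null revenue figure.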
    

    4. Orchestration Layer

    Schedules pipeline runs and manages dependencies between steps.

    • Apache Airflow — most mature, large ecosystem, requires operational overhead
    • Prefect — Python-native, excellent developer experience, managed cloud option
    • Dagster — asset-based, great for data teams, built-in data lineage
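All three tools share the same core job: run tasks in dependency order and fail loudly when a step breaks. Stripped of scheduling, retries, and observability, that core is a topological sort over a task graph. A toy sketch in plain Python (not any specific tool's API), using the standard library's `graphlib`:

```python
# Toy orchestrator core: execute callables in dependency order.
# Real orchestrators (Airflow, Prefect, Dagster) add scheduling,
# retries, alerting, and lineage on top of exactly this idea.
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps):
    """tasks: name -> callable; deps: name -> set of upstream task names.

    Returns (execution_order, results). Raises graphlib.CycleError
    if the dependency graph contains a cycle.
    """
    order = list(TopologicalSorter(deps).static_order())
    results = {}
    for name in order:
        # A production orchestrator would retry and alert here.
        results[name] = tasks[name]()
    return order, results

# Hypothetical extract → transform → load chain:
order, results = run_pipeline(
    {"extract": lambda: "raw", "transform": lambda: "clean", "load": lambda: "done"},
    {"extract": set(), "transform": {"extract"}, "load": {"transform"}},
)
```

For the linear chain above, `order` comes back as extract, then transform, then load; the value of asset-based tools like Dagster is making that graph explicit and observable rather than implicit in cron schedules.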

    5. Serving Layer

    Delivers processed data to consumers.

    • BI tools: Metabase (open-source), Looker, Tableau, Power BI
    • ML features: Feature stores (Feast, Tecton)
    • APIs: Expose aggregated data via REST or GraphQL for product features
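Exposing aggregated tables over REST can be very thin, since the heavy lifting already happened in the transformation layer. A minimal sketch using only the standard library, with hypothetical pre-aggregated metrics standing in for a warehouse query (in practice you would use FastAPI or Flask and query the analytics schema):

```python
# Minimal serving-layer sketch: expose pre-aggregated metrics as JSON.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical output of a daily-orders aggregation, as a dbt model
# might produce; a real endpoint would query the warehouse instead.
ORDERS_DAILY = {
    "2026-03-01": {"total_orders": 120, "revenue": 5400.0},
    "2026-03-02": {"total_orders": 95, "revenue": 4275.0},
}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/orders_daily":
            body = json.dumps(ORDERS_DAILY).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        # Silence per-request logging for the sketch.
        pass

def make_server(port=0):
    """Bind the server; port=0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), MetricsHandler)
```

Serving read-only aggregates this way keeps product features off the raw tables, so transformation logic can change without breaking consumers.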

    Data Quality and Observability

    A pipeline without monitoring is a liability. Implement:

    • Schema validation: Assert expected columns and types on ingestion
    • dbt tests: not_null, unique, accepted_values, referential integrity
    • Row count checks: Alert when daily load is 30% below historical average
    • Freshness checks: Alert when data is older than expected
    • Tools: Monte Carlo, Great Expectations, dbt Cloud observability
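The row-count and freshness rules above are simple enough to implement directly before reaching for a dedicated tool. A minimal sketch, with thresholds and function names chosen for illustration:

```python
# Minimal data-quality checks: row-count anomaly detection against a
# historical average, and freshness against a maximum expected age.
from datetime import datetime, timedelta, timezone

def row_count_ok(today_count, historical_counts, threshold=0.30):
    """False when today's load is more than `threshold` (30% by default)
    below the mean of recent daily counts."""
    if not historical_counts:
        return True  # no baseline yet; nothing to compare against
    avg = sum(historical_counts) / len(historical_counts)
    return today_count >= avg * (1 - threshold)

def is_fresh(last_loaded_at, max_age=timedelta(hours=24)):
    """False when the most recent loaded record is older than max_age."""
    return datetime.now(timezone.utc) - last_loaded_at <= max_age
```

Wiring these into the orchestrator as a post-load task, with an alert on any `False`, covers a surprising share of real pipeline failures (silent partial loads, stalled connectors) before investing in Monte Carlo or Great Expectations.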

    Pipeline Architecture Example

    Postgres DB ──┐
    Salesforce  ──┼──→ Airbyte → Snowflake (raw schema)
    Stripe API  ──┘                  ↓
                                dbt transforms
                                     ↓
                          Snowflake (analytics schema)
                                     ↓
                          Metabase / Tableau / Looker

    Build your data infrastructure with Ortem. Talk to our data engineering team → or contact us to scope your data pipeline project.


    Data Pipeline · Data Engineering · Data Architecture · ETL · Modern Data Stack

    About the Author

    Ravi Jadhav

    Technical Lead, Ortem Technologies

    Ravi Jadhav is a Technical Lead at Ortem Technologies with 12 years of experience leading development teams and managing complex software projects. He brings a deep understanding of software engineering best practices, agile methodologies, and scalable system architecture. Ravi is passionate about building high-performing engineering teams and delivering technology solutions that drive measurable results for clients across industries.

    Technical Leadership · Project Management · Software Architecture
