How to Calculate End-to-End Lineage
Computing cross-tool lineage across your data infrastructure
Introduction
When your marketing team looks at monthly revenue in their Tableau dashboard, those numbers didn't just appear. They were produced by joining multiple tables and applying business logic and data transformations. But can you trace exactly where each piece of that data came from and how it was transformed along the way? This is what end-to-end data lineage helps you do.
This article is the third in our deep dive series on data lineage. In the first piece, I talked about the main use cases for data lineage. The second piece was dedicated to calculating the two main types of data lineage: field lineage and dataset lineage. In this article, I cover the third type: end-to-end data lineage, also called “hybrid lineage”.
End-to-end lineage tracks your data's complete journey - from raw warehouse tables, through transformation layers, to final visualization in dashboards.
Why does this matter? As your data infrastructure grows more complex, you'll want to understand these connections for:
- Troubleshooting data quality issues
- Assessing the impact of schema changes
- Meeting compliance requirements
- Managing dependencies
In this article, I explore how to calculate end-to-end lineage in modern data environments.
Who is this for? This piece is aimed at analytics engineers and data architects who work with multi-tool data stacks and need to understand or implement comprehensive data lineage solutions.
Let's dive in.
I - What is end-to-end data lineage?
While dataset lineage and field lineage track data movement within a single system, end-to-end lineage follows the data across your entire data infrastructure.
Here's what makes end-to-end lineage different from field lineage and dataset lineage:
Cross-Tool Tracking: Instead of just seeing that Table A feeds into Table B, you can see that a column in your Snowflake warehouse feeds into a dbt model, which then powers a specific visualization in your Tableau dashboard.
Mixed Granularity: The lineage connects different levels of detail. It might start at a column level in your warehouse but end at a dashboard level in your BI tool. This is why it is also called "hybrid" lineage.
Tool-Specific Context: Each tool in your infrastructure has a different way of representing data. Looker has LookML models, Tableau has workbooks and worksheets, PowerBI has datasets. End-to-end lineage understands and connects these different representations.
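To make this concrete, here is a minimal sketch of how a hybrid lineage graph could be represented, with each node carrying both a tool and a granularity. The class names, tool labels, and identifiers below are illustrative assumptions, not the schema of any particular lineage tool:

```python
from dataclasses import dataclass

# Hypothetical node/edge model for illustration only; real lineage tools
# each define their own metadata schema.

@dataclass(frozen=True)
class LineageNode:
    tool: str             # e.g. "snowflake", "dbt", "tableau"
    kind: str             # granularity: "column", "table", "model", "dashboard"
    qualified_name: str   # tool-specific identifier

@dataclass(frozen=True)
class LineageEdge:
    source: LineageNode
    target: LineageNode

# A single end-to-end path that mixes granularities across tools:
edges = [
    LineageEdge(
        LineageNode("snowflake", "column", "raw.orders.amount"),
        LineageNode("dbt", "model", "marts.fct_revenue"),
    ),
    LineageEdge(
        LineageNode("dbt", "model", "marts.fct_revenue"),
        LineageNode("tableau", "dashboard", "Monthly Revenue"),
    ),
]
```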
Understanding these characteristics is important because they directly impact how we calculate this type of lineage - which is what I focus on in the following sections.
📌 Key Points:
- End-to-end lineage tracks data across different tools, not just within one system
- Connects different levels of detail (columns to dashboards)
- Each tool in your stack represents data differently
- Understanding these differences is crucial for calculation
II - Calculating end-to-end data lineage
Calculating end-to-end lineage requires a combination of methods, because data moves through systems that, as we have seen, each represent data differently.
As data moves from source systems through ETL/ELT processes into the warehouse, then through transformation layers, and finally into dashboards, each transition needs its own tracking approach. For this reason, we need several methods working together.
A - Tracking Lineage from Source to Warehouse
Core ETL/ELT Lineage
Most organizations load data into their warehouse through ETL/ELT processes. This first step of lineage tracking focuses on understanding how raw data moves from source systems into your warehouse.
Modern data pipelines (using tools like Fivetran, Airbyte, or custom ETL) typically:
- Extract data from source systems (databases, APIs, files)
- Load it into raw tables in your warehouse
- Transform it into clean, analytics-ready tables
Each of these steps creates lineage metadata, which we can capture in two main ways:
ETL Tool Metadata: ETL/ELT tools provide metadata about the operations they perform. They automatically log information about each job's source and destination, while also tracking changes like schema evolution and data volumes over time. If you’re interested in the topic, we’ve hosted a discussion with Fivetran on the matter. You can watch it here.
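As an illustration, here is a minimal sketch of turning exported ETL/ELT job metadata into lineage edges. The job records and their field names (source_system, destination_table, and so on) are placeholders for whatever your ETL tool actually exposes:

```python
# Minimal sketch: converting ETL/ELT job metadata into table-level lineage edges.
# The records and field names below are illustrative assumptions.
jobs = [
    {"source_system": "salesforce", "source_object": "Opportunity",
     "destination_table": "raw.salesforce.opportunity", "rows_loaded": 12045},
    {"source_system": "postgres_app", "source_object": "public.orders",
     "destination_table": "raw.app.orders", "rows_loaded": 88310},
]

# One lineage edge per load job: source object -> raw warehouse table
lineage_edges = [
    {
        "source": f'{job["source_system"]}.{job["source_object"]}',
        "target": job["destination_table"],
        "edge_type": "etl_load",
    }
    for job in jobs
]

for edge in lineage_edges:
    print(edge["source"], "->", edge["target"])
```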
SQL Parsing: The second layer comes from analyzing the SQL transformations that shape our data. By examining these transformation queries, we can trace how raw tables evolve into analytics-ready datasets. Through pattern matching and SQL analysis, we can identify the relationships between source and destination tables, completing our comprehensive source-to-warehouse lineage picture.
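For example, here is a minimal sketch of table-level lineage extraction using the open-source sqlglot parser. The query is illustrative; in practice you would feed in SQL collected from your warehouse's query history or transformation tool logs:

```python
# Minimal sketch: extracting source -> destination table relationships from SQL
# with sqlglot (pip install sqlglot).
import sqlglot
from sqlglot import exp

sql = """
INSERT INTO analytics.daily_revenue
SELECT o.order_date, SUM(p.amount) AS revenue
FROM raw.orders AS o
JOIN raw.payments AS p ON p.order_id = o.id
GROUP BY o.order_date
"""

statement = sqlglot.parse_one(sql)

# The INSERT target is the destination of the lineage edge.
destination = exp.table_name(statement.this)

# Every table referenced in the SELECT is an upstream source.
sources = {exp.table_name(t) for t in statement.expression.find_all(exp.Table)}

for source in sorted(sources):
    print(f"{source} -> {destination}")
# raw.orders -> analytics.daily_revenue
# raw.payments -> analytics.daily_revenue
```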
Stream Processing
While batch ETL/ELT is the most common pattern, some organizations also need to track lineage for real-time data streams. This creates interesting challenges for end-to-end lineage.
Real-time data typically flows through these stages:
- Application produces events to a stream (like Kafka)
- Stream processor may transform or aggregate data
- Warehouse connector loads data into raw tables
- Regular transformation processes clean and model the data
For streaming data, here's what you need to track:
Source to Stream: You need to identify which applications are producing events, track the evolution of event schemas over time, and maintain clear documentation of stream topics and their partitioning strategies. This creates the foundation for reliable streaming lineage.
Stream to Warehouse: The journey from stream to warehouse requires monitoring of how data ultimately lands in your warehouse environment. This includes tracking the frequency of data loads and their associated latency, as well as documenting any transformations that occur during the ingestion process. Understanding this middle layer is key to maintaining data quality and troubleshooting issues.
Integration with Batch Lineage: Perhaps most importantly, you need to understand how streaming and batch data work together in your ecosystem. This means documenting how these two types of data combine, which transformation steps they share, and mapping out which final analytics tables incorporate data from both sources. This holistic view ensures consistent data handling across your pipeline.
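As an illustration of the stream-to-warehouse step, here is a minimal sketch that derives topic-to-table lineage edges from Kafka Connect sink connector configurations. The endpoint URL and the config keys used (topics, table.name.format) are assumptions that vary by deployment and connector type:

```python
# Minimal sketch: topic -> warehouse-table lineage from Kafka Connect sink configs.
# The Connect URL and config key names are placeholders to adapt to your setup.
import requests

CONNECT_URL = "http://localhost:8083"  # placeholder Kafka Connect endpoint

connectors = requests.get(f"{CONNECT_URL}/connectors").json()

edges = []
for name in connectors:
    config = requests.get(f"{CONNECT_URL}/connectors/{name}/config").json()
    if "sink" not in config.get("connector.class", "").lower():
        continue  # only sink connectors load data into the warehouse
    topics = config.get("topics", "").split(",")
    table_format = config.get("table.name.format", "${topic}")  # assumed key
    for topic in filter(None, (t.strip() for t in topics)):
        edges.append({
            "source": f"kafka://{topic}",
            "target": table_format.replace("${topic}", topic),
            "via": name,
        })

for edge in edges:
    print(edge["source"], "->", edge["target"])
```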
By tracking both batch and streaming lineage, you can build a complete picture of how data moves from source systems into your warehouse. This is essential for the next stage: tracking transformations within the warehouse itself.
📌 Key Points:
- ETL/ELT tools provide essential metadata about source-to-warehouse data movement
- Both batch and streaming data require different tracking approaches
- SQL analysis helps complete the lineage picture by revealing transformation patterns
- Stream processing adds complexity by introducing real-time data flows
- Integration between batch and streaming lineage is crucial for complete tracking
B - Lineage Within Your Warehouse
Your data warehouse acts as the central hub where data from different sources comes together and is transformed into analytics-ready assets. I covered the details of warehouse-level lineage calculation in my previous piece on dataset and field lineage. Here, I focus on how warehouse transformations fit into the broader end-to-end picture.
Your warehouse serves three roles in end-to-end lineage:
- Source Integration
  - Consolidates data from multiple upstream sources
  - Standardizes data formats and schemas
  - Creates a single source of truth
- Business Logic Layer
  - Applies transformations that define business metrics
  - Creates reusable analytical datasets
  - Maintains consistent definitions across tools
- Consumer Interface
  - Provides clean data to BI tools
  - Feeds ML models and applications
  - Enables self-service analytics
For the technical details of how we calculate lineage within the warehouse, refer to my previous piece on the matter.
📌 Key Points:
- The warehouse is the central hub connecting sources to consumers
- Business logic in transformations defines your metrics
- Clean, transformed data feeds downstream tools
- Consistent naming and structure enable reliable tracking
- Warehouse lineage connects upstream and downstream lineage
C - Dashboard & Reporting
The last piece of end-to-end lineage tracking focuses on how data flows into dashboards and reports for business consumption.
Understanding BI Components
Most BI tools share three core components that organize and display data:
- Primary Containers (Workbooks): They act as folders that organize related visualizations. They maintain crucial metadata about data source connections and manage refresh schedules to keep your data current. These containers also handle access permissions, ensuring data security, while defining the relationships between different data elements within your reports.
- Visualization Tiles: They form the interface between data and users. These are your charts, graphs, and tables that make data digestible. Each tile can pull from multiple data sources and apply specific transformations to present the information exactly as needed. With custom calculations and formatting, these tiles represent the final point of consumption for business users.
- Data Models (VizModels): They bridge the gap between raw data and visualizations. They act as the translation layer, defining how raw fields should be transformed and aggregated. These models maintain the connections to your data sources while applying business logic and calculations that give your data meaning. They ensure consistency in how metrics are calculated across your entire BI environment.
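One way to work with these shared components programmatically is to represent them in a small normalized model that each BI tool's metadata can be mapped onto. The sketch below is purely illustrative; the class and field names are assumptions, not any tool's API:

```python
# Minimal sketch of a normalized BI component model used to compute lineage
# consistently across Tableau, Looker, and PowerBI. Names are illustrative.
from dataclasses import dataclass, field

@dataclass
class DataModel:             # "VizModel": the translation layer over raw tables
    name: str
    upstream_tables: list[str] = field(default_factory=list)

@dataclass
class VisualizationTile:     # a chart, graph, or table shown to end users
    name: str
    model: DataModel

@dataclass
class Container:             # workbook / dashboard-level grouping
    name: str
    tool: str                # "tableau", "looker", "powerbi"
    tiles: list[VisualizationTile] = field(default_factory=list)

# Example: one Tableau dashboard mapped onto the common model
revenue_model = DataModel("revenue", upstream_tables=["analytics.daily_revenue"])
dashboard = Container(
    name="Monthly Revenue",
    tool="tableau",
    tiles=[VisualizationTile("Revenue by Month", revenue_model)],
)
```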
While these components serve similar purposes across tools, each BI platform implements them differently. Understanding these differences helps you calculate lineage correctly:
- Tableau organizes everything in workbooks, with published data sources handling the connection to raw data
- Looker uses LookML projects and explores, with view files defining the data model
- PowerBI works through workspaces and datasets, with different data connection modes
These structural differences directly impact how we calculate lineage, which we'll explore in the next section on calculation methods.
Lineage calculation methods
When connecting tables to dashboards, there are two main calculation methods: SQL query parsing and API-based source detection.
1. SQL Query Parsing Method
In Tableau workbooks, we parse different elements to build lineage. We start with custom SQL queries embedded in data sources, which directly show table relationships. We also analyze calculated fields that reference multiple tables, creating additional lineage connections. Even when users create joins through Tableau's drag-and-drop interface, these ultimately generate SQL that we can parse for lineage information.
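To make the Tableau case concrete: workbook files (.twb) are XML, so custom SQL can be pulled out and fed into the same SQL parser used for warehouse lineage. The assumption that custom SQL lives in relation elements with type="text" reflects the common .twb layout and should be verified against your own workbooks:

```python
# Minimal sketch: extracting custom SQL from a Tableau workbook (.twb is XML).
# The workbook path and the <relation type="text"> assumption are placeholders.
import xml.etree.ElementTree as ET

tree = ET.parse("monthly_revenue.twb")   # placeholder workbook path

custom_sql_queries = [
    relation.text.strip()
    for relation in tree.iter("relation")
    if relation.get("type") == "text" and relation.text
]

for query in custom_sql_queries:
    print(query[:80], "...")
    # each query can then go through the SQL parsing shown earlier
```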
Looker's approach centers around LookML files. These files contain derived table definitions that specify table relationships and transformations. We parse these files to understand dimension and measure calculations, which show how fields are connected and transformed. Additionally, we analyze templated SQL fragments that define the core data transformations.
PowerBI requires a different parsing approach. We examine the SQL generated in DirectQuery mode (when PowerBI connects directly to databases) to track real-time data connections, and look at the transformations applied in Import mode. These two sources provide information about how data flows through the PowerBI environment.
2. API-Based Source Detection
Modern BI tools often use drag-and-drop interfaces that don't generate explicit SQL. Here, we rely on metadata APIs to understand data flow. This becomes particularly important when tracking how data moves from source tables to final visualizations.
Each platform provides different API capabilities:
- The Tableau Metadata API allows us to explore workbook structure, showing how different components relate to each other. It maps upstream connections, showing where data comes from and how it flows through the system. The API also provides access to custom field definitions, which allows us to understand how fields and metrics are calculated (see the query sketch after this list).
- Looker's Exploration API focuses on LookML components. It shows view file relationships, displaying how different views are connected and co-dependent. The API exposes derived table dependencies, making it clear which tables feed into others. It also enables field-level lineage tracking, which helps us understand how individual fields are used and transformed.
- PowerBI's APIs provide access to different metadata layers. We can access dataflow metadata to understand how data is being processed and transformed. The APIs enable dataset relationship mapping, showing how different datasets are connected and interact. They also expose transformation tracking, showing how data is modified as it moves through the PowerBI environment.
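As an example of API-based detection, here is a minimal sketch that asks the Tableau Metadata API (GraphQL) for each workbook's upstream tables. The server URL and auth token are placeholders, and the exact fields available (upstreamTables and so on) should be checked against your Tableau version's Metadata API schema:

```python
# Minimal sketch: querying the Tableau Metadata API (GraphQL) for upstream tables.
# Server URL, token, and field names are assumptions to validate in your environment.
import requests

TABLEAU_SERVER = "https://tableau.example.com"            # placeholder server
AUTH_TOKEN = "<session token from the REST API sign-in>"  # placeholder token

query = """
{
  workbooks {
    name
    upstreamTables {
      name
      schema
      database { name }
    }
  }
}
"""

response = requests.post(
    f"{TABLEAU_SERVER}/api/metadata/graphql",
    json={"query": query},
    headers={"X-Tableau-Auth": AUTH_TOKEN},
)

# Emit one lineage edge per (upstream table -> workbook) pair
for workbook in response.json()["data"]["workbooks"]:
    for table in workbook["upstreamTables"]:
        print(f'{table["database"]["name"]}.{table["schema"]}.{table["name"]}'
              f' -> {workbook["name"]}')
```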
The challenge lies in normalizing this information across platforms to create consistent lineage records that connect smoothly with the warehouse and ETL tracking. This combination of SQL parsing and API-based detection allows us to complete the end-to-end lineage picture.
📌 Key Points:
- BI tools share common components but implement them differently
- Two main calculation methods: SQL parsing and API-based detection
- Each platform has unique metadata structures
- Need to normalize lineage information across platforms
- Combine multiple approaches for complete dashboard lineage