Dataset Lineage and Field Lineage: How to Compute Them?

Understanding and computing the two types of data lineage

Dataset Lineage and Field Lineage: How to Compute Them?

In my previous piece, I discussed the six main use cases for data lineage. This was the first piece of series where we dive deep into data lineage.

But there’s not just different use cases around lineage. There are also different types of data lineage, that live in different places of your data infrastructure and bring something different to the table. In this article, I focus on explaining what are these different types of lineage, and how they are calculated.

Each type offers insights into how data flows through your systems, transforms, and ultimately shapes your business decisions. In this piece, I first explore dataset lineage, which gives us a high-level view of data movement, and field lineage, which dives into the granular details of data transformation. By the end of this article, you'll have a comprehensive understanding of how to track and leverage these different lineage types in your data ecosystem.

Let’s get to it!

I. Dataset Lineage

We start with the most general type of data lineage; dataset lineage. Dataset lineage is the most high level form of data lineage, as it tracks the relationship between different datasets - as opposed to tracking the relationship between fields and columns. By definition, dataset lineage is less granular and somewhat easier to calculate.

Example of Dataset Lineage - Image courtesy of CastorDoc

This type of lineage focuses on understanding the relationships between data sources (such as tables) and the target objects (such as dashboards and workbooks) that use this data for reporting and analysis.

Dataset lineage can be divided into several core types based on the objects involved and the direction of data flow:

  1. Table-to-Table Lineage
  2. Table-to-Dashboard Lineage
  3. Dashboard-to-Dashboard Lineage

We look at each of these core types in turn.

1. Table-to-Table Lineage

Table-to-table lineage involves tracking the relationships between tables within a data warehouse. This type of lineage shows how different tables are interconnected through data processing tasks like ETL (Extract, Transform, Load) operations. It maps out how raw data from source tables is transformed and aggregated to create new tables, which are then used for analysis or as intermediate steps for further processing.

Computation Method

  • SQL Parsing: The primary method for establishing table-to-table lineage is through SQL parsing. SQL queries often define how new tables are created from existing ones by using statements such as "CREATE TABLE" or "SELECT INTO." By parsing these queries, lineage tools can detect which tables are being referenced as sources and which table is being created as a destination.
  • Pattern Recognition: SQL parsing involves looking for specific keywords and structures within the query, such as "FROM," "JOIN," and "UNION," followed by table names. This pattern recognition helps identify the parent (source) tables and the child (destination) table in the lineage.

A common example is when a sales data table is used to generate a summary table that aggregates sales by region or product. By parsing the SQL query responsible for this transformation, a lineage tool can trace back from the summary table to the original sales data table, capturing the flow and transformation of data.

2. Table-to-Dashboard Lineage

Table-to-dashboard lineage captures how data moves from tables in the data warehouse into dashboards used for reporting and visualization. This type of lineage helps map the relationship between data sources (tables) and the dashboards that present this data to end-users.

Computation Method

  • SQL Query-Based Lineage: In many cases, dashboards are populated by direct SQL queries that extract and transform data from tables. By parsing these queries, tools can link the tables used in the queries to the dashboards that display the results. This is the same method we use for computing table-to-table lineage.
  • Drag-and-Drop Interfaces: Some data visualization tools, like Tableau and Metabase, allow users to create dashboards using drag-and-drop interfaces without writing explicit SQL queries. In such cases, we compute the lineage by querying the API of the visualization tool to identify which tables are being used as data sources for the dashboards.

In a Tableau setup, a data source might directly reference a table in the warehouse. This table is then used to create a dashboard that visualizes sales trends over time. The lineage captures this connection, showing how raw data from the table feeds into the visual dashboard.

3. Dashboard-to-Dashboard Lineage

Dashboard-to-dashboard lineage represents the flow of data between various dashboard objects in business intelligence (BI) tools. It essentially maps out how data moves from one visual representation to another. It’s often an overlooked type of lineage, but it’s essential for understanding how data flows between different dashboard objects within visualization tools. It typically occurs in two main scenarios:

  1. Data Source to Visualization Objects: In tools like Tableau, this lineage shows how data flows from initial data sources to workbooks or dashboards for visualization.
  2. Aggregated Dashboards: In tools like Metabase, this lineage demonstrates how individual visualizations ("cards") contribute to larger, comprehensive dashboards.

Computation Method

The process of computing dashboard-to-dashboard lineage primarily relies on leveraging the APIs of visualization tools. The process goes as follows:

  1. Leverage BI Tool APIs:
    • For data source to visualization scenarios: Query the API to identify the data source used by a specific dashboard.
    • For aggregated dashboards: Use the API to determine which cards or components make up a particular dashboard.
  2. Translate API Responses: Convert the information obtained from the API into meaningful lineage connections within your lineage tool. This involves creating links between dashboards, cards, and data sources as appropriate.
  3. Follow the Data Flow: Remember to represent the lineage in terms of data flow. For instance, even if a dashboard is "composed of" cards in the BI tool, represent this as cards feeding into the dashboard in your lineage.
  4. Adapt to Tool Specifics: Be mindful of the unique structures and concepts used by different BI tools (e.g., Tableau's data sources and workbooks vs. Metabase's cards and dashboards). Adjust your lineage computation approach accordingly.

Example

In Metabase, a dashboard might consist of multiple cards, each representing a different data visualization. Some cards might be linked to other dashboards, allowing users to drill down into more detailed views. The lineage mapping would capture these connections, showing how data flows from one dashboard to another.

II. Field Lineage

Field level lineage - Image courtesy of CastorDoc

Field lineage traces data at a more granular level, specifically focusing on the relationships and transformations between individual fields (or columns) within tables and between fields in data visualization tools. Unlike dataset lineage, which deals with broader connections between tables or dashboards, field lineage offers a more detailed view, showing how specific data points are derived, transformed, and used across different data environments.

Field lineage can be categorized into three primary types based on the flow and interaction of data:

  1. Column-to-Column Lineage
  2. Column-to-Field Lineage
  3. Field-to-Field Lineage

1. Column-to-Column Lineage

Column-to-column lineage is the most granular form of data lineage. It tracks how individual columns within tables relate to each other, showing precisely how data points are manipulated, aggregated, or transformed during processing. This level of detail is crucial for understanding the exact journey of your data, especially in complex data warehouse environments.

Computation Method

The key to unraveling column-to-column lineage lies in SQL parsing. The overall process looks something like this:

  1. We start with the SQL query that defines the data transformation.
  2. Using a specialized SQL parsing library, we analyze the structure of the query.
  3. We identify the destination table and columns, typically found in INSERT INTO or CREATE TABLE statements.
  4. We then dissect the SELECT statement to understand which source columns contribute to each destination column.
  5. For complex queries involving subqueries, JOINs, or UNIONs, we perform a more intricate analysis to trace data flow accurately.
  6. We start with the SQL query that defines the data transformation.
  7. Using a specialized SQL parsing library, we analyze the structure of the query.
  8. We identify the destination table and columns, typically found in INSERT INTO or CREATE TABLE statements.
  9. We then dissect the SELECT statement to understand which source columns contribute to each destination column.
  10. For complex queries involving subqueries, JOINs, or UNIONs, we perform a more intricate analysis to trace data flow accurately.

Example

Let's consider a simple example:

In this case, our lineage would show:

  • foo_table.column3 -> base_table.column1
  • foo_table.column4 -> base_table.column2

However, real-world scenarios are often more complex. Consider a query with a UNION:

Here, our lineage becomes more intricate:

  • foo_table.column3 -> base_table.column1
  • foo_table.column4 -> base_table.column2
  • bar_table.columnA -> base_table.column1
  • bar_table.columnB -> base_table.column2

As you can see, column-to-column lineage provides a detailed view of data transformation. This level of granularity is really important for data governance, impact analysis, and ensuring data quality throughout your data pipeline.

Parsing complex SQL queries can be challenging, especially when dealing with nested subqueries, complex joins, or advanced SQL features. Lineage tools need to accurately interpret these queries to ensure correct lineage mapping.

2. Column-to-Field Lineage

As we delve deeper into data lineage, we encounter a transition point: the moment when data moves from our data warehouses into our visualization tools. This is where we start looking at column-to-field lineage.

Column-to-field lineage tracks the transition of data from columns within tables in a data warehouse to fields within data visualization tools. It provides a bridge between the structured data storage environment and the visual representation of data, showing how raw data points are brought into visualization contexts.

Computation Method

  • SQL Query Parsing: If the data visualization tool uses SQL queries to connect to the data warehouse, lineage can be established by parsing these SQL queries. The queries reveal which columns from the tables are being pulled into the visualization tool, linking them directly to fields in the tool.
  • API Integration: In cases where data visualization is created using drag-and-drop interfaces, APIs of the visualization tools are queried to understand the data sources being utilized. The API provides information about which columns are mapped to which fields, establishing the lineage.
  • Mapping Column Names: When parsing queries or using API data, the lineage is established by matching column names from the data warehouse tables with the field names in the visualization tools. This helps create direct links between the data source and its representation in the visual tool.

Example

Let's consider a scenario with a Tableau dashboard. Suppose we have a sales data table in our data warehouse with columns like 'date', 'product', 'region', and 'sales_amount'. When this data is brought into Tableau, the process might look like this:

  1. If using a SQL query:

Here, we'd establish lineage as:

  • sales_table.date -> Tableau_dashboard.date
  • sales_table.product -> Tableau_dashboard.product
  • sales_table.region -> Tableau_dashboard.region
  • sales_table.sales_amount -> Tableau_dashboard.total_sales
  1. If using a drag-and-drop interface: The Tableau API would inform us that the 'Sales Dashboard' is using the 'sales_table', and specifically using the 'date', 'product', 'region', and 'sales_amount' columns. We'd then create the same lineage connections as above.

Column-to-field lineage is crucial for maintaining data integrity and traceability as it moves from storage to visualization. It helps analysts and decision-makers understand where the data in their dashboards comes from, how it has been transformed, and ensures that visualizations are based on the correct and most current data.

3. Field-to-Field Lineage

Field-to-field lineage is concerned with tracing the flow of data between fields within different data visualization components. This type of lineage is less common but critical in scenarios where data visualization tools allow fields from one data source to be reused or transformed into another data source or visualization.

Computation Method

The method for computing field-to-field lineage varies depending on the visualization tool. Let's examine two popular BI tools with different approaches:

  1. Looker
    • LookML Parsing: Looker uses LookML files to define "explores" (Looker's term for data sources) and their relationships.
    • Structure Analysis: The lineage tool must parse these LookML files, understand their structure, and deduce relationships between explores.
    • Explore-to-Explore Tracing: Looker's modular approach allows explores to feed into each other, requiring the lineage tool to trace these connections.
    • Example: In Looker, you might have an explore for 'Sales' that feeds into a more complex 'Sales Performance' explore. The lineage tool would need to parse the LookML files to understand that the 'Total Revenue' field in 'Sales Performance' is derived from the 'Revenue' field in the 'Sales' explore.
  2. Tableau
    • Calculated Field Analysis: Tableau allows users to create calculated fields within workbooks.
    • Workbook XML Parsing: Tableau workbooks can be saved as XML files, which contain information about field relationships.
    • Visual Model Interpretation: For some visualizations, the tool may need to interpret how fields are used in the visual model.
  3. Example: In a Tableau workbook, you might have a calculated field 'Profit Ratio' that uses the fields 'Revenue' and 'Cost'. The lineage tool would need to parse the workbook XML to understand that 'Profit Ratio' is derived from these two fields.

Conclusion

Dataset lineage and field lineage serve distinct purposes in data management. Dataset lineage provides a high-level view of data flow between tables and dashboards, crucial for understanding overall data movement. Field lineage offers granular insight into data transformations at the column level, essential for detailed analysis and troubleshooting.

Both types of lineage are computed through SQL parsing, API integrations, and pattern recognition. The choice between them depends on the specific use case and level of detail required.

Implementing both forms of lineage enables organizations to maintain data integrity, ensure compliance, and make informed decisions about data usage. As data ecosystems grow more complex, a solid grasp of these lineage types becomes increasingly vital for effective data governance and management.

CastorDoc offers comprehensive data lineage tracking across various sources, covering both dataset and field levels. It provides a robust solution for companies seeking to implement an effective lineage solution. To see it in action, chat with the CastorDoc team.

New Release
Share

Get in Touch to Learn More

See Why Users Love CastorDoc
Fantastic tool for data discovery and documentation

“[I like] The easy to use interface and the speed of finding the relevant assets that you're looking for in your database. I also really enjoy the score given to each table, [which] lets you prioritize the results of your queries by how often certain data is used.” - Michal P., Head of Data