How Do Data Discovery and Observability Interact?
Data observability is the back-end, Data Discovery is the front-end
Data observability tools and data catalogs have become vital components of the modern data stack. This hardly comes as a surprise. Data volumes are exploding, and the tools used to move and manipulate that data are multiplying at an astonishing speed. The more data and tools there are, the harder it is to find, understand, and trust the data you have. Two key issues have recently brought data observability and data discovery solutions to center stage:
- Data pipelines, in charge of moving data from one system to another, have become extremely complex. As data manipulation tools multiply, data pipelines run with a higher number of interconnected parts. Data is more likely to break throughout the pipelines, and it becomes harder to locate where failures occur or understand how to fix the problems. As a result, data engineers spend days trying to locate data issues, and levels of trust in the data plummet across organizations.
- As the number of datasets in organizations grows exponentially, documentation becomes an overwhelmingly demanding task. It is simply impossible to manually document thousands of tables and millions of columns. As a result, data users spend more time trying to find and understand data than producing value-generating data analysis.
New problems call for new solutions: data observability tackles the first problem, and data discovery the second. This article explains what these tools are and how they can work in tandem to bring trust to your company's data.
Data observability & data discovery: what are they?
Data observability refers to an organization's ability to fully understand the health of the data in its system. Observability enables constant monitoring of the behavior of the data in data pipelines (such as data volume, distribution, and freshness). Based on these insights, it creates a holistic understanding of a system, its health, and its performance. Having "observability" over your data pipelines means your data team can quickly and accurately identify and prevent problems in the data. Data observability tools combine features like instrumentation, monitoring, anomaly detection, alerting, root cause analysis, and workflow automation. This helps the data team find, fix, and prevent problems from messing up their analytics and ML models.
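To make "monitoring the behavior of the data" concrete, here is a minimal sketch of what a freshness and volume check could look like under the hood. The `analytics.orders` table, the `run_query` helper, and the thresholds are all hypothetical; a real observability platform tracks many more metrics and learns anomaly thresholds automatically instead of relying on hard-coded limits.

```python
# Minimal sketch of a freshness and row-count check, assuming a hypothetical
# `run_query` helper that executes SQL against the warehouse and returns one
# row as a tuple. Table name and thresholds are illustrative only.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=2)   # data should land at least every 2 hours
MIN_EXPECTED_ROWS = 10_000           # rough lower bound on today's volume

def check_orders_table(run_query):
    latest_ts, rows_today = run_query("""
        SELECT MAX(loaded_at),
               COUNT(*) FILTER (WHERE loaded_at >= CURRENT_DATE)
        FROM analytics.orders
    """)

    alerts = []
    if datetime.now(timezone.utc) - latest_ts > FRESHNESS_SLA:
        alerts.append(f"analytics.orders is stale: last load at {latest_ts}")
    if rows_today < MIN_EXPECTED_ROWS:
        alerts.append(f"analytics.orders volume looks low: {rows_today} rows today")
    return alerts  # an observability tool would route these to Slack or PagerDuty
```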
Data discovery tools empower anyone who uses data to find and understand the data they need quickly and accurately. Discovery-focused catalogs (as opposed to the control-focused catalogs common in enterprises that need to carefully manage exactly who has access to what data, and for how long) combine search, popularity ranking, query history, lineage, tagging, Q&A history, and other features across all the data in the company's warehouse. This helps the data team move faster, with less confusion and less risk of misuse.
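As one illustration of how those signals come together, the sketch below derives a popularity ranking from query history: tables that show up in more recent queries rank higher. The query-log structure here is an assumption; real catalogs parse the warehouse's actual query metadata.

```python
# Rough sketch of popularity ranking derived from query history.
# `query_log` is an assumed structure: one dict per query, listing the
# tables that query touched.
from collections import Counter

def rank_tables_by_popularity(query_log):
    """Count how many recent queries touched each table."""
    counts = Counter()
    for entry in query_log:
        counts.update(set(entry["tables"]))  # count each table once per query
    return counts.most_common()

query_log = [
    {"user": "ana", "tables": ["analytics.orders", "analytics.users"]},
    {"user": "ben", "tables": ["analytics.orders"]},
]
print(rank_tables_by_popularity(query_log))
# [('analytics.orders', 2), ('analytics.users', 1)]
```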
But why do “quickly” and “accurately” matter so much?
Quickly matters because many of the people who work with data are hard to find, expensive to employ, and overloaded with work every week. Shaving off the 10 minutes it takes to ping people on Slack, find and read through outdated docs, or run a handful of repetitive SELECT * queries adds up, especially when everyone is quietly doing it 5, 10, or 20 times a week. If you have a 10-person data team that costs $900,000 per year in total, you could easily be losing $70,000 per year, and that figure grows as your team scales up.
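The arithmetic behind that estimate is easy to reproduce. The sketch below is a back-of-the-envelope calculation under assumed inputs (10 minutes per lookup, 5 to 20 lookups per person per week, roughly 48 working weeks, and $900,000 of total team payroll); the exact figure depends entirely on those assumptions.

```python
# Back-of-the-envelope cost of time lost to finding and understanding data.
# Every input here is an assumption for illustration, not a measured value.
TEAM_SIZE = 10
TEAM_COST_PER_YEAR = 900_000                     # total payroll for the team
WORK_WEEKS = 48
HOURLY_RATE = TEAM_COST_PER_YEAR / (TEAM_SIZE * WORK_WEEKS * 40)  # ~ $47/hour

MINUTES_PER_LOOKUP = 10
for lookups_per_week in (5, 10, 20):
    hours_lost = TEAM_SIZE * lookups_per_week * MINUTES_PER_LOOKUP / 60 * WORK_WEEKS
    print(f"{lookups_per_week:>2} lookups/week -> ~${hours_lost * HOURLY_RATE:,.0f}/year")
# 5 -> ~$18,750, 10 -> ~$37,500, 20 -> ~$75,000 lost per year
```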
Accurately matters because when team members are doing real work for the business, the correctness of that work depends on whether they used the data correctly. Someone doing analysis to inform the year's growth spending might be forced to guess which of three timestamp columns to use: the time an event was logged on the user's mobile device, the time the event hit the company's gateway, or the time the record was written into the database. Except the columns might be named things that made sense to the engineer who logged them but are hard for everyone else to understand: ts, created_at, and local_time. The final analysis then rests on a guess.
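This kind of ambiguity is exactly what lightweight column documentation removes. The snippet below shows a hypothetical data-dictionary entry for those three columns; which name maps to which meaning is precisely what's unclear without documentation, so the mapping shown is just one illustrative possibility.

```python
# Hypothetical data-dictionary entry a catalog might surface for the three
# timestamp columns. The name-to-meaning mapping is illustrative only.
EVENTS_COLUMN_DOCS = {
    "ts": "Time the event was logged on the user's mobile device (device clock).",
    "created_at": "Time the event hit the company's gateway (server clock, UTC).",
    "local_time": "Time the record was written into the database.",
}
```

With descriptions like these attached to the table in the catalog, the analyst picks the right column in seconds instead of guessing.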
On the need for a comprehensive solution
Speed and accuracy are complex problems. Teams that put off solving them fully may still get some of the benefits of manual documentation or basic pipeline test coverage, but they miss out on the productivity gains that come from knowing both problems are being solved comprehensively. In fact, the value of speed and accuracy is convex as a function of investment, which means you should either go all in or not do it at all; half-hearted efforts won't get you anywhere. This is what drove some of the best-funded data teams in the world, at companies like Uber, Airbnb, Lyft, and LinkedIn, to build not just hacky partial solutions but full-blown internal products for both of these problem areas.
When teams have both of these challenges solved comprehensively, they get a magic bonus feature: trust.
Trust is that nebulous thing that's hard to quantify but that every organization deeply wants. Nobody likes questioning the data they're using to decide whether to keep spending $50,000 per month on an ad campaign, whether the new product feature actually improved engagement by 11%, or whether the model predictions users see have gone haywire.
💡 But to have trust, the organization needs to know two things: are we using the data the right way, and is the data actually working the way it's supposed to?
Discovery products like Castor are solving the former, and observability products like Bigeye are solving the latter. But what happens when you have both?
At the planet-scale companies that have built these products internally, like Uber, Airbnb, Netflix, and LinkedIn, they're often deeply integrated—look at their blog posts about Databook, Metacat, and Datahub for examples. This gives the data team all the metadata they need to know that they're using the data the right way, that the data is working as expected, and it does all that without adding friction to their workflow.
Now the data scientist knows which of the three timestamp columns to use. And they know that for the last 90 days there haven't been any nulls, dupes, or future values hiding in there that would mess up their analysis. And they know all of this in about 10 seconds thanks to a well-coupled workflow.
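For concreteness, here is a sketch of the kind of checks an observability tool might run continuously to back that claim: nulls, duplicates, and future-dated values over the last 90 days. The table, column names, and `run_query` helper follow the earlier hypothetical example and are assumptions, not any specific vendor's API.

```python
# Sketch of recurring data-quality checks, expressed as plain SQL and run
# through the same hypothetical `run_query` helper as before.
CHECKS = {
    "null_timestamps": """
        SELECT COUNT(*) FROM analytics.events
        WHERE loaded_at >= CURRENT_DATE - INTERVAL '90 days'
          AND created_at IS NULL
    """,
    "duplicate_events": """
        SELECT COUNT(*) FROM (
            SELECT event_id FROM analytics.events
            WHERE loaded_at >= CURRENT_DATE - INTERVAL '90 days'
            GROUP BY event_id HAVING COUNT(*) > 1
        ) dupes
    """,
    "future_timestamps": """
        SELECT COUNT(*) FROM analytics.events
        WHERE loaded_at >= CURRENT_DATE - INTERVAL '90 days'
          AND created_at > CURRENT_TIMESTAMP
    """,
}

def run_checks(run_query):
    """Return only the checks that found offending rows; empty means clean."""
    return {name: count for name, sql in CHECKS.items()
            if (count := run_query(sql)[0]) > 0}
```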
Why do you need a standalone data catalog and data observability platform?
You've probably figured it out by now: tackling the speed and accuracy problems together, in a comprehensive manner, will get you where you want to be with your data strategy. But a question remains: does solving these challenges comprehensively mean going for a single, all-in-one tool for data observability and data cataloging, or should you invest in two separate tools that integrate seamlessly? We recommend going for two separate tools, and here's why.
The discovery and observability problems are distinct in that the primary users of data discovery solutions are different from the primary users of observability platforms. Data analysts, product scientists, and other data consumers care about data discovery because they need to find, understand, and use data to do their work. Data engineers, on the other hand, care more about observability because they're responsible for fixing and preventing issues in data pipelines to keep the data reliable.
These two profiles have fundamentally different workflows. Data engineers want to capture data quality information in tools like PagerDuty or in their own internal systems, while data analysts want to see this information surfaced in the data catalog. Even in terms of UI, data engineers and data analysts have different expectations. A standalone data observability platform can cater to a data engineer's workflows, while a standalone data catalog can cater to the workflows of analysts and business users, without compromising either.
Data discovery and data observability are two complex problems that take real expertise to solve effectively. For this reason, many data teams will be better off combining best-of-breed solutions to each problem, so long as the two solutions integrate seamlessly.
Final thoughts
It's extremely difficult to leverage your data without data discovery and data observability, and it will only become more so as organizations collect ever-greater volumes of data and as data pipelines grow more complex. Solving the speed and accuracy problems simultaneously will establish deep-rooted trust in the data assets you own. The best way to solve both problems comprehensively is to combine best-of-breed discovery and observability solutions that integrate with each other. Should you need any help choosing, we've put together benchmarks of the observability and discovery solutions out there.
About us
We write about all the processes involved in leveraging data assets: from the modern data stack to data team composition to data governance. Our blog covers both the technical and the less technical aspects of creating tangible value from data.
At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. We designed our catalog software to be easy to use, delightful and friendly.
Want to check it out? Try our Data Discovery tool with a free demo!