DocMaster - Auto Document Your Data Warehouse
Isn't it the dream of any data person to have well-documented data? No more expensive mistakes, no more "is payment in € or $?"
Every time you ask a data engineer or data scientist, "Would you like to have all your data documented?", the answer is "YES!" A big, loud "YES!".
The next natural question is "Do you want to document it yourself?", and this time the answers are less enthusiastic: "hmm, no", "not really", or even "I don't even want to come close to this!"
Enter DocMaster
What is DocMaster?
DocMaster is a feature of Castor that automatically documents your tables and columns coming from common data sources (Salesforce, Google Analytics, Zendesk, etc). We created a documentation repository with high-quality definitions for the fields of popular data sources. Now, when we get access to your metadata, we can identify the tables and fields coming from those sources with 99% accuracy and document them for you.
DocMaster gathers documentation from many sources (like Salesforce, Google Ads, Zendesk, Marketo, Jira, etc). It locates the tables linked to these sources in your data warehouse (BigQuery, Redshift, Snowflake, etc) and automatically adds metadata to them.
For example, in the screenshot below, the table "account" from the "Salesforce" schema is synced through ETL tools like Fivetran or Airbyte. Documentation for this "salesforce.account" table is available online. We auto-populate it with DocMaster.
And we do the same for 61 sources. Below are some of the sources we are currently able to document for you.
DocMaster's vision is to bring as much documentation as possible in an automated yet 100% accurate way. We will be adding new sources every month and expanding our repository.
Some Stats on DocMaster
Completeness
DocMaster is ready to document up to 27% of your data warehouse in a few minutes. At Castor, we have a 5:1 ratio of automatically documented data to manually documented data.
Accuracy
DocMaster is designed to be always right. The system is conservative by design, which means that false positives (adding a wrong description for a column or table) are very unlikely. There are multiple levels and mechanisms of validation before a description is approved.
The main idea behind the design was to add a description if and only if all of the following checks pass:
- Perfect matching on column name
We ensure that, after the pre-processing step, the column name corresponds exactly to a term we have defined in DocMaster. If it does, we add the description to the column. Pre-processing removes some special characters ('_', '-') and lowercases everything in order to standardize perfect matching.
Here are a few rules we implemented:
- The column isn't private (ending with _sdc, __c, etc.), i.e., created by the customer or the ETL tool rather than the source.
- The proposed description isn't already used in the same table.
- A term can't match twice in the same table: if there is already a perfect match, any second match is treated as a false positive.
- Almost perfect matching on table name
Table names can differ depending on the ETL provider, so we compute a similarity score based on the table name and all the columns in the table.
- The table name should be close enough to the reference name (not necessarily a perfect match, but almost there).
- The table similarity score must exceed a threshold we set.
- Allowed schema combination
- The source name must appear inside the schema name (e.g. 'facebook' in stg_facebook_ads_reports).
- The average similarity score across all tables in the schema must exceed a minimum.
- The schema must contain a minimum number of columns (10 in our case).
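To make the steps above concrete, here is a minimal Python sketch of how such a validation pipeline could fit together. The repository contents, the similarity function, and the threshold values are illustrative assumptions, not Castor's actual implementation:

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference repository: source -> table -> {term: description}.
REPOSITORY = {
    "salesforce": {
        "account": {
            "id": "Unique identifier of the account.",
            "name": "Name of the account.",
        }
    }
}

TABLE_SIMILARITY_THRESHOLD = 0.8  # assumed value
MIN_SCHEMA_COLUMNS = 10           # from the article


def normalize(name: str) -> str:
    """Pre-processing: strip '_' and '-' and lowercase, to standardize matching."""
    return re.sub(r"[_-]", "", name).lower()


def is_private(column: str) -> bool:
    """Skip customer/ETL-created columns (e.g. ending in _sdc or __c)."""
    return column.endswith("_sdc") or column.endswith("__c")


def table_similarity(table: str, reference: str) -> float:
    """Almost-perfect matching on table names via a simple similarity ratio."""
    return SequenceMatcher(None, normalize(table), normalize(reference)).ratio()


def schema_allowed(source: str, schema: str, n_columns: int) -> bool:
    """The source name must appear in the schema name, with enough columns."""
    return source in schema.lower() and n_columns >= MIN_SCHEMA_COLUMNS


def match_columns(source: str, ref_table: str, table: str, columns: list) -> dict:
    """Return {column: description}, adding a description only when all checks pass."""
    # Almost-perfect match on the table name, gated by a threshold.
    if table_similarity(table, ref_table) < TABLE_SIMILARITY_THRESHOLD:
        return {}
    reference = REPOSITORY[source][ref_table]
    matched, used_terms = {}, set()
    for col in columns:
        if is_private(col):
            continue
        for term, description in reference.items():
            # Perfect match on the normalized name; each term matches at most once.
            if normalize(term) == normalize(col) and term not in used_terms:
                matched[col] = description
                used_terms.add(term)
                break
    return matched
```

For instance, `match_columns("salesforce", "account", "account_v2", ["Id", "name", "custom__c"])` would document `Id` and `name` but skip the customer-created `custom__c` column, and `schema_allowed("facebook", "stg_facebook_ads_reports", 12)` mirrors the schema rule from the list above.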
Here is an example:
Going Further
At Castor, we dream of a world where data is accessible throughout an entire organization, to everyone regardless of their data literacy level. To achieve this goal, we always focus on bringing solutions that do not require effort from our users. Tools like the open-source Amundsen are great. They can have a huge impact (as they did at Lyft), but to explore their full potential, it's necessary to invest a lot more than just deploying them. You need motivation, people, and processes to make a tool like Amundsen stick.
The biggest challenge with Amundsen is the cultural change needed inside your organization. Most of the features need engineering effort to set up and maintain. Then, you need active leadership to push everyone in the company to write documentation. This means that if not everybody gets on board with the idea, the quality of documentation is going to degrade very quickly. You might have noticed already that bad documentation is sometimes worse than no documentation at all.
At Castor, our focus is to maximize impact per minute spent. 10 minutes on Castor is worth two hours on an Excel spreadsheet. We built Castor around 4 core principles:
- Lots of automation
We do a lot of the work for you: finding popular data, declaring lineage, documenting your columns and tables, and more. In short, we let you focus where you really need to.
- Focus on what really matters.
A lot of our features are built around this aspect, such as surfacing the popularity of tables and columns, the most-used SQL queries, etc.
- Collaborative workspace.
You don't need to repeat the work done by your peers.
- Superpower your work.
When documenting your data, we let you propagate the information throughout your entire data warehouse, effectively turning 1 hour of work into 10.
Want to try what I built? More information here.
Victor Athanasio
About me.
I'm Victor, really happy to be part of the Beaver Gang 🦫, and as some of you, also a coffee addict ☕️ .
Originally from Brazil 🇧🇷 I'm now pursuing an engineering double degree in France 🇫🇷. This project was developed in a partnership with CentraleSupélec (Paris) through an internship program.
Here's a gif that I designed and wanted to share: