What if … you had to build a data stack from scratch?

TL;DR: it’s actually a lot of fun


After 7 years working as Senior Data Manager at Criteo and as Head of Data at Payfit, and after interviewing 200+ data leaders for user research, I’m starting to have a good overview of data stacks in action. So, by popular demand, I’m sharing here and there some advice on the trends I’ve observed.

Earlier in March, I had two different calls: one from a former colleague, now head of data at Swile, and one from a friend, now head of data at Cajoo. Both had taken a new job and had to build a complete data stack, from scratch! I figured this could interest others, so here is the advice I gave them.

Before we start, here is a Decision Matrix that you can use for benchmarking purposes (a quick scoring sketch follows the list):

  • Ease of Use
  • Integration in Cloud environment
  • Community & Documentation
  • Governance Capabilities
  • Pricing
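
If you want to make the comparison explicit, here is a minimal scoring sketch for such a matrix. The weights, tool names, and scores are placeholders, not recommendations; fill them in during your own benchmark.

```python
# Minimal decision-matrix sketch: weights and scores are illustrative placeholders.
CRITERIA_WEIGHTS = {
    "ease_of_use": 0.25,
    "cloud_integration": 0.25,
    "community_documentation": 0.20,
    "governance": 0.15,
    "pricing": 0.15,
}

# Scores from 1 (poor) to 5 (excellent) -- replace with your own evaluation.
candidates = {
    "tool_a": {"ease_of_use": 4, "cloud_integration": 5, "community_documentation": 4, "governance": 3, "pricing": 3},
    "tool_b": {"ease_of_use": 5, "cloud_integration": 3, "community_documentation": 5, "governance": 4, "pricing": 4},
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores into a single weighted total."""
    return sum(CRITERIA_WEIGHTS[criterion] * score for criterion, score in scores.items())

for tool, scores in candidates.items():
    print(f"{tool}: {weighted_score(scores):.2f}")
```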

To keep things simple, we’ll just follow the data, from source to reporting.

Data Warehouse

Where you store your data for analytics needs

A data warehouse is where you store and organize your data before analysts and analytics tools use it.

My recommendation is simple:

  • If your dev stack runs on Google Cloud, then go for BigQuery (BQ)
  • If your dev stack runs on AWS, then go for Snowflake (SF)

Here are a few reasons.

  1. Redshift was indeed the first big player in the cloud data warehouse market, but that means it made the mistakes that later entrants could learn from, and it hasn’t changed much over the last few years.
  2. Redshift has a poor UI/UX, especially for end-users (data consumers).
  3. BigQuery and Snowflake are quite similar in terms of features and pricing models (based on usage: BQ bills you by scanned data, Snowflake bills you by query time). A slight difference is that the first one is stronger when the whole stack is run on Google Cloud while the second is slightly more “Analytics” focused, making it a perfect match for AWS users.

Snowflake is heading towards being a full data platform (multi-cloud, Snowpipe, jobs, data share, python/js functions) with a friendly user interface.

BigQuery’s strength is its integration within the Google Cloud ecosystem: read/write from GCS, IAM rights management, billing, native Google Sheets integration, and so on.
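
Since BQ bills you by scanned data, you can estimate a query’s cost before running it with a dry run. Here is a minimal sketch using the google-cloud-bigquery client; the project, table, and the ~$5/TB on-demand price are assumptions, so check your own billing.

```python
# Estimate BigQuery cost with a dry run (no data is actually scanned, nothing is billed).
# Project, table, and the per-TB price below are placeholders -- adjust to your setup.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # hypothetical project

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
query = """
    SELECT user_id, event_name
    FROM `my-analytics-project.raw.events`
    WHERE event_date = '2021-03-01'
"""

job = client.query(query, job_config=job_config)  # returns immediately for dry runs
tb_scanned = job.total_bytes_processed / 1e12
print(f"~{tb_scanned:.4f} TB scanned, roughly ${tb_scanned * 5:.2f} at ~$5/TB on-demand pricing")
```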

Ingestion

Get Data in your Data Warehouse

Stitch and Fivetran extract data from all your business sources and bring it to your Data Warehouse.

For data originating from a SaaS, use StitchData or Fivetran. These two companies do a great job at maintaining these data dumps for their thousands of clients, and their pricing is linear as well as predictable. The sources they can retrieve data from are numerous; check them here and here.

You may want to cherry-pick some tables to ingest yourself, for frequency or other reasons. Make sure to leverage the many open-source taps made available here.
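
If you go down the open-source taps route, the pattern is simply one tap process piped into one target process. A rough sketch is below; tap-github and target-bigquery are just examples of Singer connectors, and the config files are assumed to exist already.

```python
# Run a Singer tap piped into a Singer target (both installed separately, e.g. via pip).
# tap-github / target-bigquery and the config paths are illustrative assumptions.
import subprocess

tap = subprocess.Popen(
    ["tap-github", "--config", "tap_config.json"],
    stdout=subprocess.PIPE,
)
target = subprocess.Popen(
    ["target-bigquery", "--config", "target_config.json"],
    stdin=tap.stdout,
)
tap.stdout.close()  # let the tap receive SIGPIPE if the target exits early
target.wait()
```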

Want a mix between the fully managed and the fully homemade? Check Airbyte. You’ll have to host it but it comes with the necessary bunch of features, sources, and destinations.

Segment provides similar features, depending on the plan you’re on, but focuses more on event tracking; it’s a must-have to sync your marketing campaigns with user activity. Think of Segment as the big ears listening to what is happening on your website/app. It also demands an ongoing tagging plan. An alternative is Contentsquare, which does not integrate as well with other data sources but does not require a tagging plan.
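
For reference, sending a custom event to Segment from your backend is only a few lines with their Python library (analytics-python); the write key, user id, and event below are placeholders.

```python
# Send a server-side event to Segment (pip install analytics-python).
# The write key, user id, and event properties are placeholders.
import analytics

analytics.write_key = "YOUR_SEGMENT_WRITE_KEY"

analytics.track(
    user_id="user_123",
    event="Subscription Started",
    properties={"plan": "pro", "source": "website"},
)
analytics.flush()  # force delivery before the script exits
```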

Do not worry, there is still plenty of data engineering work to do: you’ve got ingestion pipelines to build for all your production data. If your PROD is in NoSQL (such as Mongo), then you have transformation to do before ingesting the data into your data warehouse. If your PROD is already in a SQL format, you can dump all the needed data into your analytics warehouse and move straight on to the next section.
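
For that homemade ingestion of production data, here is a minimal sketch of the “dump a SQL table into the warehouse” case, assuming a Postgres source readable via SQLAlchemy and a BigQuery destination table; the connection string and table names are placeholders.

```python
# Minimal "production Postgres -> BigQuery" dump, meant to be run on a schedule.
# Connection string and table names are placeholders for your own setup.
# Requires: sqlalchemy, psycopg2, pandas, pyarrow, google-cloud-bigquery.
import pandas as pd
import sqlalchemy
from google.cloud import bigquery

engine = sqlalchemy.create_engine("postgresql://user:password@prod-db:5432/app")
df = pd.read_sql(
    "SELECT * FROM public.orders WHERE updated_at >= CURRENT_DATE - 1",
    engine,
)

client = bigquery.Client()
job = client.load_table_from_dataframe(df, "my-analytics-project.raw.orders")
job.result()  # wait for the load job to finish
print(f"Loaded {len(df)} rows into raw.orders")
```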

Transformation

From raw data to model/reports ready

Like any raw material, data needs to be transformed to deliver value. Use DBT to transform your data.

First things first: ETL is dead, long live ELT (note the L is now in the middle). In recent years, storage costs have dropped to the point (< $30 per TB/month) where you can load all your raw data into your data warehouse and then do the transformation in SQL. Keep in mind that storage in SF and BQ is about as cheap as S3 storage; it’s compute that costs money.

Now, putting the spotlight on what has become the obvious tool in 90% of the data teams I’ve talked to: the famous DBT. There is even a new job title that comes with it: Analytics Engineer. DBT is open source, quite easy to set up, and comes with a bunch of interesting features oriented toward quality, such as built-in tests.

The general idea is to load all your data into your warehouse, then use DBT to create your data marts, i.e. refined data, ready to be used for analytics purposes.

As a side note, DBT is not a scheduler; you’ll still have to schedule it. It’s your scheduler that will launch the DBT runs, which naturally leads me to the scheduler section below. One last note: at the end of last year Google acquired Dataform; it’s now free and well integrated with BigQuery, but the community is not that big.

Scheduler

Run job 1 then run job 2

In any transformation process, there is a schedule. Things have to happen in a certain order. Use Airflow for data transformation scheduling.

The market standard for scheduling really boils down to Airflow; the question is which deployment setup you want.

Astronomer, for instance, helps you get off the ground faster, but there is a cost for that…
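
Whichever deployment you pick, the DAG itself looks the same. Here is a minimal sketch of a daily Airflow DAG that runs an ingestion step and then triggers DBT; the ingestion script and the dbt project path are assumptions for illustration.

```python
# Minimal daily Airflow DAG: ingest production data, then run DBT models.
# The script path and dbt project directory are placeholders for your own project.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_analytics",
    start_date=datetime(2021, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_orders = BashOperator(
        task_id="ingest_orders",
        bash_command="python /opt/pipelines/ingest_orders.py",
    )

    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt_project && dbt run",
    )

    ingest_orders >> dbt_run  # run job 1, then run job 2
```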

Data Viz

Make your shiny dashboards

Once you have a high-value dataset, you will want to show the output in the sexiest and most understandable way. Data visualization with Tableau, Looker, or Metabase is the best option.

This one is tough, as competition is fierce within this segment. This will also be the only part of the whole stack you are building that your business users will see. Strong options are Tableau, Looker, or Metabase. I know there are a lot of other options out there; that’s just my opinion.

I’ll write a dedicated article on the matter, but to start, go for Metabase. On the data viz side, it’s simple and straightforward: it takes a few minutes to go from a table to a dashboard. Keep in mind that for a while your dashboards will stay simple, as you need to educate the different teams. Oh, and it’s free as long as you host it yourself.

Data Discovery

Find, understand and use data faster

High-value datasets or analyses are useless if no one knows they exist. Make sure people do. Castor is the tool for that.

Do you think that data discovery is only for companies with too much data or too many people? Did you know that after a few data dumps from Stitch or Segment, you’ll already have hundreds of tables?
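
To see how quickly this happens, you can count what already lives in your warehouse. Here is a minimal sketch for BigQuery (the project name is a placeholder; swap in the Snowflake connector if that’s your warehouse):

```python
# Count how many tables already live in your warehouse (BigQuery flavour).
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # placeholder project

total = 0
for dataset in client.list_datasets():
    tables = list(client.list_tables(dataset.dataset_id))
    total += len(tables)
    print(f"{dataset.dataset_id}: {len(tables)} tables")

print(f"Total: {total} tables to document")
```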

On top of that, I’m quite sure that the first data analyst or engineer you hire will keep their own data documentation in a Google Doc or somewhere similar. Wouldn’t it be better to have that information accessible to everyone, and especially to the second hire?

At Castor, we provide just that!

About us

We write about all the processes involved in leveraging data assets: from the modern data stack to data team composition, to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data.

At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. We designed our catalog software to be easy to use, delightful and friendly.

Want to check it out? Try our data catalog tool for free.
