What if … you had to build a data stack from scratch?
TLDR: it’s actually a lot of fun
After 7 years working as Senior Data Manager at Criteo and as Head of Data at Payfit, and after interviewing 200+ data leaders for user research, I'm starting to have a good overview of data stacks in action. So, by popular demand, I'm sharing, here and there, some advice on the trends I've observed.
Earlier in March, I had two different calls: one from a former colleague, now Head of Data at Swile, and one from a friend, now Head of Data at Cajoo. Both had just taken a new job and had to build a complete data stack, from scratch! I figured this could interest others, so here is the advice I gave them.
Before we start, here is a decision matrix you can use to benchmark each option:
- Ease of Use
- Integration in Cloud environment
- Community & Documentation
- Governance Capabilities
- Pricing
To keep things simple, we’ll just follow the data, from source to reporting.
Data Warehouse
Where you store your data for analytics needs
My recommendation is simple:
- If your dev stack runs on Google Cloud, then go for BigQuery (BQ)
- If your dev stack runs on AWS, then go for Snowflake (SF)
Here are a few reasons.
- Redshift was indeed the first big player in the cloud data warehouse market, but being first means they made mistakes that later entrants could avoid, and the product hasn't changed much over the last few years.
- Redshift has a poor UI/UX, especially for end-users (data consumers).
- BigQuery and Snowflake are quite similar in terms of features and pricing models (both usage-based: BQ bills you by data scanned, Snowflake bills you by query time). A slight difference is that the former is stronger when the whole stack runs on Google Cloud, while the latter is slightly more analytics-focused, making it a perfect match for AWS users.
Snowflake is heading towards being a full data platform (multi-cloud, Snowpipe, jobs, data share, python/js functions) with a friendly user interface.
BigQuery's strength is its integration with the Google Cloud ecosystem (read/write from GCS, IAM rights management, billing, native Google Sheets integration, …).
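To make the "billed by scanned data" point concrete, here is a minimal sketch using the google-cloud-bigquery Python client to estimate a query's cost with a dry run before actually running it. The project, dataset, and table names are made up for illustration.

```python
# Minimal sketch: estimate what a BigQuery query would cost before running it.
# Assumes google-cloud-bigquery is installed and GCP credentials are configured;
# the project/dataset/table names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT user_id, COUNT(*) AS orders
    FROM `my_project.analytics.orders`   -- hypothetical table
    GROUP BY user_id
"""

# A dry run reports the bytes that would be scanned, without billing you.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

scanned_tb = job.total_bytes_processed / 1024**4
print(f"This query would scan ~{scanned_tb:.4f} TB")
# On-demand pricing bills per TB scanned, so this number is your cost driver.
```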
Ingestion
Get Data in your Data Warehouse
For data originating from a SaaS, use StitchData or Fivetran. These two companies do a great job at maintaining these data dumps for their thousands of clients, and their pricing is linear and predictable. The sources they can retrieve data from are numerous; check the lists here and here.
You may want to cherry-pick some tables and ingest them yourself, for frequency reasons or otherwise. Make sure to leverage the numerous open-source taps made available here.
Want a mix between the fully managed and the fully homemade? Check out Airbyte. You'll have to host it yourself, but it comes with a solid set of features, sources, and destinations.
Segment provides similar features, depending on the plan you're on, but it focuses more on event tracking: it's a must-have for syncing your marketing campaigns with user activity. Think of Segment as the big ears listening to what is happening on your website/app. It also demands an ongoing tagging plan. An alternative is Contentsquare, which does not integrate as well with other data sources but does not require a tagging plan.
Do not worry, there is still plenty of data engineering work to do: you've got ingestion pipelines to build for all your production data. If your production database is NoSQL (such as MongoDB), then you'll have some transformation to do before ingesting the data into your data warehouse. If your production data is already in a SQL format, you can dump all the needed tables into your analytics warehouse and move straight on to the next section.
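For illustration, here is a rough sketch of such a homemade pipeline: read documents from MongoDB, flatten them, and load them into a raw warehouse table. Everything here (connection string, database, collection, and table names) is hypothetical, and it assumes pymongo and google-cloud-bigquery are installed.

```python
# Rough sketch of a homemade ingestion job: flatten MongoDB documents and load
# them into a raw BigQuery table. All names and connection details are made up.
from pymongo import MongoClient
from google.cloud import bigquery

mongo = MongoClient("mongodb://prod-replica:27017")  # read from a replica, not the primary
bq = bigquery.Client()

def flatten(doc: dict) -> dict:
    """Turn a nested Mongo document into a flat, SQL-friendly row."""
    return {
        "order_id": str(doc["_id"]),
        "user_id": doc.get("user", {}).get("id"),
        "amount": doc.get("payment", {}).get("amount"),
        "created_at": doc["created_at"].isoformat() if doc.get("created_at") else None,
    }

rows = [flatten(d) for d in mongo.prod_db.orders.find({}).limit(10_000)]

# Load into a raw table; the transformation proper happens later, in SQL (ELT).
job = bq.load_table_from_json(
    rows,
    "my_project.raw.orders",
    job_config=bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE", autodetect=True),
)
job.result()  # wait for the load job to finish
```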
Transformation
From raw data to model/reports ready
First things first: ETL is dead, long live ELT (note the L is now in the middle). In recent years, storage costs have decreased to the point (< $30 per TB per month) where you can load all your raw data into your data warehouse and then do the transformation in SQL. Keep in mind that storage in SF and BQ is about as cheap as S3 storage; it's compute that costs money.
Now, putting the spotlight on what has become the obvious tool in 90% of the data teams I've talked to: the famous DBT. There is even a new job title that comes with it: Analytics Engineer. DBT is open source, quite easy to set up, and comes with a bunch of interesting features oriented toward quality, such as built-in data tests.
The general idea is to load all your data into your warehouse, then use DBT to create your datamart, i.e. refined data ready to be used for analytics purposes.
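To show what "transformation in SQL, inside the warehouse" boils down to, here is a minimal sketch that builds a hypothetical datamart table from a raw table using the BigQuery Python client. In practice you would put this SELECT in a DBT model and let DBT handle materialization, dependencies, and tests; this just illustrates the raw idea.

```python
# What a transformation boils down to in ELT: SQL executed inside the warehouse,
# turning raw tables into a refined datamart table. Table names are hypothetical;
# in practice this SELECT would live in a DBT model.
from google.cloud import bigquery

client = bigquery.Client()

datamart_sql = """
CREATE OR REPLACE TABLE `my_project.datamart.daily_revenue` AS
SELECT
  DATE(created_at)        AS day,
  COUNT(DISTINCT user_id) AS paying_users,
  SUM(amount)             AS revenue
FROM `my_project.raw.orders`
GROUP BY day
"""

client.query(datamart_sql).result()  # the warehouse does the heavy lifting, not your machine
```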
As a side note, DBT is not a scheduler; you'll still have to schedule it. It's your scheduler that will launch the DBT runs… which naturally leads me to the scheduler section below. One last note: at the end of last year, Google acquired Dataform; it's now free and well integrated with BigQuery, but the community is not that big.
Scheduler
Run job 1 then run job 2
The market standard for scheduling really boils down to Airflow; the question is which deployment setup you want. You can go for:
- Self-managed: Dockerized Airflow
- Managed: the AWS version (MWAA) or the GCP version (Cloud Composer)
- Fully managed: Astronomer
Astronomer helps you get off the ground faster. But there is a cost for that…
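Whichever flavor you pick, the DAG itself looks the same. Here is a minimal Airflow 2.x sketch of the "run job 1 then run job 2" idea: an ingestion task followed by a DBT run. The script and project paths are placeholders for whatever you actually deploy.

```python
# Minimal Airflow 2.x DAG: run the ingestion job, then the DBT transformations.
# The commands and paths are placeholders, not a prescribed layout.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",  # job 1 then job 2, every day
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_production_data",
        bash_command="python /opt/pipelines/ingest_orders.py",
    )
    transform = BashOperator(
        task_id="run_dbt_models",
        bash_command="cd /opt/dbt_project && dbt run",
    )

    ingest >> transform  # ingestion must finish before DBT runs
```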
Data Viz
Make your shiny dashboards
This one is tough, as competition is fierce in this segment. It will also be the only part of the whole stack you're building that your business users actually see. Strong options are Tableau, Looker, or Metabase. I know there are a lot of other options out there; this is just my opinion.
I'll write a dedicated article on the matter, but to start, go for Metabase. On the data viz side, it's simple and straightforward: it takes a few minutes to go from a table to a dashboard. Keep in mind that for a while your dashboards will be simple, as you need to educate the different teams. Oh, and it's free as long as you host it yourself.
Data Discovery
Find, understand and use data faster
Do you think data discovery is only for companies with too much data or too many people? Did you know that after just a few data dumps from Stitch or Segment, you'll already have hundreds of tables?
On top of that, I'm quite sure the first data analyst or engineer you hire will keep their own data documentation in a Google Doc or the like. Wouldn't it be better to have that information accessible to everyone, and especially to the second hire?
At Castor, we provide just that!
About us
We write about all the processes involved in leveraging data assets: from the modern data stack to data team composition to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data.
At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. We designed our catalog software to be easy to use, delightful and friendly.
Want to check it out? Try our data catalog tool for free.
Subscribe to the Castor Blog
"I love the easy-to-use interface and how quickly you can find the relevant assets you're looking for in your database. I also really appreciate the score assigned to each table, which lets you prioritize your query results based on how frequently certain data is used." - Michal P., Head of Data.