Cloud Data Warehousing: Past, Present, and Future
Data Warehouse Benchmark and Market Analysis
It was 2013 and my job was to sell a fleet management solution to the largest motorbike manufacturer in India, but the deal would only go through if they first invested in a new database system that would make our solution viable. Together with a partner vendor, I found myself pitching SAP HANA as a modern database solution that was ideal for both transactional and analytic workloads.
At the time, collecting, storing, and analyzing data at scale required significant investment and was viable only for enterprises that had budgeted a few hundred thousand dollars and a few months in human capital. Fast forward to the present and it is table stakes for an organization serious about building a data culture to invest in a data warehouse. It’s no surprise that this shift is being driven by the innovation taking place in the market that has led to better, more affordable data warehousing solutions.
Before exploring the market for cloud data warehousing tools, it’s worth shedding some light on why companies need a data warehouse as well as the factors leading to its rapid adoption.
Data Warehousing Needs and Benefits
Due to an explosion of purpose-built SaaS tools, there is more data, and there are more data sources, than ever before. At the same time, organizations big and small have come to terms with the fact that providing personalized experiences powered by data is the only way to attract and retain customers.
This is made possible only when teams are able to analyze and activate data collected across every customer touchpoint, and while there are suites of SaaS products that can help with that, storing a copy of one’s data in a warehouse can unlock many more possibilities. Moreover, not storing a copy of the data leads to vendor lock-in, inefficiencies, and frustration for teams that need ad-hoc analyses or a specific dataset to be made available in an external tool they rely on.
Let me now highlight the obvious benefits of adopting a data warehouse.
Affordable and Easy to Implement
The usage-based pricing and separation of storage and compute (more on that below) have made it extremely affordable for businesses small and big to adopt a warehouse even with limited data volume and few data sources. In fact, most data warehousing vendors offer trials and free monthly quotas for storage and analysis.
In terms of implementing a data warehouse, you can spin one up in a matter of hours (or minutes if you’ve done it before) without the need to write any code whatsoever. And using an ELT tool such as Fivetran or Airbyte, you can ingest data into the warehouse from pretty much every data source imaginable. Needless to say, one has to write SQL to run queries on the data or to build data models that can be consumed in downstream applications. But speaking as a non-engineer, SQL isn’t hard, and here are some great resources to learn SQL.
Performant and Scalable
One of the biggest benefits of cloud data warehouses is that, unlike their predecessors, modern warehousing tools are built on an architecture that separates compute from storage.
This means that what you pay to store data is separate from what you pay to run queries on it. This not only brings cost benefits but also makes cloud data warehouses more performant, with the ability to run hundreds of queries concurrently. I recommend this quick read to learn more about concurrency in cloud data warehouses.
Moreover, with storage becoming cheaper every day, companies are able to replicate their production databases in their data warehouses, making the warehouse the source of truth for all data.
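To make the separation of storage and compute a little more concrete, here is a minimal Python sketch of how a usage-based bill might be put together when the two are metered independently. The rates, the `Rates` class, and the `monthly_bill` function are all hypothetical and do not reflect any vendor's actual pricing or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical unit prices; real vendors publish their own."""
    storage_per_tb_month: float  # price to keep 1 TB stored for a month
    compute_per_hour: float      # price for 1 hour of query compute


def monthly_bill(stored_tb: float, compute_hours: float, rates: Rates) -> dict:
    """Storage and compute are metered separately and then summed."""
    storage_cost = stored_tb * rates.storage_per_tb_month
    compute_cost = compute_hours * rates.compute_per_hour
    return {
        "storage": storage_cost,
        "compute": compute_cost,
        "total": storage_cost + compute_cost,
    }


# Illustrative numbers only.
rates = Rates(storage_per_tb_month=20.0, compute_per_hour=3.0)

busy_month = monthly_bill(stored_tb=2, compute_hours=10, rates=rates)
idle_month = monthly_bill(stored_tb=2, compute_hours=0, rates=rates)

print(busy_month)
print(idle_month)
```

In this sketch, a month with no queries still incurs the storage charge but no compute charge, which is why a small team with little data and few queries can keep a warehouse running cheaply.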
Data Warehousing Market Overview
Let’s look at how the market for data warehousing solutions has evolved in the last decade and where it is headed.
The Past
At the beginning of this post, I mentioned that a deal I was trying to close relied on the buyer investing in SAP HANA, which at the time was available as an on-premise solution with an upfront cost of a couple of hundred thousand dollars. You can imagine, therefore, that it took more than a few sales meetings to get companies to part with that kind of money.
(In case you’re curious, that deal of mine was stuck in a loop for months until I stopped following up.)
Other leading warehousing solutions were offered by Oracle, IBM, and Vertica. These were deployed on-premise and combined transactional (OLTP) and analytical (OLAP) processing. The key differences between OLTP and OLAP are explained in this short article from IBM.
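For readers who haven't come across the OLTP/OLAP distinction, the rough sketch below contrasts the two query shapes. It uses Python's built-in `sqlite3` module purely as a stand-in (SQLite is neither a warehouse nor one of the systems named above), and the `orders` table and its columns are made up for illustration.

```python
import sqlite3

# In-memory database used only to illustrate the two query shapes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, region TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (region, amount, status) VALUES (?, ?, ?)",
    [
        ("north", 100.0, "paid"),
        ("north", 50.0, "paid"),
        ("south", 75.0, "pending"),
    ],
)

# OLTP-style work: read or change a single row, located by its key.
conn.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (3,))
one_order = conn.execute(
    "SELECT id, status FROM orders WHERE id = ?", (3,)
).fetchone()

# OLAP-style work: scan many rows and aggregate them for analysis.
revenue_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders "
    "WHERE status = 'paid' GROUP BY region ORDER BY region"
).fetchall()

print(one_order)
print(revenue_by_region)
```

Transactional systems are tuned for many small operations like the first pair of statements, while analytical systems are tuned for the scan-and-aggregate pattern in the last query.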
With the ubiquity of the cloud, all of these vendors now offer cloud warehousing solutions; however, they make no appearance when people talk about modern cloud warehousing tools.
The Present
It’s 2021 and the data warehousing market is at a unique juncture. For the reasons mentioned above, the leading providers of cloud warehousing solutions are experiencing rapid adoption. At the same time, an established data company and a challenger upstart are trying to capture a piece of the pie.
Leading Horses
Amazon Redshift, Google BigQuery, and Snowflake are by far the three most popular solutions that are talked about in the context of cloud warehousing, followed by whatever Microsoft Azure calls its warehousing solution.
Redshift was launched in 2012 as the first OLAP database in the cloud that was quick and cheap to get started with. It also marked the beginning of modern cloud data warehousing as we know it today. While BigQuery came to market before Redshift, it was only in June of 2016, almost 5 years after general availability, that Google launched support for standard SQL, which led to wider adoption.
However, Snowflake, which launched publicly in 2014, was the first to deliver a solution that separated compute from storage, an architecture that has since been adopted by other providers (although many Redshift customers are probably still not running on the cluster type that supports this architecture).
If you’d like to dig deeper and understand the differences between Redshift, BigQuery, and Snowflake, I highly recommend this in-depth comparison. Or you can check out our CPO's recommendation on how he'd build his tech stack here.
As mentioned above, Microsoft too has a horse in the race that I’ve been told is well-suited for companies running on the Azure cloud.
It’s also worth mentioning that Snowflake, after establishing itself as a leading data warehouse vendor, is now positioning itself as an end-to-end cloud data platform that, among other things, can store and process unstructured data.
A Decacorn and A Challenger
The present warehousing landscape includes two more companies — Databricks, the $38B decacorn, and Firebolt, the challenger. Any evaluation of data warehousing solutions would be incomplete without taking into consideration what these two companies have to offer.
Founded by the creators of Apache Spark, Databricks started by offering a managed solution for Spark, an analytics engine meant to process large volumes of data, typically used by large companies to manage machine learning workloads.
Today, Databricks offers a product called Lakehouse that combines the capabilities of a data lake and a data warehouse.
Historically, data lakes have been used to store raw, unstructured data that didn’t have an immediate use case. Data warehouses, on the other hand, were designed to store structured data that was prepared or transformed for the purpose of analytics. However, with the rise of ELT over ETL, driven by warehousing solutions becoming cheaper and more performant, the lines between a data lake and a data warehouse are blurring. It’s worth noting that slowly but surely, Snowflake and Databricks are converging. If you’re curious to know more, my friend Annika has done a fantastic job covering this convergence.
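The ELT pattern mentioned above (load raw data first, transform it afterwards inside the warehouse) can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular tool does it: `sqlite3` stands in for the warehouse, and the `raw_payments` and `payments_clean` tables are hypothetical.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# Extract + Load: land the source rows as-is, untyped and uncleaned.
warehouse.execute("CREATE TABLE raw_payments (user_email TEXT, amount TEXT)")
warehouse.executemany(
    "INSERT INTO raw_payments VALUES (?, ?)",
    [
        (" Alice@Example.com ", "10.50"),
        ("alice@example.com", "4.50"),
        ("BOB@example.com", "20"),
    ],
)

# Transform: done later, in SQL, inside the warehouse itself.
warehouse.execute(
    "CREATE TABLE payments_clean AS "
    "SELECT LOWER(TRIM(user_email)) AS email, "
    "       CAST(amount AS REAL) AS amount "
    "FROM raw_payments"
)

totals = warehouse.execute(
    "SELECT email, SUM(amount) FROM payments_clean "
    "GROUP BY email ORDER BY email"
).fetchall()

# The raw table is untouched, so it can be re-transformed differently later.
raw_row_count = warehouse.execute(
    "SELECT COUNT(*) FROM raw_payments"
).fetchone()[0]

print(totals)
print(raw_row_count)
```

Because the raw rows stay in the warehouse next to the cleaned model, the same system ends up holding both lake-style raw data and warehouse-style prepared data, which is one way to read the blurring described above.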
Moving on, Firebolt has emerged as a challenger in the space by making some really bold claims and doing a fine job at explaining how they do it (highly recommend watching this product demo).
I’m a fan of how quickly Firebolt has positioned itself as a serious contender in a market dominated by multi-billion dollar giants, while also having fun!
The Future
The market leaders of the present are not getting disrupted anytime soon and are likely to forge ahead as leaders in the years to come.
Firebolt has definitely given us a glimpse of what the future of data warehousing looks like and as it gains momentum, it is expected that Snowflake and others won’t just stop and stare. However, there are a few purpose-built analytics database systems that are pushing the boundaries of what is achievable with the current warehousing solutions. ClickHouse, Apache Druid, and Apache Pinot are all open-source OLAP databases/datastores/database management systems meant for real-time analytics use cases. Materialize is another database startup that is making strides in the real-time analytics arena.
These are not general-purpose data warehousing solutions, as they’re meant to cater to real-time use cases only. They also differ in architecture: they don’t decouple storage and compute, which is a core premise of modern data warehousing tools.
If you’re looking for a super detailed comparison between ClickHouse, Druid, and Pinot, this is it. By no means are these “database systems” a replacement for a data warehouse but they do give us a glimpse of what to expect from products that enable storing large volumes of data for the purpose of analytics.
Conclusion
Modern data warehousing tools have made it really easy and affordable to do two things — store a lot of data in the cloud and query that data using SQL for analysis and activation purposes.
Companies that are serious about having control and ownership over their data but choose to not invest in a data warehouse are, well, not serious about it after all. There are companies that use a PostgreSQL database as their data warehouse and while that is okay in the short run, with data warehouses becoming cheaper and more performant, there is really no excuse to not set one up sooner rather than later.