During a recent conversation with one of our clients, we were discussing market data for commodities and the best way to store and access them. For the uninitiated, commodity-related data is a broad topic that spans prices, storage, generation, distribution, weather, physical market mechanics and a myriad of other data. Given the client’s needs, we proposed a data warehouse over a data lake. What we did not expect was strong pushback because “…data warehouse projects are a big no-no here.”
It turned out that the client’s resistance stemmed from a mix of past failures, misalignment in terminology and skepticism that a data management approach could be quickly deployed. After a series of conversations, we got to the right place for our client and they are now happily using their data warehouse.
Getting to the “right place” for our client took a bit of change management, education, goal clarification and alignment around the best technical approach. We won’t unpack all of that in this post, but we will share some of that conversation around the technical approach and the four main dimensions that can drive the decision whether to pursue a data lake or a data warehouse. Those dimensions are Data Structure, Data Storage, Access and Use, and Time to Market.
Data Structure
In our opinion, the structure of the data is the largest factor when evaluating whether to go with a data lake or a data warehouse. In a data warehouse, the data is typically very structured and controlled. Getting to this structure usually involves normalization and transformation before the data is stored in a database. In contrast, a data lake can house data that is completely unstructured. The data lake can incorporate flat files and other forms of data not easily contained in a database. Data lakes can hold structured data, but structure is not required as it is for a data warehouse. The data structure has a direct impact on the amount of storage, how the data can be accessed, how quickly it can be accessed and what transformations are required to use it. We will now cover each one of these in detail.
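To make "normalization and transformation before the data is stored" more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the record fields, the sample values, the `prices` table and the function names are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3
from datetime import datetime

# Hypothetical, made-up price records as they might arrive from different sources.
RAW_RECORDS = [
    {"commodity": "WTI Crude", "date": "2021-03-01", "price": "60.64", "unit": "USD/bbl"},
    {"commodity": "wti crude ", "date": "03/02/2021", "price": "59.75", "unit": "usd/bbl"},
    {"commodity": "Henry Hub Gas", "date": "2021-03-01", "price": "2.78", "unit": "USD/MMBtu"},
]


def parse_date(text):
    """Accept ISO (YYYY-MM-DD) or US (MM/DD/YYYY) dates and return an ISO string."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def transform(record):
    """Normalize one raw record into the fixed shape the table expects."""
    return (
        record["commodity"].strip().upper(),
        parse_date(record["date"]),
        float(record["price"]),
        record["unit"].strip().upper(),
    )


def load_warehouse(records):
    """Create the structured table and load only transformed rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE prices (
            commodity  TEXT NOT NULL,
            price_date TEXT NOT NULL,
            price      REAL NOT NULL,
            unit       TEXT NOT NULL,
            PRIMARY KEY (commodity, price_date)
        )
        """
    )
    conn.executemany(
        "INSERT INTO prices VALUES (?, ?, ?, ?)",
        [transform(r) for r in records],
    )
    conn.commit()
    return conn
```

The point of the sketch is the ordering: every record passes through `transform` and must fit the table definition before it is stored. A data lake would skip that step and keep the records as they arrived.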
Data Storage
As we mentioned above, the data stored within a data warehouse is contained in a database, while a data lake can include spreadsheets and PDFs in a file store. If there is no business need to normalize and transform the data, then a data lake with a mix of database, file and blob storage could be enough. However, if there is a preference to have the data normalized and transformed using a consistent model, then that data is most likely to be stored in a database and will benefit from a data warehouse approach.
Access and Use
Thinking through the use cases for the data can help define the right approach. In a data warehouse, users access the data through business intelligence (BI) tools such as Power BI, Tableau or Spotfire. These BI tools provide users with powerful capabilities to interrogate and report on the data. For data lakes, similar tools may be used, but if the data is unstructured, users may have to manipulate the data before using it in analysis. If the users are comfortable with data transformation and with BI tools, then a data lake may offer more flexibility.
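As a rough illustration of what manipulating lake data before analysis can look like, the sketch below averages prices from a raw CSV extract. It assumes a hypothetical file layout and made-up values; the column names and the function name are ours, not taken from any particular tool.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw extract sitting in a data lake, including one unusable price.
RAW_CSV = """commodity,date,price
WTI Crude,2021-03-01,60.64
WTI Crude,2021-03-02,n/a
Henry Hub Gas,2021-03-01,2.78
WTI Crude,2021-03-03,59.36
"""


def average_price_by_commodity(raw_text):
    """Clean the raw rows at read time, then compute a simple average per commodity."""
    totals = defaultdict(lambda: [0.0, 0])  # commodity -> [sum of prices, row count]
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            price = float(row["price"])
        except ValueError:
            # The user, not the platform, decides what to do with bad values.
            continue
        name = row["commodity"].strip()
        totals[name][0] += price
        totals[name][1] += 1
    return {name: total / count for name, (total, count) in totals.items()}
```

Here the cleaning rules (skip unparseable prices, trim names) live in the analyst's code and are applied on every read. With a warehouse, those decisions would typically have been made once, at load time.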
Time to Market
Because it does not require structured data, a data lake is usually the better choice when speed matters: one can start capturing new data sets much faster than with a data warehouse. A data lake only needs the data files, so once the source is established the data will be in the lake. A data warehouse requires the source data to be transformed, and defining and building that transformation takes extra time.
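The idea that a lake "only needs the data files" can be sketched in a few lines of Python. This is a simplified illustration under assumptions: a local folder stands in for the lake's file or blob store, and the folder layout, the dataset name and the `land_file` function are hypothetical.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def land_file(source: Path, lake_root: Path, dataset: str, as_of: date) -> Path:
    """Copy a source file into the lake unchanged, under dataset/date folders."""
    target_dir = lake_root / dataset / as_of.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # no parsing, no schema, no transformation
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        # Hypothetical file delivered by a data provider.
        incoming = tmp_path / "storage_report.csv"
        incoming.write_text("region,volume\nGulf Coast,123\n")
        landed = land_file(incoming, tmp_path / "lake", "storage", date(2021, 3, 1))
        print(landed.relative_to(tmp_path))
```

Nothing in this step needs to know what the file contains, which is why new sources can be captured quickly. The trade-off is that the work of interpreting the file is deferred to whoever reads it later.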
In Conclusion
Data warehouses and data lakes are not the same thing, even though people commonly use the two terms interchangeably. When deciding what makes the most sense for your organization, consider the differences discussed above as part of your evaluation. We also explore the concept of data stores in Data Stores: How to Enhance your Ecosystem, which is coming soon! We purposely held that concept back from this post because it is one people are less familiar with and it should be explored separately.
At Veritas Total Solutions, we help educate clients and design data architectures and infrastructures that best meet their needs. We offer a range of technology solutions across the business spectrum. If you are interested in learning more about our specific capabilities, contact us or subscribe to our blog to stay connected.