While we have talked a lot about the rising need for data lakes, it’s just as important to talk about how easily they can go wrong in the absence of a good data lake strategy. Most businesses expect phenomenal insights, but not enough attention is paid to setting the data lake up the right way. And that is where it can all start to unravel.
It’s not uncommon to see scenarios where businesses have invested a lot of time, money, and resources into building a data lake, only to find it isn’t actually being used. Sometimes people are slow to adopt it; sometimes a faulty implementation has made the data lake effectively useless.
So here, we take a brief look at six common data lake strategy pitfalls, and how to avoid them.
Challenges involved in Loading Data
There are two challenges involved when loading data into a data lake:
Managing big data file systems typically requires loading an entire file at a time. While this is no big deal for small files, it becomes cumbersome for large tables and files. To minimize load times for large data sets, load the entire data set once, then load only the incremental changes: identify the source rows that have changed, and merge those changes into the existing tables in the data lake.
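The incremental-load approach above can be sketched in a few lines. This is a minimal illustration, not a specific product's API; it assumes each row carries a unique key and a `last_modified` watermark column, and all names are hypothetical.

```python
# Minimal sketch of an incremental ("delta") load into a lake table.
# Assumes each row has a unique key and a last_modified watermark.

def incremental_merge(existing, changes, key="id", watermark="last_modified"):
    """Upsert only the source rows newer than the lake's high-water mark."""
    # Highest watermark already present in the lake table.
    high_water = max((row[watermark] for row in existing), default=0)
    # Keep only source rows that changed since the last load.
    fresh = [row for row in changes if row[watermark] > high_water]
    # Newer rows replace stale versions of the same key.
    merged = {row[key]: row for row in existing}
    for row in fresh:
        merged[row[key]] = row
    return list(merged.values())
```

Real implementations push this merge down into the lake engine (e.g. a `MERGE INTO` statement), but the shape of the logic is the same: find the high-water mark, pull only newer rows, and upsert them.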
Loading data from the same source into different parts of the data lake consumes too much capacity. As a result, the data lake gets a bad reputation for interrupting the operational databases that are used to run the business. Strong governance processes are required to ensure this doesn’t happen.
Lack of Pre-planning
Data lakes can store an unfathomable amount of data, but failing to assess the value of data before dumping it in is one major reason they fail. While the point of a data lake is to hold all of your company’s data, it is still important to build it in accordance with your specific needs. Balancing the kind of data you need against the amount of data you dump into the lake ensures the challenges of the data lake implementation are minimized.
When you store data in a data lake, you also need to make sure it is easy for analysts to find. Merely storing all the data at once, without cataloguing it, is a big mistake for a few key reasons:
- It can lead to accidentally loading the same data source more than once, eating into storage
- Uncatalogued data cannot be used to its full potential. Without tags around lineage, source, and other markers that classify the data, stakeholders, as well as the models they build, won’t be able to find and work with the right data.
Ensuring metadata storage is key to a data lake that’s actually useful. There are several technologies available to set up your data cataloging process. You can also automate it within your data lake architecture with solutions like AWS Glue.
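As a toy illustration of what a catalog entry should capture before a data set lands in the lake, here is a sketch in plain Python. Production setups would use AWS Glue, a Hive Metastore, or similar; the class and field names below are illustrative assumptions.

```python
# Toy metadata catalog: each data set is registered once, with the
# source, lineage, and tags that make it findable later.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    dataset: str
    source: str                                   # originating system
    lineage: list = field(default_factory=list)   # upstream data sets
    tags: dict = field(default_factory=dict)      # e.g. owner, PII flags

class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> bool:
        """Refuse duplicates so the same source isn't loaded twice."""
        if entry.dataset in self._entries:
            return False
        self._entries[entry.dataset] = entry
        return True

    def find(self, **tags):
        """Let analysts locate data sets by tag instead of guessing paths."""
        return [e for e in self._entries.values()
                if all(e.tags.get(k) == v for k, v in tags.items())]
```

The two behaviors worth noting are exactly the two failure modes from the list above: `register` rejects duplicate loads, and `find` is what makes the data discoverable at all.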
Duplication of Data
When Hadoop distributions or clusters pop up all over the enterprise, there is a good chance you’re storing loads of duplicated data. This creates data silos, which inhibit big data analytics because employees can’t perform comprehensive analyses using all of the data.
All of this essentially re-creates the data proliferation problem data lakes were created to solve in the first place.
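A cheap first pass at spotting redundant copies across storage locations is to compare files by content hash. The sketch below is a simplified, hypothetical example; at lake scale this would run over object-store listings rather than local paths.

```python
# Sketch: flag duplicated files by content hash.
import hashlib

def digest(path):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    """Return (copy, original) pairs for files with identical content."""
    seen, duplicates = {}, []
    for p in paths:
        d = digest(p)
        if d in seen:
            duplicates.append((p, seen[d]))
        else:
            seen[d] = p
    return duplicates
```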
Inelastic Architecture
One of the most common mistakes organizations make is building their data lakes on inelastic architecture. Many start out with one server at a time, slowly and organically growing their big data environment and adding high-performance servers to keep up with business demands. This decision is usually driven by the cost of data storage, but it proves to be a mistake in the long run: the growth of data storage outpaces the growth of computing needs, and maintaining such a large physical environment becomes cumbersome and problematic.
Not the Right Governance Process
Not using the right governance process can be another obstacle to your data lake implementation.
- Too much governance imposes so many restrictions on who can view, access, and work on the data that no one ends up being able to access the lake, rendering the data useless
- Not enough governance means that organizations lack proper data stewards, tools, and policies to manage access to the data. Unorganized and mismanaged data lakes can lead to an accumulation of low quality data, which is polluted or tampered with. Eventually the business stops trusting this data, rendering the entire data lake useless
Implementing a good governance process and documenting your data lineage thoroughly helps illuminate the actions people took to ingest and transform data as it enters and moves through your data lake.
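Lineage documentation can be as simple as an append-only log of every step that touches a data set. The sketch below is a hypothetical, minimal example; dedicated lineage tools capture the same information automatically.

```python
# Hypothetical append-only lineage log: record who ingested or
# transformed a data set and what inputs each step consumed.
import time

lineage_log = []

def record_lineage(dataset, action, actor, inputs=()):
    """Append one lineage event and return it."""
    event = {
        "dataset": dataset,
        "action": action,        # e.g. "ingest", "clean", "join"
        "actor": actor,
        "inputs": list(inputs),  # upstream data sets this step consumed
        "ts": time.time(),
    }
    lineage_log.append(event)
    return event

def history(dataset):
    """Every recorded step that produced or touched this data set."""
    return [e for e in lineage_log if e["dataset"] == dataset]
```

With a log like this, an analyst who distrusts a table can walk its `history` back to the raw source instead of abandoning the lake altogether.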
While this is by no means an exhaustive list, these are some of the most common mistakes that businesses make. Plugging these holes in your data lake strategy sets you up for better returns from your initiative right out of the gate. It also ensures that your data lake does not become a data swamp where information and insights disappear without a trace.
Working on a data lake strategy for your enterprise? Or building the right data lake architecture to leverage and monetize your data?
Tell us a bit about your project and our experts will be in touch to explore how Srijan can help.