Preparing For A Data Lake Implementation

By Kimi Mahajan Aug 29, 2019

Data remains a giant value generator and reinforces your enterprise’s ability to stay ahead of the competition.

However, managing, securing and storing data so that it stays relevant, and using that voluminous information to your advantage, can be difficult and calls for a streamlined process.

So, how do you make data more useful to you and benefit from its infinite possibilities? What are the cutting-edge tools you need to keep your enterprise future-ready?

We have already discussed the basics of data lakes and the expected stages of data lake implementation. Let’s dig deeper into when and why to implement a data lake and how to strategize the implementation process.

When Should You Opt for a Data Lake

Here are a few scenarios you could be looking at, when it comes to enterprise data:

  • You’re working with a growing amount of unstructured data
  • You want to leverage big data across your offerings
  • Your organization needs a unified view of information
  • You need to be able to perform real-time analysis on data
  • Your organization is moving towards a culture of democratized data access
  • You need access to data, analytics and applications
  • Your organization can benefit from elasticity of scale

If one or more of these look familiar, then it’s time to formulate a phased transformational process.

Traditionally, an Enterprise Data Warehouse (EDW) has served as the foundation for data discovery and functioned well in defining the data according to its quality. However, EDWs are restricted in scope and ability, and are unable to handle data complexities.

So a data lake is required to expand the possibilities of what you can do with your data. You can take a look at the whole data lake vs. data warehouse discussion, and see how the two are actually complementary.

That said, you need to decide whether now is the right time to start with a data lake or whether you can invest in one a few months or years down the line. That depends mostly on your current business goals and challenges, and the kind of data that’s currently most valuable to you.

Here’s a list of pointers to consider before preparing to implement a data lake architecture:

Type of Data

Data lakes are best used to store constantly generated data, which often accumulates quickly.

Streaming data workloads commonly run to tens of billions of records totalling hundreds of terabytes. If you’re handling such a huge amount of data, you should definitely consider a data lake, since the cost of structuring and storing it in a relational database will be too high.

Staying with a data warehouse could be a better choice if you’re mostly working with traditional, tabular information, e.g., data generated by financial, CRM or HR systems.

Understanding the Intent

One of the great things about data lakes is the flexibility with which data is ingested and eventually used, following the principle of ‘store now, analyze later’.

A data lake could be a good fit for a project where a higher level of flexibility is required.

Complexity of Data Acquisition Process

Adding newly acquired data to your warehouse can often be resource-intensive. It gets even more complex with unstructured or semi-structured sources, which carry a serious ETL overhead to convert the data into a format that your data warehouse can work with.

If this complex process is making you consider giving up on some sources altogether, it’s time to consider a data lake – which will allow you to store all the data with minimal overhead, and then extract and transform the data when you want to actually do something with it.
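The "store first, transform later" idea can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `raw/<source>/events.jsonl` layout, the function names and the sample click records are all hypothetical, and a local temporary folder stands in for real data lake storage.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def store_raw(lake_dir: Path, source: str, records: list[str]) -> Path:
    """Append records exactly as received; no schema is enforced on write."""
    # Hypothetical layout: <lake>/raw/<source>/events.jsonl
    target = lake_dir / "raw" / source / "events.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as fh:
        for line in records:
            fh.write(line.rstrip("\n") + "\n")
    return target


def read_with_schema(path: Path, fields: list[str]) -> list[dict]:
    """Apply a schema at read time: keep only the fields the analysis needs."""
    rows = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # malformed lines stay in the lake but are skipped here
            if not isinstance(record, dict):
                continue  # only JSON objects can be projected onto fields
            rows.append({name: record.get(name) for name in fields})
    return rows


def demo() -> list[dict]:
    """Store three raw lines (one malformed), then project two fields."""
    with tempfile.TemporaryDirectory() as tmp:
        path = store_raw(Path(tmp), "clickstream", [
            '{"user": "u1", "page": "/home", "ms": 120}',
            '{"user": "u2", "page": "/pricing", "referrer": "ad"}',
            "not json at all",
        ])
        return read_with_schema(path, ["user", "page"])


if __name__ == "__main__":
    print(demo())
```

Ingestion here never rejects a record for having extra, missing or malformed fields. The cost of interpreting the data is paid only by the analysis that actually needs it.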

Existing Tools and Skills

A data lake would typically require big data engineers, who are difficult to find. If you lack such skills, consider sticking to your data warehouse until you have hired the engineering talent needed to manage your data lake.

Data Management and Governance

Both data lakes and data warehouses pose challenges when it comes to governance. Data warehouses pose the challenge of constantly maintaining and managing all the data, whereas data lakes are often quite difficult to effectively govern. Whichever approach you choose, make sure you have a good way to address these challenges as per your project.

The above points will help you decide whether or not to opt for a data lake.

Once you decide to go with a data lake, blindly plunging into its implementation won't necessarily benefit your organization. Having a big picture of what you want to achieve with your data, and a strategy for a cohesive data infrastructure, is crucial.

Strategy for Implementing Data Lake

A haphazard approach may lead to several challenges hampering the use of a data lake to support big data analytics applications.

In the absence of an overarching strategy, a lot of data handling best practices can get overlooked, causing challenges and bottlenecks further down the line. For example, not documenting the relevance of data objects stored in a data lake might make it difficult for data scientists to find relevant data. It can also make it harder to track who accesses which data sets and to determine what level of access privileges are needed on them.

So, here are seven steps to avoid such concerns when implementing data lakes.

  1. Create a taxonomy of data classifications
    Classification of data objects plays an important role in how they’re organized. Identify the key dimensions of the data such as data type, content, usage scenarios, groups of possible users and data sensitivity as part of your classifications.
  2. Design a proper data architecture
    Apply the defined classification taxonomy to direct how the data is organized. Include file hierarchy structures for data storage, file and folder naming conventions, access methods and controls for different data sets. 
  3. Employ data profiling tools
    The segregation of data going into a data lake can be easily done by analyzing its content. Data profiling tools can help by gathering information about what's in data objects, thereby providing insight for classifying them. They can also help in identifying data quality issues to ensure analysts are working with accurate information.
  4. Standardize the data access process
    Using diverse data access methods to obtain different data sets often poses difficulties. Standardizing the procedure with the help of a common and straightforward API can simplify data access and ultimately allow more users to take advantage of the data.
  5. Develop a searchable data catalog
    Prospective users might not be aware of what's in a data lake and where different data sets are located. A collaborative data catalog allows the users to know the details about each data asset and provides a forum for groups of users to share experiences, issues and advice on working with the data.
  6. Implement sufficient data protections
    Aside from the conventional aspects of IT security, utilize other methods to prevent the exposure of sensitive information contained in a data lake. This includes mechanisms like data encryption and data masking, along with automated monitoring to generate alerts about unauthorized data access or transfers.
  7. Raise data awareness internally
    Ensure, through appropriate training, that the users of your data lake are aware of the need to actively manage and govern the data assets it contains. Knowing how to use the data catalog to find available data sets, and how to configure analytics to access the data they need, will help impress upon them the importance of proper data usage.
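
To make a few of these steps more concrete, here is a rough Python sketch of how a classification taxonomy (step 1) can drive a storage naming convention (step 2), a searchable catalog (step 5) and simple masking of sensitive values (step 6). Everything in it is an assumption made for illustration: the `DataAsset` fields, the `zone/sensitivity/domain/name` path convention, the `Catalog` class and the hash-based `mask` helper are hypothetical and do not refer to any particular product.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAsset:
    """One catalog entry, classified along a few taxonomy dimensions."""
    name: str
    data_type: str          # e.g. "json", "csv"
    domain: str             # content area, e.g. "marketing"
    sensitivity: str        # e.g. "public", "internal", "restricted"
    description: str = ""
    tags: tuple[str, ...] = ()

    def storage_path(self, zone: str = "raw") -> str:
        # Hypothetical convention: the taxonomy decides where the file lives.
        return f"{zone}/{self.sensitivity}/{self.domain}/{self.name}.{self.data_type}"


class Catalog:
    """A tiny in-memory catalog with case-insensitive keyword search."""

    def __init__(self) -> None:
        self._assets: dict[str, DataAsset] = {}

    def register(self, asset: DataAsset) -> None:
        self._assets[asset.name] = asset

    def search(self, term: str) -> list[str]:
        needle = term.lower()
        return sorted(
            asset.name
            for asset in self._assets.values()
            if needle in " ".join(
                (asset.name, asset.domain, asset.description, *asset.tags)
            ).lower()
        )


def mask(value: str) -> str:
    """Replace a sensitive value with a truncated SHA-256 digest.

    Illustration only: an unsalted hash is weak protection for
    low-entropy values such as phone numbers.
    """
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    clicks = DataAsset("web_clicks", "json", "marketing", "internal",
                       "Clickstream events from the website", ("clickstream",))
    catalog = Catalog()
    catalog.register(clicks)
    print(clicks.storage_path())
    print(catalog.search("click"))
    print(mask("jane.doe@example.com"))
```

Deriving the storage path from the classification keeps the folder structure and the catalog from drifting apart, because both are generated from the same record.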

Organizations are increasingly attempting to innovate processes, driving heightened service excellence and delivery quality. Interested in knowing how data lakes represent a smarter opportunity for effective data management and usage for your organization?

Contact us and let our experts do the talking.
