6 Best Practices for Data Lakes

June 1, 2019

As the volume and variety of data continue to increase, organizations are realizing the need for a more efficient way to manage enterprise data. An enterprise data warehouse (EDW) is designed to collect data from enterprise applications – data that follows an enterprise data model. A data lake, by contrast, collects and stores information in its native format, with no processing required up front. Retaining the data in its native format makes it faster and easier to analyze.

Some organizations store only a few types of data, so a traditional data warehouse serves the purpose. Other organizations collect data from multiple sources, which makes a data lake necessary. Data lakes typically impose very few constraints on incoming data, so all types of structured and unstructured data can be stored. However, successful data lakes require data and analytics leaders to follow certain best practices:

1. Identify the Business Problem

Many organizations fail to implement a data lake because they haven’t identified a clear business case for it. Organizations that begin by identifying a business problem for their data, and stay focused on finding a solution, are more likely to be effective. One of the common drivers for a data lake is that existing systems can no longer keep up. As the volume and variety of data continue to grow, businesses find it challenging to store, manage, and process the data at the speed required for timely action. Therefore, it is important to understand the value of solving the business problem and to quantify the potential benefits of deploying a data lake.

2. Find the Right Resources

A data lake implementation is a company-wide strategic priority that involves individuals from different departments, not just the IT team. It is important to have the right people and team in place for the implementation. Companies often don’t fully understand how many resources it takes to build a data lake, because it’s all new to them. Don’t fall into the trap of building everything yourself. Assess whether your team has the necessary experience. If it doesn’t, you have other options – training your existing team, hiring the people you need, or outsourcing the implementation to experts.

3. Upgrade Your Data Architecture

Data lakes seldom exist in silos. They are often integral parts of a larger data architecture or multiplatform ecosystem. A data lake can also extend traditional applications – including ERP, financials, content management, and more. In that role, it can serve as a modernization strategy that extends the life and functionality of an existing application or data environment.

4. Cleanse Your Data

Data cleansing improves data quality and utility by identifying and correcting errors before the data is moved to the data lake. Manual data cleansing may not be possible, depending on the amount of data and the number of data sources, but there are data cleansing tools that can automate part of the process. Both manual and automatic data cleansing involve a series of steps – correcting any mismatches, recreating missing data wherever possible, reordering rows and columns, ensuring that data is in the same format as the destination, and deleting duplicate records.
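
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the field names (customer_id, email, signup_date), the accepted date formats, and the rule for what counts as a duplicate are all assumptions made for the example.

```python
from datetime import datetime

# Assumed input date formats; a real pipeline would list its own.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def normalize_date(value):
    """Return the date as YYYY-MM-DD, or None if it cannot be parsed."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag the record instead of guessing a date


def cleanse(records):
    """Split raw records into (cleaned, rejected) lists."""
    seen = set()
    cleaned, rejected = [], []
    for rec in records:
        # Fix mismatches and put columns in the destination's order.
        row = {
            "customer_id": str(rec.get("customer_id") or "").strip(),
            "email": str(rec.get("email") or "").strip().lower(),
            "signup_date": normalize_date(str(rec.get("signup_date") or "")),
        }
        # Records that cannot be repaired are set aside for manual review.
        if not row["customer_id"] or row["signup_date"] is None:
            rejected.append(rec)
            continue
        # Delete duplicates (assumed key: customer_id plus email).
        key = (row["customer_id"], row["email"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned, rejected
```

Calling cleanse(raw_records) returns the rows that are ready to load and the rows that need a person to look at them, which keeps the automated part of the process separate from the manual part.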

5. Secure Your Data Lake

You need to ensure that your data is actively and securely managed. Some companies face security issues because they misconfigure permissions and leave their data too easily accessible. There are many tools and technologies that can help with security and governance. You must also evaluate your vendors’ security measures, such as access permissions and encryption. At a minimum, you need the following capabilities to ensure that your data lake is secure:

  • User authentication
  • User authorization
  • Data in motion encryption
  • Data at rest encryption
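
To make the first two capabilities concrete, here is a small Python sketch of how authentication and authorization differ. Everything in it is hypothetical: the user names, roles, path prefixes, and the hard-coded secret exist only for illustration, and a real data lake would rely on its platform's identity and access management service. Encryption in motion and at rest is not shown, since it is usually configured in the transport and storage layers rather than written by hand.

```python
import hashlib
import hmac

SECRET = b"demo-only-secret"  # assumption: never hard-code a real secret

# Hypothetical users, roles, and grants: (path prefix, action).
USER_ROLES = {"ana": "analyst", "raj": "engineer"}
ROLE_GRANTS = {
    "analyst": {("sales/", "read")},
    "engineer": {("sales/", "read"), ("sales/", "write")},
}


def sign(user):
    """Issue a token for a user (stands in for a real login flow)."""
    return hmac.new(SECRET, user.encode(), hashlib.sha256).hexdigest()


def authenticate(user, token):
    """Authentication: is this really the user they claim to be?"""
    return hmac.compare_digest(sign(user), token)


def is_authorized(user, path, action):
    """Authorization: may this user perform this action on this path?"""
    grants = ROLE_GRANTS.get(USER_ROLES.get(user), set())
    return any(path.startswith(prefix) and action == allowed
               for prefix, allowed in grants)


def can_access(user, token, path, action):
    # Deny by default: both checks must pass.
    return authenticate(user, token) and is_authorized(user, path, action)
```

The sketch denies access unless a grant explicitly allows it, which guards against the misconfigured-permissions problem described above.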

6. Enable Self-Service Data Access

Data quality is critical, but delivering that information in a timely manner to analysts, data scientists, and executives is equally important. Both business and technical users now expect self-service access to data, along with the ability to explore, prepare, analyze, and visualize it. Your data lake should be able to deliver information ad hoc and on the go to all business functions – finance, marketing, sales, product development, and customer service teams.

