What is a data lake?
You’ve probably heard of data warehousing, but now there’s a newer phrase doing the rounds, and it’s one you’re likely to hear more in the future if you’re involved in big data: ‘Data Lakes’.
So what are they? Well, the best way to describe them is to compare them to data warehouses, because the difference is much the same as the difference between storing something in a warehouse and storing something in a lake.
In a warehouse, everything is archived and ordered in a defined way – the products are inside containers, the containers on shelves, the shelves are in rows, and so on. This is the way that data is stored in a traditional data warehouse.
In a data lake, everything is just poured in, in an unstructured way. A molecule of water in the lake is equal to any other molecule and can be moved to any part of the lake where it will feel equally at home.
This means that data in a lake has a great deal of agility – another word which is becoming more frequently used these days – in that it can be configured or reconfigured as necessary, depending on the job you want to do with it.
A data lake contains data in its rawest form – fresh from capture, and unadulterated by processing or analysis.
It uses what is known as object-based storage, because each individual piece of data is treated as an object, made up of the information itself packaged together with its associated metadata, and a unique identifier.
No piece of information is “higher-level” than any other, because it is not a hierarchically archived system, like a warehouse – it is basically a big free-for-all, much like the water molecules in a lake.
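To make the idea of an “object” a little more concrete, here is a minimal sketch in Python of a flat store in which every item is the raw data, its metadata and a unique identifier, with no folders or hierarchy. The names StoredObject and FlatObjectStore are invented for this illustration and do not come from any real product; an actual object store would also handle persistence, distribution and access control.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    """One object: the raw data plus the metadata that describes it."""
    data: bytes
    metadata: dict


class FlatObjectStore:
    """Hypothetical in-memory store: a single flat namespace keyed by ID."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes, **metadata) -> str:
        # Every object gets its own unique identifier; there is no path
        # or parent, so no object sits "above" any other.
        object_id = uuid.uuid4().hex
        self._objects[object_id] = StoredObject(data, dict(metadata))
        return object_id

    def get(self, object_id: str) -> StoredObject:
        return self._objects[object_id]

    def find(self, **criteria) -> list:
        # Objects are located by their metadata, not by where they are filed.
        return [
            object_id
            for object_id, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


store = FlatObjectStore()
click_id = store.put(b'{"page": "/home"}', source="web", format="json")
sales_id = store.put(b"2015-03-01,19.99", source="crm", format="csv")
print(store.find(format="csv") == [sales_id])
```

Because nothing is filed under anything else, finding an object is a matter of asking for its identifier or searching on its metadata.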
The term is thought to have first been used in 2011 by Pentaho CTO James Dixon, who didn’t invent the concept but gave a name to the type of innovative data architecture solutions being put to use by companies such as Google and Facebook.
It didn’t take long for the name to make it into marketing material. Pivotal refer to their product as a “business data lake” and Hortonworks include it in the name of their service, Hortonworks Datalakes.
It is a practice which is expected to become more popular in the future, as more organizations become aware of the increased agility afforded by storing data in data lakes rather than strict hierarchical databases.
For example, the way that data is stored in a database (its “schema”) is often defined in the early days of the design of a data strategy. The needs and priorities of the organization may well change as time goes on, and a schema fixed at the outset may no longer fit the questions being asked of the data.
One way of thinking about it is that data stored without structure can be shaped into whatever form is needed more quickly than data whose previous structure must first be disassembled and then reassembled.
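This approach is often described as applying the schema when the data is read rather than when it is written. The short Python sketch below illustrates the idea under some loud assumptions: the records in RAW_EVENTS, their field names and the helper functions are all made up for this example, and a real lake would hold far larger and messier data.

```python
import json

# Raw records kept exactly as captured (sample data invented for this sketch).
RAW_EVENTS = [
    '{"user": "ana", "amount_cents": "1999", "country": "PT"}',
    '{"user": "bo", "amount_cents": "500", "country": "SE"}',
    '{"user": "ana", "amount_cents": "350", "country": "PT", "coupon": "SPRING"}',
]


def read_with_schema(raw_lines, schema):
    """Shape raw records at read time.

    `schema` maps each wanted field to a converter; fields that a record
    lacks come back as None, and fields outside the schema are ignored.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: convert(record[name]) if name in record else None
            for name, convert in schema.items()
        }


def total_by_country(raw_lines):
    # One job: finance wants revenue per country.
    schema = {"country": str, "amount_cents": int}
    totals = {}
    for row in read_with_schema(raw_lines, schema):
        totals[row["country"]] = totals.get(row["country"], 0) + row["amount_cents"]
    return totals


def coupons_used(raw_lines):
    # Another job, decided later: marketing wants to know who used a coupon.
    schema = {"user": str, "coupon": str}
    return [
        (row["user"], row["coupon"])
        for row in read_with_schema(raw_lines, schema)
        if row["coupon"] is not None
    ]


print(total_by_country(RAW_EVENTS))
print(coupons_used(RAW_EVENTS))
```

Both questions are answered from the same untouched records, and neither required the stored data to be restructured first.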
Another advantage is that the data is available to anyone in the organization, and can be analyzed and interrogated via different tools and interfaces as appropriate for each job.
It also means that all of an organization’s data is kept in one place – rather than having separate data stores for individual departments or applications, as is often the case.
This brings its own advantages and disadvantages – on the one hand, it makes auditing and compliance simpler, with only one store to manage. On the other, there are obvious security implications if you’re keeping “all your eggs in one basket”.
Data lakes are usually built within the Hadoop framework, as the datasets that make them up are “big” and need the volume of storage offered by distributed systems.
Much of this is still theoretical, because very few organizations are ready to move all of their data into a lake. Many are bogged down in “data swamps” – hard-to-navigate mishmashes of land and water where their data has been stored in various, uncoordinated ways over the years.
And it has its critics of course – some say that the name itself is a problem (and I am inclined to agree) as it implies a lack of architectural awareness, when a more careful consideration of data architecture is what’s really needed when designing new solutions.
But for better or worse, it is a term that you will probably be hearing more of in the near future if you’re involved in big data and business intelligence.
Are you ready to dive head first into the data lake or do you prefer to keep your data high and dry?