
A data lake is a system or repository of data stored in its natural/raw format,[1] usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc.,[2] and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video).[3] A data lake can be established on premises (within an organization's data centers) or in the cloud (using cloud services).
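The four kinds of data listed above can be illustrated with a minimal Python sketch that sorts stored objects into those categories by file extension. The extension table, the `classify` and `group_by_category` names, and the use of `.parquet` to stand in for a relational table export are all assumptions made for illustration; real data lakes typically rely on richer metadata than a file suffix.

```python
from pathlib import PurePosixPath

# Hypothetical extension table: the categories come from the text above,
# the specific extensions are illustrative assumptions.
CATEGORY_BY_EXTENSION = {
    ".parquet": "structured",       # assumed export of a relational table
    ".csv": "semi-structured",
    ".log": "semi-structured",
    ".xml": "semi-structured",
    ".json": "semi-structured",
    ".eml": "unstructured",
    ".docx": "unstructured",
    ".pdf": "unstructured",
    ".png": "binary",
    ".jpg": "binary",
    ".mp3": "binary",
    ".mp4": "binary",
}


def classify(object_key: str) -> str:
    """Return the data category for an object key, or 'unknown'."""
    suffix = PurePosixPath(object_key).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "unknown")


def group_by_category(object_keys):
    """Group object keys by category, preserving input order within a group."""
    groups = {}
    for key in object_keys:
        groups.setdefault(classify(key), []).append(key)
    return groups
```

For example, `group_by_category(["raw/crm/orders.parquet", "raw/web/access.log"])` places the first key under "structured" and the second under "semi-structured".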
James Dixon, then chief technology officer at Pentaho, coined the term by 2011[4] to contrast it with a data mart, which is a smaller repository of interesting attributes derived from raw data.[5] In promoting data lakes, he argued that data marts have several inherent problems, such as information siloing. PricewaterhouseCoopers (PwC) said that data lakes could "put an end to data silos".[6] In their study on data lakes they noted that enterprises were "starting to extract and place data for analytics into a single, Hadoop-based repository."
Many companies use cloud storage services such as Google Cloud Storage and Amazon S3 or a distributed file system such as Apache Hadoop distributed file system (HDFS).[7] Academic interest in the concept of data lakes is gradually growing. For example, Personal DataLake at Cardiff University is a new type of data lake which aims to manage the big data of individual users by providing a single point for collecting, organizing, and sharing personal data.[8]
Early data lakes, such as Hadoop 1.0, had limited capabilities because they supported only batch-oriented processing (MapReduce). Interacting with them required expertise in Java, MapReduce, and higher-level tools such as Apache Pig, Apache Spark, and Apache Hive (which were also originally batch-oriented).
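The batch-oriented MapReduce model mentioned above can be sketched in plain Python as a word count with separate map, shuffle, and reduce phases. This is an illustration of the programming model only, not Hadoop's API; the function names are hypothetical. It shows why the model is batch-oriented: no result is available until the whole input has passed through every phase.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def run_batch_job(lines):
    """Run the three phases in sequence over the complete input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `run_batch_job` on a list of text lines returns a dictionary mapping each lowercased word to its number of occurrences.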
Poorly managed data lakes have been facetiously called data swamps.[9]
In June 2015, David Needle characterized "so-called data lakes" as "one of the more controversial ways to manage big data".[10] PwC was also careful to note in their research that not all data lake initiatives are successful. They quote Sean Martin, CTO of Cambridge Semantics:
We see customers creating big data graveyards, dumping everything into Hadoop distributed file system (HDFS) and hoping to do something with it down the road. But then they just lose track of what’s there. The main challenge is not creating a data lake, but taking advantage of the opportunities it presents.[6]
They describe companies that build successful data lakes as gradually maturing their lake as they figure out which data and metadata are important to the organization.
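The role of metadata in keeping track of what a lake contains can be illustrated with a minimal catalog sketch. The `CatalogEntry` and `Catalog` classes, their fields, and the example paths are hypothetical and do not correspond to any particular product; the point is only that comparing stored objects against registered metadata reveals data nobody has described.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    """Descriptive metadata for one stored object (illustrative fields)."""
    path: str
    owner: str
    description: str
    tags: Tuple[str, ...] = ()


class Catalog:
    """In-memory metadata catalog keyed by object path."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.path] = entry

    def search(self, tag: str) -> List[str]:
        """Return the paths of all entries carrying the given tag."""
        return [e.path for e in self._entries.values() if tag in e.tags]

    def uncatalogued(self, stored_paths: Iterable[str]) -> List[str]:
        """Return stored paths that have no metadata entry, sorted."""
        return sorted(p for p in stored_paths if p not in self._entries)
```

A non-empty result from `uncatalogued` is one simple signal that parts of the lake are drifting toward the "graveyard" described in the quote above.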
Another criticism is that the term data lake is used with many different meanings.[11] It may be used to refer to, for example: any tools or data management practices that are not data warehouses; a particular technology for implementation; a raw data reservoir; a hub for ETL offload; or a central hub for self-service analytics.
While critiques of data lakes are warranted, in many cases they apply to other data projects as well.[12] For example, the definition of data warehouse is also changeable, and not all data warehouse efforts have been successful. In response to various critiques, McKinsey noted[13] that the data lake should be viewed as a service model for delivering business value within the enterprise, not a technology outcome.
Data lakehouses are a hybrid approach that can ingest a variety of raw data formats like a data lake, while also providing ACID transactions and enforced data quality like a data warehouse.[14]
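Two of the warehouse-like properties named above, enforced data quality and all-or-nothing writes, can be sketched with a small in-memory table. The `LakehouseTable` class and `SchemaError` exception are hypothetical names used for illustration; the sketch covers only schema checking and atomicity of a single append, not the isolation or durability that real lakehouse systems provide.

```python
class SchemaError(ValueError):
    """Raised when a row does not match the table schema."""


class LakehouseTable:
    """In-memory table that validates a whole batch before committing it."""

    def __init__(self, schema):
        # schema maps column name -> expected Python type, e.g. {"id": int}
        self.schema = dict(schema)
        self._committed = []

    def _validate(self, row):
        if set(row) != set(self.schema):
            raise SchemaError(f"columns {sorted(row)} do not match schema")
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise SchemaError(f"column {column!r} expects {expected.__name__}")

    def append(self, rows):
        """Commit every row in the batch, or none if any row is invalid."""
        staged = []
        for row in rows:
            self._validate(row)       # raises before anything is committed
            staged.append(dict(row))
        self._committed.extend(staged)

    def rows(self):
        """Return a copy of the committed rows."""
        return [dict(row) for row in self._committed]
```

If one row in a batch has the wrong type, `append` raises `SchemaError` and the valid rows from that same batch are not committed either, so readers never see a partial write.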
If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
Walter Maguire, chief field technologist at HP's Big Data Business Unit, discussed one of the more controversial ways to manage big data, so-called data lakes.