A central tenet of Big Data, the Data Lake concept has been gaining popularity recently, as many vendors have incorporated Data Lakes into their existing enterprise analytics software or have built whole new platforms around the concept. Should the Smart City, with its growing data requirements, take advantage of it?
In the early 2000s, web giants Yahoo and Google drew on academic research and invested heavily in engineering to develop a platform that would allow them to store and actively use the massive amounts of data generated by their web-scale services. In effect, they were creating what is today referred to as the Data Lake and, principally, the main software element that powers it – Hadoop.
The problem Hadoop was built to solve was storing massive amounts of data, and processing that data, on cost-effective, scalable hardware that a business could afford to deploy. While this problem emerged from global web services, it is in effect the same issue businesses and governments face: the ability to retrieve large amounts of data and analyse across it in near real time, where previously this was an expensive and slow-to-develop process. In effect, Hadoop made it possible for archives to be live and accessible and not, well… archived.
While there are other approaches to storing 'live archives' and carrying out analytics on them, Hadoop, HDFS and the Data Lake have emerged as the leading industry standard; so much so that every major database and analytics vendor now offers either a direct implementation of Hadoop or a product inspired by it.
Enter the Data Lake concept. Hadoop overturns a central tenet of data analytics, namely that data should reside in one place, analytics software in another, and that analysis is performed by retrieving the required data. Hadoop reverses the process by dividing the analysis into packages that are distributed to the data stores, which are themselves spread across as many machines as needed, even thousands. This means cheap data stores can keep data 'live' for analysis purposes. It also means that data from business systems can be copied economically into the analytics system, eliminating the complexity of selecting and archiving data.
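The "move the computation to the data" idea described above can be sketched in miniature. The following is an illustrative Python sketch, not the actual Hadoop API: each "node" counts the records it holds locally (the map phase), and only the small partial results are merged centrally (the reduce phase).

```python
from collections import Counter
from functools import reduce

# Hypothetical illustration: three "nodes", each storing one partition of the data.
partitions = [
    ["sensor", "road", "sensor"],
    ["road", "transit"],
    ["sensor", "transit", "road"],
]

def map_task(partition):
    # Runs locally on the node that stores the partition; only the small
    # intermediate result (a Counter) would travel over the network.
    return Counter(partition)

def reduce_task(a, b):
    # Merges per-node partial counts into a global result.
    return a + b

partial_counts = [map_task(p) for p in partitions]       # distributed phase
totals = reduce(reduce_task, partial_counts, Counter())  # aggregation phase
print(dict(totals))  # {'sensor': 3, 'road': 3, 'transit': 2}
```

In a real Hadoop cluster the scheduler places each map task on the machine already holding that block of data, which is why cheap, distributed storage can stay 'live' for analysis.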
Another revolutionary part of the Data Lake concept is that both structured and unstructured data types can be ingested into the data store as they are produced by people or source systems. This means that little time is wasted transforming the data, reducing costs and performance requirements. It also means that structured data (relational databases, for example) and unstructured or semi-structured data, like posts to social media or archived documents, can be stored side by side and cross-analysed.
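This "ingest raw, structure later" approach is often called schema on read. A minimal Python sketch, with invented example records, shows the principle: raw records of different formats land in the lake untransformed, and parsing is applied only when the data is analysed.

```python
import csv
import io
import json

# Hypothetical illustration of schema on read: the list stands in for
# cheap distributed storage such as HDFS.
lake = []

# Ingest as-is: a relational-style CSV export and a social-media JSON post.
lake.append(("csv", "road_id,closed\nA-12,true"))
lake.append(("json", '{"user": "resident42", "text": "A-12 is blocked"}'))

def read_all(store):
    # Structure is imposed at analysis time, per record, based on its format.
    for fmt, raw in store:
        if fmt == "csv":
            yield from csv.DictReader(io.StringIO(raw))
        elif fmt == "json":
            yield json.loads(raw)

records = list(read_all(lake))
# Both sources are now queryable side by side, e.g. cross-referencing the
# closed road "A-12" with social-media mentions of it.
```

The design choice is the point: no upfront transformation pipeline has to be built before new data streams can be stored.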
Essentially, then, the Data Lake is a raw collection of all of an organisation's data, on which operations like searches or analysis can be carried out as massively parallel processes. Putting all the data together, instead of in silos, allows for analysis not previously possible. For example, having massive amounts of data available gives enough depth in data patterns that Predictive Analytics becomes a tool organisations can use. In the City context, this is a very powerful infrastructure planning and event management tool.
The principal benefits of a Data Lake are:
· Cost-effective, scalable storage on commodity hardware
· Ingestion of structured and unstructured data without upfront transformation
· Cross-analysis of data from formerly siloed sources
· Massively parallel processing across the whole data set
The Smart City is a City in which ICT is used as an enabler to improve the lives of its residents. All aspects of City life are affected, from housing to transport to energy, education, security, the economy and beyond. Cities are therefore now heading towards the same challenge that the large web services have been dealing with for a decade, namely analysis across massive amounts of data, as service delivery to citizens becomes more dependent on data. Not only have traditional services been digitised, but new technology is adding to City data growth.
Among traditional services, many cities, such as Dubai, have digitisation policies in place and digital eGovernment channels through which citizens, residents and visitors can interact with their services. New technologies in sensing and infrastructure management also bring with them a flood of data. While these data streams were dealt with in silos in the past, cities are finding value in joining data streams to produce new services or information that enables them to enhance the City experience. Examples of this include combining municipal infrastructure data about roads with that from emergency services to detect, or indeed predict, accident black spots more effectively. Adding utility data and social media analysis to this, a City can view its main flows of people and plan resources and infrastructure accordingly.
Another factor bringing City Governments into the data arena is the growth of Open Government and Open Data policy. This has been driven in part by city residents’ demands for clarity on the planning and operation of their City and partly by the recognition of City planners that involving a wider community in decision making and execution can improve the delivery of services. As an example, when Transport for London released live information about public transport movements to anyone wishing to develop applications against it, thousands of developers created hundreds of applications. Now these are helping millions of people get around London.
All of these factors drive exponential data growth. Hence the Data Lake's ease of integration, scalability and cost-effectiveness have already made it a popular choice among enterprises and governments.
The challenge is, therefore, the aggregation of this data in an efficient and cost effective manner.
Just like Global internet services or large corporations, Cities have widespread data sources, in differing formats. They also have limited budgets but a large pipeline of demands for information and services.
Not unreasonably (as the Data Lake concept came from there) some considerable thought has gone into how the Data Lake can help Government in the USA:
· "A Data Lake gives government agencies …"
· "Nearly 70% of all (US Government) agency leaders believe that in the next five years, big data will play a critical role in fulfilling mission objectives. Yet one third of that data is unstructured."
· "Often, these data sources are scattered and stovepiped. If government agencies could more effectively relate and analyze these petabytes of data, society would reap the benefits."
· IDC believes that Enterprise Data Lakes will become a core part of enterprise storage infrastructure … various organizational units will no doubt be compelled to establish enterprise-wide data lakes upon which various workloads can concurrently operate.
Every major internet player today uses Hadoop to handle massive amounts of structured and unstructured data, of the kind seen in the Smart City context; these include Facebook, Google, Twitter, LinkedIn and more. Most report aggregating petabytes of data from disparate sources and processing it daily for a myriad of tasks.
The US government and several financial houses are getting tremendous value out of it today.
Some criticisms have been raised by observers. For example, in July 2014, Gartner, in "The Data Lake Fallacy: All Water and Little Substance", warned enterprises to "Beware of the Data Lake Fallacy".
Looking into the details, the concerns raised by Gartner centre mainly on:
· The skills end users need to derive value from raw, uncurated data
· Governance and metadata management of ingested data
· Performance compared with purpose-built infrastructure
These concerns have been addressed in many government implementations: since only government data scientists will directly interact with the lake, Gartner's main concern does not apply. In fact, Gartner specifically points to data scientists as benefiting from the Data Lake concept.
A further potential challenge that Gartner tackles is performance: "Tools and data interfaces simply cannot perform at the same level against a general-purpose store as they can against optimized and purpose-built infrastructure." This statement could be challenged on its generality and its lack of regard for certain modern tools, such as in-memory computing. However, even taking it at face value, the "performance" being ignored here is the end-to-end time to get data to the user: the time it takes to move data to where it is needed, day in day out, with new data streams coming online regularly. In this respect, the architecture of the Data Lake, where data is taken in whatever the format, can enhance the speed of the overall ingestion and delivery process in many cases.
The aim of the Data Lake is not to replace operational systems such as CRM or ERP; Gartner's statement would indeed be correct if a lake attempted to emulate their functionality. The Data Lake concept is rather to sacrifice some efficiency in certain areas, such as single operational transactions, in order to gain sizeable increases in performance in others, namely data orchestration and the delivery of actionable information from across varied data sets.
The raison d'être of the Data Lake is that previous operational and business intelligence systems could not achieve this goal in a timely, resource-efficient and cost-efficient manner; otherwise the concept would never have been conceived.
Lastly, even comparing the Data Lake to traditional data warehouse technologies may not be accurate. The Data Lake's principal benefit is the aggregation of data. Some traditional tasks will then be performed on the amalgamated data, or the data can be used in its raw format, especially when it comes from business intelligence systems. The Data Lake therefore cuts across operational, archive and business intelligence systems, adding value to existing ones rather than necessarily replacing them.
Gartner nevertheless goes on to applaud the Data Lake concept in several ways, notably in pointing to data scientists and skilled analysts as its main beneficiaries.
Understanding the costs in advance is relatively simple. Hadoop is free software and it runs on commodity hardware, including that of cloud providers. It has been demonstrated to scale beyond tens of petabytes (PB). More importantly, it does so with linear performance characteristics and cost. And because Hadoop uses commodity hardware with this linear scaling, each month either costs decrease or the same price point buys more capacity. The hardware can be sourced from any current or new supplier.
Of course, free software does not mean no cost. There are operational costs incurred in managing the Hadoop system across many servers; however, as the software is common across all the machines and requires little per-machine tuning, the operational cost scales sub-linearly. Further, one of the major costs incurred is in the skilled people required to set up, maintain and derive value from data in Hadoop. While these skill sets are becoming more common over time, such specialists remain expensive and relatively rare.
While Hadoop is at the heart of the Data Lake, there are also structured storage components, such as traditional business intelligence platforms, which can support analytics requirements. These have significant cost structures of their own, but existing investments can often be leveraged.
Hadoop systems, including hardware and software, cost about $1,000 a terabyte; this represents a saving of 5x to 10x over traditional data systems.
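Taking the indicative figure of roughly $1,000 per terabyte quoted above (actual prices vary by hardware generation and supplier), the linear cost model can be sketched with a few lines of arithmetic; the 7.5x multiplier below is simply the midpoint of the quoted 5x to 10x range.

```python
# Hypothetical cost model based on the ~$1,000/TB figure cited above;
# real prices depend on hardware, provider and year.
COST_PER_TB_HADOOP = 1_000          # US dollars per terabyte
TRADITIONAL_MULTIPLIER = 7.5        # midpoint of the quoted 5x-10x saving

def hadoop_cost(terabytes):
    # Linear scaling: doubling the data roughly doubles the cost.
    return terabytes * COST_PER_TB_HADOOP

def traditional_cost(terabytes):
    # Equivalent capacity on a traditional data system, per the 5x-10x claim.
    return hadoop_cost(terabytes) * TRADITIONAL_MULTIPLIER

print(hadoop_cost(500))       # 500000
print(traditional_cost(500))  # 3750000.0
```

The practical consequence of linearity is that capacity can be budgeted incrementally: a city does not need to size (and pay for) the full system up front.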
Technical staff with specialist skills are required; however, as most major vendors now support Hadoop, these skill sets are becoming more common.
Gaining the cost savings promised by Hadoop does require larger numbers of servers than traditional database setups, with inherent increases in maintenance and data centre costs; however, the gains in overall efficiency, scalability and new insights balance these overheads. Further, they can be mitigated by running Hadoop in a Cloud / Software-as-a-Service mode, which abstracts the server requirements to the Cloud provider. Where this is not possible for legal reasons (some countries require data to remain within their borders), a private or hybrid Cloud can still be leveraged to the same positive effect.
The Data Lake is a central concept of Big Data, and its principal industry-standard implementation is Hadoop.
Overall, the Data Lake, allied with Cloud, is a significant improvement over the traditional DWH model for many use cases: it requires significantly less operational and management resourcing while powering more accurate analytics and completely new insights. It also enables applications that were not previously practical.
Andrew Rippon, Senior Consultant
· IDC – Ashish Nadkarni, Laura DuBois, "Enterprise Data Lake Platforms: Deep Storage for Big Data and Analytics", 2014
· NXN analysis
· Atos – "Data Analytics as a Service: unleashing the power of Cloud and Big Data"
· Oracle Corp.