Hadoop’s Early Years
Ah, 'tis not what in youth we dreamed 'twould be!
– Growing Old, Matthew Arnold
Hadoop was conceived in 2004, when two Google engineers published a paper describing MapReduce, a functional programming algorithm, paired with a file system that could be distributed across a cluster of commodity servers. Hadoop began life as an open source implementation of those ideas.
By 2007 Hadoop had been baptized into the Apache Software Foundation, and within a couple of years it was powering some of the world's largest websites. These pioneers of the big data market were information-centric web companies that had the engineering talent to manage a cluster and, in most cases, derived real utility from closing the loop with their production systems.
But is Hadoop really ready for the enterprise? Early adopters aside, can Hadoop grow up and reach the level of production readiness that would truly make it the de facto standard?
Hadoop Maturity
[Illustration by Michael Capozzola]
As mainstream enterprises reach the end of their proof-of-concept trials, our due diligence has frequently uncovered a growing realization that Hadoop is still immature. The promise of linearly scaling costs is often marred by the reality of a technology that still lacks the characteristics necessary for enterprise adoption. Commonly cited obstacles encountered as enterprises build out their Hadoop clusters include:
- Inaccessible to analysts without programming ability – while Pig has simplified common tasks such as loading data and expressing transforms, it remains better suited to developers than to market analysts. The employees who know which questions to ask lack the tools to reach the answers.
- Hadoop gave up ACID and lost his integrity – clusters keep no record of who changed which record, or when it was changed. Enterprises are also realizing that storage features they have long depended on, such as snapshots and mirroring, are missing from HDFS.
- Incompatibility with existing tools – NFS is a file system interface long taken for granted, yet HDFS cannot simply be mounted, so existing applications cannot read or write Hadoop's data in place.
- Enterprises thrive on structure – it is a common myth that big data needs no structure imposed on it. Without structure, though, extracting insight is hard: unstructured data has limited value, and applying structure at query time requires a lot of Java code (see the sketch after this list).
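To make the last two obstacles concrete, here is a minimal sketch of the Java needed just to ask "how many events per user?" of raw log files. It assumes a hypothetical tab-delimited log format with the user id in the second field, and uses the org.apache.hadoop.mapreduce API; the point is that the schema exists nowhere but in the mapper's parsing code, whereas a relational warehouse would answer the same question with a one-line GROUP BY.

```java
// Minimal sketch: impose structure on raw text at query time.
// Hypothetical input: tab-delimited lines with the user id in field 1.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EventsPerUser {

    // The "schema" lives here, in code: split each raw line and hope
    // field 1 really is the user id. Malformed lines are silently dropped.
    public static class ParseMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text user = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > 1) {          // structure imposed at read time
                user.set(fields[1]);
                context.write(user, ONE);
            }
        }
    }

    // Sum the per-user counts emitted by the mappers.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "events-per-user");
        job.setJarByClass(EventsPerUser.class);
        job.setMapperClass(ParseMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Every analyst question of this shape means another class like this one, compiled, packaged, and submitted to the cluster; hardly a workflow for the market analyst who knows which question to ask.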
Hadoop in Real Time
Furthermore, Hadoop's batch-oriented processing model has always been at odds with the industry's progression toward real-time communication. The new world of business intelligence demands real-time metrics on everything from inventory to consumer demand.
There is a host of new technologies that marry big data with real-time analysis. Twitter recently open-sourced Storm, which continuously queries streams of data and updates clients in real time. A Storm cluster is superficially similar to a Hadoop cluster but operates on incoming streams rather than data at rest. Once enterprises have mined their historical data for insight, the next logical step will be to analyze the present.
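For contrast, here is a minimal sketch of a Storm topology using the backtype.storm API that Twitter open-sourced. The word-emitting spout is a hypothetical stand-in for a real feed such as a message queue; the point is that counts update tuple by tuple as data arrives, not when a nightly batch completes.

```java
// Minimal Storm sketch: a spout emits words, a bolt keeps running counts.
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class StreamingCounts {

    // Hypothetical stand-in for a live feed (e.g. a message queue):
    // emits one word at a time, forever.
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();
        private static final String[] WORDS = {"storm", "hadoop", "stream"};

        @Override
        public void open(Map conf, TopologyContext context,
                SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(WORDS[random.nextInt(WORDS.length)]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Counts are updated tuple by tuple: no batch job, no scheduled run.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts =
                new HashMap<String, Integer>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getStringByField("word");
            Integer count = counts.get(word);
            counts.put(word, count == null ? 1 : count + 1);
            collector.emit(new Values(word, counts.get(word)));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout(), 1);
        builder.setBolt("counts", new CountBolt(), 2)
               .fieldsGrouping("words", new Fields("word"));

        // Run in-process for the sketch; let the stream flow briefly.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("streaming-counts", new Config(),
                builder.createTopology());
        Utils.sleep(10000);
        cluster.shutdown();
    }
}
```

Running in LocalCluster keeps the example self-contained; deploying to a real cluster swaps in StormSubmitter, while the spout and bolt code stays unchanged.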
It’s hard to see how Hadoop can simultaneously mature into a buttoned-down enterprise product and adapt to a real-time, agile world. But Hadoop’s lumbering nature could lead to his extinction if he doesn’t get into shape soon.