
Can Hadoop Survive its Weird Beginning?


Hadoop has arrived as a force in the business computing landscape because it offers people the ability to store and analyze data at scale for an affordable price. But in some important ways, Hadoop is not like other efforts to commercialize open source. In addition, it is arriving at a time in which some of the traditional advantages of open source-based business models have eroded because of cloud computing and other developments.

Most people think of Hadoop as pioneering technology that is being developed through an open source ecosystem. But if we take a closer look at Hadoop’s origins and the current structure of its community, it quickly becomes clear that Hadoop is not following in the footsteps of other projects like Linux, Drupal, Alfresco, or JBoss. Hadoop started out differently and has evolved into a structure much different from that of any other project of comparable prominence.

It is time to look at these differences and ask a couple of simple questions:

  • How is Hadoop different from other open source projects?
  • Are these differences an advantage or a threat to the success of Hadoop?

In addition, Hadoop is growing up and evolving in an era in which cloud computing has accelerated the time to market and provided a way to outsource operational expertise. These developments mean that typical ways of monetizing open source, such as offering management tools, are less compelling than they were for previous generations of software.

Here is my argument in a nutshell:

  • Hadoop is neither a community-based open source project like Linux or Drupal nor a commercial open core company like Alfresco or JBoss. Instead it is a strange hybrid that has some significant disadvantages.
  • Hadoop is rising to prominence in an era of ubiquitous cloud computing and proven models for Software-as-a-Service, which dilute the power of some of the traditional models for commercializing open source.

In my view, these differences are important and represent a threat to Hadoop’s success. To put it simply: Can Hadoop survive its weird beginning?

A Simple Answer: Yes, Hadoop Will Survive

The simplest answer to this question is, yes, of course Hadoop as a project will survive. It has too much to offer to fade away.

But I think this is a glib answer that ignores the power of the threats I detail below. To be more precise, I’m not arguing that the ideas of a scalable repository for arbitrary big data and a technique for distributing processing over it are under threat. No, these capabilities are already core to the computing landscape and will only become more widely adopted.

My question is simply this: Is Hadoop Facebook or is it Friendster? Will the ultimate realization of the powerful ideas it has introduced to the world be implemented in the form that Hadoop has created, or will a new way of making those ideas work emerge?

Hadoop is Breaking New Ground in Open Source Development

I hope many Hadoop fan boys and girls are shocked, shocked, that I would even ask such questions. After the shock has worn off, I hope they retreat from the Hadoop version of the Fox News mentality and start questioning their assumptions about the market. Maybe then they will consider and evaluate the threats I mention.

I don’t think Hadoop has a clear path to victory unless the community overcomes three weaknesses that are part of the way the project has evolved.

The first weakness is that Hadoop is breaking new ground in the way its ecosystem is structured. Most open source projects that have achieved large success fall into one of the following two models:

  • The Community Model, in which a genuine community signs on to a project, usually under the leadership of a BDFL (Benevolent Dictator for Life). In this model, which can be highly commercially oriented, a process is created to govern the project, usually heavily influenced by the BDFL. Linux and Drupal are good examples among many.
  • The Open Core Model, in which a company develops a product and makes all or part of it open source. The company may then charge for support, for versions that have additional features, and for other extras. MySQL and Alfresco are examples of this. Most of the time there is no real community outside the project’s founding company. When additional development is done by a community, it is to extend the central functionality or to adapt it to new languages and the like.

Three-Headed Open Core

So the first weakness is that Hadoop doesn’t fall into either of these proven models, and it is suffering for it. An apt way to describe Hadoop in its current state is as a Three-Headed Open Core. You have Cloudera, Hortonworks, and MapR Technologies all working on making Hadoop better, but each pursuing its own strategy for commercializing it. All three companies offer professional services, education, and a distribution that can be purchased on a subscription. Cloudera is focused on creating a Hadoop management suite. MapR is more focused on fixing what it sees as fundamental weaknesses in core Hadoop capabilities with proprietary replacements and standards-based extensions. Hortonworks is the open source purist that wants to be the Red Hat of Hadoop.

There is a BDFL-type person in the Hadoop ecosystem, Doug Cutting, who works at Cloudera, but he doesn’t play the same role that Linus Torvalds does for Linux or Dries Buytaert does for Drupal. He is respected, to be sure, but he is not guiding development priorities with a firm hand.

This three-headed model is producing all sorts of friction that is mostly discussed within the community but does occasionally burst into public view. For example, the companies argue about who has the most committers on the core project or who has contributed the most lines of code.

The problem is that the Hadoop community, even with the mature and excellent Apache process, is far from having a smooth-running governance process, something that is hard to create even in the best of conditions.

It is hard to measure how much this lack of focus is slowing Hadoop down, but the fact that MapR has introduced innovations to improve the core of Hadoop while Cloudera is working on extending Hadoop to better support real-time analysis is a sign that the community is not moving forward in lockstep.

Hadoop Needs To Evolve

The second weakness is that Hadoop needs to evolve to succeed. The good news is that despite the divergent approaches of the different open core heads, there is really one set of Hadoop APIs that all three heads agree on (for now) and that is the foundation of a thriving ecosystem of projects. This agreement provides some degree of portability across the three distributions. In this regard, Hadoop is a model of unity compared to the variation found in the multitude of approaches to NoSQL databases.

But in my view, this common set of APIs needs to evolve for Hadoop to better serve its market. Of course, every open source project and every commercial product must evolve. But the need for Hadoop to evolve is perhaps more urgent because it was not created to meet the needs of the market it now finds itself in. Hadoop was created to help crunch massive web crawls into indexes to support search.

Hadoop found a new purpose in the world of business as massive amounts of data from all sorts of sources started showing up. This data wasn’t all structured. It arrived in massive volumes. You couldn’t move it, so you needed to move the computing to the data, not the other way around.

Hadoop became a massive data lake that you could toss data into, knowing it would be there when you were ready, a lake powered by inexpensive commodity hardware that could scale massively.

Unlike business intelligence systems using data warehouses, you did not have to know how you were going to analyze the data when it arrived. You just stored it and figured that out later.

But now Hadoop is smack in the middle of the world of Business Intelligence. It must evolve from a batch infrastructure to better support application development. (Hadoop is not alone in making such a transition. Splunk has evolved from being a tool for IT operations to having many applications in business intelligence.)

Here are a few areas in which Hadoop must evolve:

  • MapReduce, for all its strengths, is not easy for the masses to use (see the sketch after this list).
  • HDFS, the Hadoop Distributed File System, was not created to support wide use of the data inside it by existing techniques.
  • HDFS was written in Java as a matter of convenience. File systems that have to deliver high performance are not written in Java. Now that HDFS is being used in new ways, it must become faster.
  • Hadoop was not created to be highly available. It has many brittle areas, such as management of its namespace, compared to the best enterprise software.
  • Hadoop requires you to run and manage a huge computing grid.
  • Hadoop does not support or integrate with the standards that several generations of business intelligence software rely on.
  • Hadoop does not provide data protection in the form of snapshots and mirrors, features required of comparable systems.
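
To make the first point concrete, here is the classic word count written against Hadoop’s Java MapReduce API, a minimal sketch in the style of the standard Hadoop tutorial (the input and output paths come from the command line; everything else is illustrative). Counting words is about the simplest analysis imaginable, and it still takes a screenful of framework ceremony, which is why raw MapReduce is out of reach for most business analysts.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The map step: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // The reduce step: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```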

The to-do list for Hadoop is long and complicated. There are many projects addressing urgent needs. But Hadoop is still bound by the fact that its roots were not put down to serve its current market.

The urgent need for evolution increases the damaging effects of the three-headed open core model. If there were only one company in charge, I suspect that Hadoop would be fixing these problems faster. If the rumors about acquisitions become a reality, will the new owners of Cloudera, MapR, and Hortonworks get along better or worse than the companies do now as independents? Will a community like Linux’s develop, or will there be massive forking? The largest community of business buyers will want to know. Please don’t argue that such uncertainty makes Hadoop more attractive.

Is the Hadoop Ecosystem Funded in the Wrong Way?

The third weakness is that the amount of funding being directed toward Hadoop development is too small and needs to be supplemented by the companies that are benefiting from Hadoop.

It is easy to see how the Three-Headed Open Core model is diluting investment in Hadoop. The three companies are competing with each other, sometimes independently developing the same capabilities, such as management tools.

But for an open source project to succeed, the success of the applications built on top of it, or the use to which the project is put, must provide funding for further development. In a community project, the companies that benefit provide direct resources. Linux long ago ceased to be a community of individuals; it is rather a community of developers who are paid to work on it by companies like IBM, HP, and Oracle. Drupal has a large community of independent individuals, but it also has a core of people creating it who make their living from putting Drupal to work in one way or another.

Ideally, the companies benefiting and making big bucks from using Hadoop should provide development resources. Where are those companies? I’m told that some major users of Hadoop don’t contribute their changes back to the community.

The funding for Hadoop is too separated from its use. Red Hat could never have supported the development of Linux on its own. In addition, successful applications have been slow to emerge and catch on. Mike Olson, CEO of Cloudera, has complained about this in public. This is a fact of life for many open source projects, but it may be a bigger problem for Hadoop, which faces competition both from within the project and from new solutions outside it.

Hadoop Has Lots of Competition

The weaknesses mentioned so far are all the more problematic because Hadoop has significant competition and also faces pressure from cloud offerings.

First of all, the big data landscape has lots of entrants, each with their own sweet spot. Consider this list:

  • Google BigQuery
  • Splunk
  • MPP SQL databases such as EMC Greenplum, Teradata Aster, and Sybase IQ
  • NoSQL databases like Couchbase, Basho Riak, and many others
  • In-memory databases like SAP HANA, Scaleout, and Kognitio

These are just a few highlights. The full list would include dozens of products including the offerings from IBM, Oracle, and SAP. Many of these technologies can work with Hadoop as well as compete with Hadoop.

Do these all do exactly what Hadoop does? No, each has a sweet spot. As data volumes grow and the data becomes more varied, Hadoop becomes more attractive. But these competitors will pick off lots of use cases and customers.

How the Cloud Hurts Management Capabilities

Second, imagine if Red Hat were starting today and, instead of having to buy its distribution and management suite, you could buy access to cloud computing running Linux. Red Hat would have a smaller market.

The power of selling Hadoop management tools is reduced in the same way. Instead of having to worry about a management layer specifically for Hadoop, you can let Amazon do that and use Amazon Elastic MapReduce. The fact that the distributions can be offered through clouds doesn’t make them less vulnerable to this sort of competition.

So Hadoop-based companies can expect to make less money on management tools than if they had arrived before the cloud.

In addition, Hadoop is already being abstracted into the background. Companies like Alpine Data Labs and Teradata Aster, and open source projects like Cascalog, are allowing queries and transformations to be expressed in higher-level, simpler, and more productive ways. It is not hard to imagine these and other approaches catching on and becoming preferable to raw MapReduce programming. In that case, Hadoop could become one of many engines for big data processing working under the hood of these abstractions.
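
To see the gap these abstractions aim to close, consider the word count from the earlier sketch written as an ordinary single-machine Java program (again just an illustration: this version runs on one node with no distribution at all). The analytical logic fits in a dozen lines; everything else in the MapReduce version is framework ceremony, and it is exactly that ceremony the higher-level query layers try to hide while still distributing the work.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

// The same word count as the MapReduce sketch, minus the framework:
// read a file, split lines into words, tally the words in a map.
public class LocalWordCount {
  public static void main(String[] args) throws Exception {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      for (String word : line.split("\\s+")) {
        if (word.isEmpty()) continue;
        Integer n = counts.get(word);
        counts.put(word, n == null ? 1 : n + 1);
      }
    }
    in.close();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      System.out.println(e.getKey() + "\t" + e.getValue());
    }
  }
}
```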

What’s at Stake?

Hadoop has made its point. The operating system needs a new capability, the ability to store huge amounts of data on commodity computers and then bring the compute to the data.

But now the question becomes different: Is the Hadoop Apache Project with its three heads the only way to create that capability?

Remember, the biggest market, the vast mainstream IT market, has yet to adopt Hadoop. These buyers don’t care what’s fashionable in Silicon Valley. They want solutions to their business problems. They are willing to listen to any and all alternatives.

To make the case to these buyers, in my view, Hadoop has to overcome the weaknesses outlined above, get out of its own way, and get better and more powerful much faster.

If Hadoop doesn’t do that, I believe it will move so slowly that it will become like Friendster, a great start but a weak finisher.

Follow Dan Woods on Twitter

Dan Woods is CTO and editor of CITO Research, a publication that seeks to advance the craft of technology leadership. For more stories like this one visit www.CITOResearch.com. Dan has performed research and writing projects for Splunk, SAP, and Oracle.