Dec 7, 2012 6:30 AM

Your Own Private Google: The Quest for an Open Source Search Engine

Google built its web empire on commodity hardware. Rather than invest in powerful, expensive supercomputers, it harnessed the collective power of tens of thousands of ordinary servers, cutting costs but also finding a better way to deal with hardware failure. This was at the heart of the company's genius, and now, so many others are following suit.

Google created many custom software platforms that take advantage of its massive server farms, and it made a habit of publishing academic papers that detail these innovations. That has led to a proliferation of open source clones that operate in much the same way. These include file systems for storing data, and processing platforms for crunching all that data.

But what about the most famous Google innovation, the one that has used its sweeping server farms to the greatest effect? What about Google search?

The company guards its search platform like the crown jewels. It's not about to release a paper describing how it all works, so producing an open source clone is more difficult. But there are options, and the push toward open source versions of the Google search engine has gathered some steam in recent months, with the arrival of a new company called ElasticSearch.

These projects aren't trying to compete with Google's public search engine -- the one you use every day. They're trying to compete with Google's search appliance and other products that help enterprises -- i.e., big businesses -- find stuff inside their own private networks.

Ayan Barua -- co-founder of Credii, a startup that helps businesses select software and services -- says: "Open source has hardly made a splash in the enterprise." But he does believe that ElasticSearch and a similar outfit, LucidWorks, are gaining mindshare among developers.

In 1999, a man named Doug Cutting started work on a project called Lucene, which was meant to provide an open source alternative to the Google and Yahoo search engines. Eventually, Cutting shifted his focus to Hadoop -- a clone of Google's MapReduce number-crunching platform that was originally designed to underpin Lucene -- but Lucene lives on.

"It's probably the most advanced library out there today -- open source or not," says Shay Banon, the founder of ElasticSearch, which oversees an open source search engine based on Lucene.

Lucene, you see, is a software library. It provides the basic building blocks for a search engine. You still need to build the search engine itself, and over the years, the library has given rise to two major projects that seek to provide a search engine capable of scaling out across hundreds or thousands of servers. First there was Solr. And now, there's ElasticSearch.
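To make "building blocks" a little more concrete: the core data structure behind a library like Lucene is the inverted index, which maps each term to the documents that contain it. The sketch below is a toy illustration in Python, not Lucene's API or implementation. The function names (`tokenize`, `build_index`, `search`) and the sample documents are hypothetical, and the whitespace tokenizer is a deliberate simplification.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately crude analyzer)."""
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    # Start from the first term's postings and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical sample documents, for illustration only.
docs = {
    1: "open source search engine",
    2: "enterprise search appliance",
    3: "open source file system",
}
index = build_index(docs)
print(search(index, "open source"))
print(search(index, "search"))
```

Everything a real search server adds on top (relevance ranking, a query language, replication across machines) is the part that projects like Solr and ElasticSearch have to supply themselves.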

Solr was created in 2004 by a CNET developer named Yonik Seeley. The online publisher had been using a custom search service from AltaVista to power its site search features, but the service was being discontinued after AltaVista was acquired by a competitor. Seeley says the company solicited bids for another commercial solution, but because CNET is spread out over so many different properties, the bids all seemed too high.

So, the IT team decided to develop their own search solution based on the open source database MySQL. But they also wanted to have a "Plan B" based on Lucene, in case they ever needed a more sophisticated search solution. Seeley was hired to work on the search team, and he built this "Plan B."

Originally called SOLAR -- for Search On Lucene And Resin -- Solr was eventually adopted throughout CNET, according to Seeley. It was open sourced in 2006. In 2007, Seeley left CNET to co-found LucidWorks (originally called Lucid Imagination), a company that sells tools based on Solr.

ElasticSearch arrived in 2010, but it wasn't until this year that its creator, Banon, founded a company that seeks to commercialize the code. ElasticSearch raised its first round of funding, $10 million from Benchmark Capital, just last month.

The difference from Solr is that ElasticSearch was specifically designed to scale across hundreds of thousands of servers -- Google-style. Banon built his first open source search server, Compass, using Lucene in 2004, but ElasticSearch is a different animal. He says he originally thought about scaling Solr to many servers in this way, but he soon decided that it would be better to start over and build another search server from scratch -- while still using Lucene as a starting point.

Solr Goes Big

But while Banon was building ElasticSearch and starting his own company, the team behind Solr was doing what he thought was impossible: They were scaling Solr to thousands of machines. It was over two years in the making, but Solr 4.0, released in October, finally makes this feasible. The team says that in building the platform, it leaned on white papers published by Google and Amazon, and it uses ZooKeeper, a tool for managing server clusters that dovetails with Hadoop.

Seeley says one of the biggest challenges was testing the thing. To do so, the team borrowed a concept from Netflix: the "chaos monkey." The idea is to shut down servers that are part of the cluster at random to see how the system handles failures. Although they didn't actually use Netflix's own open source version, Seeley says they took the concept to heart.
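The chaos monkey idea is simple enough to sketch in a few lines. The toy simulation below is neither Netflix's tool nor the Solr team's test harness. It assumes a hypothetical cluster in which each shard of the index is copied to several nodes, kills a few nodes at random, and reports any shard left with no live copy. All names here (`place_replicas`, `lost_shards`, `chaos_round`, the `node0`-style labels) are made up for illustration.

```python
import random


def place_replicas(num_shards, nodes, replicas):
    """Assign each shard to `replicas` distinct nodes, round-robin.

    Assumes replicas <= len(nodes), so the chosen nodes are distinct.
    """
    return {
        shard: {nodes[(shard + r) % len(nodes)] for r in range(replicas)}
        for shard in range(num_shards)
    }


def lost_shards(placement, dead):
    """Return the shards whose every replica sits on a dead node."""
    return sorted(s for s, holders in placement.items() if holders <= dead)


def chaos_round(placement, nodes, kills, rng):
    """Kill `kills` randomly chosen nodes and report which shards were lost."""
    dead = set(rng.sample(nodes, kills))
    return dead, lost_shards(placement, dead)


# Hypothetical cluster: 6 nodes, 12 shards, 3 copies of each shard.
nodes = [f"node{i}" for i in range(6)]
placement = place_replicas(num_shards=12, nodes=nodes, replicas=3)

rng = random.Random(42)  # seeded so that a run can be repeated
for _ in range(5):
    dead, lost = chaos_round(placement, nodes, kills=2, rng=rng)
    print(sorted(dead), "lost shards:", lost)
```

With three copies of every shard, losing any two nodes never loses data in this model. Raise `kills` to 3 and some rounds will. A real test would also have to check that queries keep returning correct results while nodes are down, which this sketch does not attempt.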

Although Solr and ElasticSearch are coming closer together in terms of what they do, Banon doesn't see the two companies as competitors. While LucidWorks is taking on proprietary enterprise search vendors like the struggling HP acquisition Autonomy, Banon says that ElasticSearch isn't interested in that market.

"We're not doing enterprise search in the traditional sense," he says. "We're not going to index SharePoint documents." Instead, he says, it's a way for developers and IT teams to comb through server logs and archives of more complex data.

LucidWorks CEO Paul Doscher doesn't see the distinction. He does see the companies as competitors, saying that many companies are evaluating -- or have evaluated -- both.

Ayan Barua says that while LucidWorks is the incumbent with a solid reputation, ElasticSearch is coming on fast. And now that it has a commercial support structure, it will probably make gains more quickly. It's also more proven in environments spread across many servers.

The rub is that the open source community around ElasticSearch is pretty small. Banon is still by far the biggest contributor to the project. LucidWorks engineers are likewise the biggest contributors to Solr, but there's a real community of developers contributing from outside the company. Still, Banon says ElasticSearch is catching up. The company has hired some of the most active developers in the larger ElasticSearch ecosystem, and he says anyone is welcome to submit improvements to the ElasticSearch code base.

Regardless of concerns about code contributions, Barua says, a number of companies have switched from Solr to ElasticSearch, including Foursquare.

Yes, both of these open source challengers are years behind Google's public search engine. But the point is that they're open source. Google is not.