Couchdoop: Couchbase Meets Apache Hadoop

Sneak Peek:

Couchdoop is a Couchbase connector for Apache Hadoop, developed by Avira on CDH, that allows for easy, parallel data transfer between Couchbase and Hadoop storage engines. It includes a command-line tool, for simple tasks and prototyping, as well as a MapReduce library, for those who want to use Couchdoop directly in MapReduce jobs. Couchdoop works natively with CDH 5.x.
Couchdoop can help you:

  • Import documents from Couchbase to Hadoop storage (HDFS or Apache HBase)
  • Export documents from Hadoop storage to Couchbase
  • Batch-update existing Couchbase documents
  • Query Couchbase views to import only specific documents (for example, for daily imports)
  • Easily control performance by adjusting the degree of parallelism via MapReduce

In the remainder of this post, you’ll learn the main features of Couchdoop and explore a demo application.

Why Couchdoop?

In many Big Data applications, data is transferred from an “operational” tier containing a key-value store to an “analytical” tier containing Hadoop via Apache Flume or a queuing service such as Apache Kafka or RabbitMQ. However, this approach is not always possible or efficient, for example when the events themselves are closely related (like a shopping session with several clicks and views) and could be conveniently grouped before being pushed to Hadoop. In cases where Couchbase serves as the operational tier, Couchdoop’s import feature comes in handy. Conversely, you can use Couchdoop’s export feature to move data computed with Hadoop into Couchbase for use in real-time applications.
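
To make the grouping idea concrete, here is a minimal sketch in plain Python of how related events could be collected into one document per session before being moved to Hadoop. It does not use Couchdoop or the Couchbase SDK; the field names (`session_id`, `type`, `item`), the `session::` key prefix, and the function `group_events_by_session` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_events_by_session(events):
    """Group raw events into one document per session, keeping arrival order."""
    sessions = defaultdict(list)
    for event in events:
        # Hypothetical event layout: each event carries the ID of its session.
        sessions[event["session_id"]].append(
            {"type": event["type"], "item": event["item"]}
        )
    # One key-value document per session, as it might sit in the operational tier.
    return {f"session::{sid}": {"events": evts} for sid, evts in sessions.items()}


# Illustrative sample data: two sessions with interleaved events.
events = [
    {"session_id": "s1", "type": "view", "item": "shoes"},
    {"session_id": "s2", "type": "view", "item": "hat"},
    {"session_id": "s1", "type": "click", "item": "shoes"},
]
documents = group_events_by_session(events)
```

With events grouped this way, a bulk import would move one document per session rather than one message per click or view.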

The data collected by the operational tier can be imported into the analytical tier, where it is traditionally stored in HDFS. Using the tools provided by CDH, the data can then be processed and enhanced for various use cases. One use case is ad hoc querying, which allows business people to query the data in real time using Impala. Another use case is improving user experience by using machine-learning algorithms to adapt the application to users’ needs. For this use case, both MapReduce and Apache Spark, which are included in CDH, can be used. (Spark comes with its own machine-learning library, MLlib.) Apache Mahout offers time-tested algorithms written in MapReduce as well as newer and faster implementations written in Spark. The outcome of the machine-learning algorithms can be exported to the operational tier using Couchdoop.

Read the whole article here.

Avira, a company with over 100 million customers and more than 500 employees, is a worldwide leading supplier of self-developed security solutions for professional and private use. With more than 25 years of experience, the company is a pioneer in its field.