Unfortunately, the restore tool from Couchbase failed and so did the automatic data recovery, but at least our backups worked (of course!). While waiting for the backups to be restored, we started thinking about what we could have done if the backups had also failed. That is how we came up with the idea of restoring data directly from vBuckets.
What are vBuckets anyway? Every key in the database is assigned to a vBucket, and multiple vBuckets reside on each server. In theory, a vBucket is defined as the “owner” of a subset of the key space of a Couchbase Server cluster (for more detailed information, check out this paper). In practice, vBuckets are a bunch of files that contain your data. These files live in a directory named after the bucket, inside the document path. If I list the files of one of our buckets, I get a bunch of files named “%d.couch.%d” (i.e. a number, followed by a dot, the word ‘couch’ and another number). These ‘magic’ files contain all your data.
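If you just want to see what you are dealing with, a small sketch like the one below lists those vBucket files from Python. The data path is an assumption (the default location on a Debian install) and the bucket name is a placeholder, so adjust both to your setup:

#!/usr/bin/python
# Sketch: list the "%d.couch.%d" vBucket files of a bucket.
# BUCKET_PATH below is an assumption (default data path on a Debian install
# plus a made-up bucket name); change it to match your installation.
import os
import re

BUCKET_PATH = '/opt/couchbase/var/lib/couchbase/data/bucket'
VBUCKET_RE = re.compile(r'^\d+\.couch\.\d+$')

vbucket_files = sorted(f for f in os.listdir(BUCKET_PATH)
                       if VBUCKET_RE.match(f))
print len(vbucket_files), 'vBucket files found'
for f in vbucket_files:
    print os.path.join(BUCKET_PATH, f)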
Dump the data
Unfortunately, the data is stored in a binary format inside the vBuckets. Fortunately, Couchbase provides a small utility program called couch_dbdump (if you’ve installed the Debian package, its full path is /opt/couchbase/bin/couch_dbdump). Let’s look over the output (in this example, we’ve changed the actual Json document):
$ /opt/couchbase/bin/couch_dbdump 986.couch.11
Doc seq: 1873
id: offers::3bb6d87e0acfbd1431e33955a8c068d3ad967a8e
rev: 1
content_meta: 128
cas: 10486229639083020, expiry: 0, flags: 0
data: (snappy) {"document-type": "json"}
( repeated thousands of times )
Let’s go ahead and do this for all the vBuckets on a Couchbase server:
$ for f in $BUCKET_PATH/*.couch.*;
do
/opt/couchbase/bin/couch_dbdump $f > $DUMP_DIR/`basename $f`;
done
The above snippet goes through the bucket files found in the $BUCKET_PATH directory and dumps their contents into multiple files in $DUMP_DIR. The result is the same number of files, with the same names as in the bucket, but in the human-readable format of couch_dbdump.
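Before going further, it is worth sanity-checking the dumps. A quick sketch like the one below counts the records in the dump files (DUMP_DIR is a placeholder for whatever $DUMP_DIR you used above), so you can compare the total against the item count your bucket is supposed to have:

#!/usr/bin/python
# Sketch: count how many documents ended up in the dump files.
# DUMP_DIR is a placeholder; point it at the directory used above.
import os

DUMP_DIR = '/tmp/cb_dumps'
total = 0
for name in os.listdir(DUMP_DIR):
    with open(os.path.join(DUMP_DIR, name)) as dump:
        for line in dump:
            if line.startswith('Doc seq:'):
                total += 1
print 'Total documents dumped:', total

If the total is far below what you expect, some vBucket files are probably missing or damaged.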
Import the data
What if we were to write a script to reinsert the documents in a new (or flushed) bucket? Let’s use Python:
#!/usr/bin/python
from couchbase import Couchbase
from sys import argv
from json import loads


def process(doc, cb):
    # Each record follows the couch_dbdump layout shown above:
    # id / rev / content_meta / "cas: ..., expiry: ..., flags: ..." / data
    fields = [f.strip() for f in doc.split('\n') if f.strip() != '']
    key = None
    data = {}
    expiry = 0
    for f in fields:
        if f.startswith('id: '):
            key = f.split('id: ', 1)[1]
        if f.startswith('cas: ') and 'expiry: ' in f:
            expiry = int(f.split('expiry:', 1)[1].split(',', 1)[0])
        if f.startswith('data: '):
            try:
                # keep only the Json part, after the "(snappy)" marker
                data_j = f.split('data:', 1)[1]
                data_j = '{' + data_j.split('{', 1)[1]
                data = loads(data_j)
            except (ValueError, IndexError):
                print 'Could not load data for', doc
                data = {}
    if key is None:
        return
    try:
        cb.set(key=key, value=data, ttl=expiry)
    except Exception:
        print 'Could not set to CB for', doc


if __name__ == '__main__':
    # argv[1] is the path to one couch_dbdump output file
    f = open(argv[1])
    txt = f.read()
    f.close()
    # separate docs
    docs = txt.split('Doc seq:')
    # first one's always empty
    docs.pop(0)
    cb = Couchbase.connect(host='localhost', bucket='bucket',
                           password='swordfish')
    for d in docs:
        process(d, cb)
Please note that you need to have the Couchbase Python client library (the couchbase module) installed for the script to work.
The above script parses the couch_dbdump output and puts the documents into Couchbase. Important: only the key, the actual Json document and the expiry are preserved!
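The script above falls back to an empty document when the Json body cannot be decoded. If you would rather skip those records entirely than overwrite a key with {}, a variant along these lines works (just a sketch, not what we actually ran):

#!/usr/bin/python
# Sketch of a stricter process(): gather key, expiry and data first and only
# write the document if both the key and the Json body were recovered.
from json import loads


def process_strict(doc, cb):
    key, data, expiry = None, None, 0
    for line in doc.split('\n'):
        line = line.strip()
        if line.startswith('id: '):
            key = line.split('id: ', 1)[1]
        elif line.startswith('cas: ') and 'expiry: ' in line:
            expiry = int(line.split('expiry:', 1)[1].split(',', 1)[0])
        elif line.startswith('data: '):
            try:
                data = loads('{' + line.split('{', 1)[1])
            except (ValueError, IndexError):
                data = None
    if key and data is not None:
        cb.set(key=key, value=data, ttl=expiry)
    else:
        print 'Skipping incomplete record, key =', key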
Now, let’s run the recovery script on every dumped file using the power of bash:
#!/bin/bash
for f in $DUMP_DIR/*couch*
do
./cb_recovery.py $f &
sleep 1
done
The above snippet runs an instance of the script for each dump file in the background (this is what the & does; if for some reason you want to do it serially and slowly, remove the &). Hopefully, after several minutes or hours, your data should be back in the DB.
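Once the scripts have finished, it does not hurt to spot-check a few keys. A minimal sketch, assuming the same placeholder connection details as in the recovery script (the key below is made up to match the example output earlier, so use keys you know exist in your data set):

#!/usr/bin/python
# Spot-check a few restored documents. The key and connection details are
# placeholders; replace them with real keys and your own bucket credentials.
from couchbase import Couchbase

cb = Couchbase.connect(host='localhost', bucket='bucket',
                       password='swordfish')
for key in ['offers::3bb6d87e0acfbd1431e33955a8c068d3ad967a8e']:
    try:
        result = cb.get(key)
        print key, '->', result.value
    except Exception as e:
        print 'Missing or unreadable:', key, e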
