Unfortunately, the restore tool from Couchbase failed and so did the automatic data recovery, but at least our backups worked (of course!). While waiting for the backups to be restored, we started thinking about what we could have done if the backups had also failed. That is how we came up with the idea of restoring data directly from vBuckets.
What are vBuckets anyway? Every key in the database is assigned to a vBucket, and multiple vBuckets reside on each server. In theory, a vBucket is defined as the “owner” of a subset of the key space of a Couchbase Server cluster (for more detailed information, check out this paper). In practice, vBuckets are a bunch of files that contain your data. These files live in a directory named after the bucket, inside the document path. If I list the files of one of our buckets, I get a bunch of files named “%d.couch.%d” (i.e. a number, followed by a dot, the word ‘couch’ and another number). These ‘magic’ files contain all your data.
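If you just want to see what you are dealing with, a small sketch like the one below lists those vBucket files from Python. The data path is an assumption (the default location on a Debian install) and the bucket name is a placeholder, so adjust both to your setup:

#!/usr/bin/python
# Sketch: list the "%d.couch.%d" vBucket files of a bucket.
# BUCKET_PATH below is an assumption (default data path on a Debian install
# plus a made-up bucket name); change it to match your installation.
import os
import re

BUCKET_PATH = '/opt/couchbase/var/lib/couchbase/data/bucket'
VBUCKET_RE = re.compile(r'^\d+\.couch\.\d+$')

vbucket_files = sorted(f for f in os.listdir(BUCKET_PATH)
                       if VBUCKET_RE.match(f))
print len(vbucket_files), 'vBucket files found'
for f in vbucket_files:
    print os.path.join(BUCKET_PATH, f)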
Dump the data
Unfortunately, the data is stored in a binary format inside the vBuckets. Fortunately, Couchbase provides a small utility program called couch_dbdump (if you’ve installed the Debian package, its full path is /opt/couchbase/bin/couch_dbdump). Let’s look over the output (in this example, we’ve changed the actual Json document):
$ /opt/couchbase/bin/couch_dbdump 986.couch.11
Doc seq: 1873
id: offers::3bb6d87e0acfbd1431e33955a8c068d3ad967a8e
rev: 1
content_meta: 128
cas: 10486229639083020, expiry: 0, flags: 0
data: (snappy) {"document-type": "json"}
( repeated thousands of times )
Let’s go ahead and do this for all the vBuckets on a Couchbase server:
$ for f in $BUCKET_PATH/*.couch.*;
do
/opt/couchbase/bin/couch_dbdump $f > $DUMP_DIR/`basename $f`;
done
The above snippet goes through the bucket files found in the $BUCKET_PATH directory and dumps their contents into multiple files in $DUMP_DIR. The result is the same number of files, with the same names as in the bucket, but in the human-readable format of couch_dbdump.
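Before going further, it is worth sanity-checking the dumps. A quick sketch like the one below counts the records in the dump files (DUMP_DIR is a placeholder for whatever $DUMP_DIR you used above), so you can compare the total against the item count your bucket is supposed to have:

#!/usr/bin/python
# Sketch: count how many documents ended up in the dump files.
# DUMP_DIR is a placeholder; point it at the directory used above.
import os

DUMP_DIR = '/tmp/cb_dumps'
total = 0
for name in os.listdir(DUMP_DIR):
    with open(os.path.join(DUMP_DIR, name)) as dump:
        for line in dump:
            if line.startswith('Doc seq:'):
                total += 1
print 'Total documents dumped:', total

If the total is far below what you expect, some vBucket files are probably missing or damaged.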
Import the data
What if we were to write a script to reinsert the documents in a new (or flushed) bucket? Let’s use Python:
#!/usr/bin/python
from couchbase import Couchbase
from sys import argv
from json import loads


def process(doc, cb):
    # Each record follows the couch_dbdump layout shown above:
    # id / rev / content_meta / "cas: ..., expiry: ..., flags: ..." / data
    fields = [f.strip() for f in doc.split('\n') if f.strip() != '']
    key = None
    data = {}
    expiry = 0
    for f in fields:
        if f.startswith('id: '):
            key = f.split('id: ', 1)[1]
        if f.startswith('cas: ') and 'expiry: ' in f:
            expiry = int(f.split('expiry:', 1)[1].split(',', 1)[0])
        if f.startswith('data: '):
            try:
                # keep only the Json part, after the "(snappy)" marker
                data_j = f.split('data:', 1)[1]
                data_j = '{' + data_j.split('{', 1)[1]
                data = loads(data_j)
            except (ValueError, IndexError):
                print 'Could not load data for', doc
                data = {}
    if key is None:
        return
    try:
        cb.set(key=key, value=data, ttl=expiry)
    except Exception:
        print 'Could not set to CB for', doc


if __name__ == '__main__':
    # argv[1] is the path to one couch_dbdump output file
    f = open(argv[1])
    txt = f.read()
    f.close()
    # separate docs
    docs = txt.split('Doc seq:')
    # first one's always empty
    docs.pop(0)
    cb = Couchbase.connect(host='localhost', bucket='bucket',
                           password='swordfish')
    for d in docs:
        process(d, cb)
Please note that you need to have the Couchbase Python client library (the couchbase module) installed for the script to work.
The above script parses the couch_dbdump output and puts the documents into Couchbase. Important: only the key, the actual Json document and the expiry are preserved!
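The script above falls back to an empty document when the Json body cannot be decoded. If you would rather skip those records entirely than overwrite a key with {}, a variant along these lines works (just a sketch, not what we actually ran):

#!/usr/bin/python
# Sketch of a stricter process(): gather key, expiry and data first and only
# write the document if both the key and the Json body were recovered.
from json import loads


def process_strict(doc, cb):
    key, data, expiry = None, None, 0
    for line in doc.split('\n'):
        line = line.strip()
        if line.startswith('id: '):
            key = line.split('id: ', 1)[1]
        elif line.startswith('cas: ') and 'expiry: ' in line:
            expiry = int(line.split('expiry:', 1)[1].split(',', 1)[0])
        elif line.startswith('data: '):
            try:
                data = loads('{' + line.split('{', 1)[1])
            except (ValueError, IndexError):
                data = None
    if key and data is not None:
        cb.set(key=key, value=data, ttl=expiry)
    else:
        print 'Skipping incomplete record, key =', key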
Now, let’s run the recovery script on every dumped file using the power of bash:
#!/bin/bash
for f in $DUMP_DIR/*couch*
do
./cb_recovery.py $f &
sleep 1
done
The above snippet runs an instance of the script for each dump file in the background (this is what the & does; if for some reason you want to do it serially and slowly, remove the &). Hopefully, after several minutes or hours, your data should be back in the DB.
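Once the scripts have finished, it does not hurt to spot-check a few keys. A minimal sketch, assuming the same placeholder connection details as in the recovery script (the key below is made up to match the example output earlier, so use keys you know exist in your data set):

#!/usr/bin/python
# Spot-check a few restored documents. The key and connection details are
# placeholders; replace them with real keys and your own bucket credentials.
from couchbase import Couchbase

cb = Couchbase.connect(host='localhost', bucket='bucket',
                       password='swordfish')
for key in ['offers::3bb6d87e0acfbd1431e33955a8c068d3ad967a8e']:
    try:
        result = cb.get(key)
        print key, '->', result.value
    except Exception as e:
        print 'Missing or unreadable:', key, e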
