This is the current database file format used by Couchbase Server 2.0. It's similar, but not identical, to the format used by Apache CouchDB 1.2. It is implemented in a Couchbase 2.0 fork of CouchDB (in Erlang), and also by Couchstore in C.
The most important thing to understand about a CouchDB file is that it is append-only. The only way the file is modified is to append new data at the end; bytes written are never overwritten. As a consequence, the critical file "header" data actually lives at the tail end of the file, since it has to be re-appended every time the file is changed.
This has several advantages:
- The data format is extremely robust, since the file is never in an inconsistent state. Even if the process crashes or the kernel panics partway through writing an update, the software can recover simply by scanning back from the end of the file to the last valid header.
- Writers don't disturb readers at all, allowing concurrent access without read locks.
- By default, earlier versions of a record remain in the file, making it easy to implement a version history and multi-version concurrency control (as exposed in the CouchDB API.)
The disadvantage, of course, is that the file grows without bound as it's modified. To work around this, the file is periodically compacted by writing the live data to a new file, then replacing the old file with the new one. This can be done in the background without locking out either readers or writers.
- Numbers are in big-endian byte order (most significant byte first).
- All values are tightly packed; there is no padding or multi-byte alignment.
- Some values occupy partial bytes and have lengths that are not a multiple of 8 bits. These are stored in most-significant to least-significant bit order. So for example, if a 1-bit field is followed by a 47-bit field, the first field occupies the MSB of the first byte, and the second occupies the rest of the first byte and the entirety of the next five bytes.
For purposes of locating the current file header (q.v.), the file is organized into 4096-byte blocks. The first byte of every block is reserved and identifies whether it's a data block (zero) or a header block (nonzero). Therefore only 4095 of the bytes are available for storing data.
Above the block level, these prefix bytes are invisible and simply skipped. So a data value that spans a block boundary will be written out with a zero byte inserted at the boundary, and this byte will be removed when reading the value.
The data in the file (above the block level) is grouped into variable-length chunks. All chunks are prefixed with their length as a 32-bit big endian integer, followed by the CRC32 checksum of the data. The CRC32 is not included in the length. The data chunk length has its highest bit set.
length | content |
---|---|
32 bits | body length (highest bit set) |
32 bits | CRC32 checksum of body |
followed by the body data
Similarly to plain data chunks, encrypted chunks are prefixed with the ciphertext length as a 32-bit big endian integer. The highest bit is not set. The length is followed by the ciphertext, which includes the MAC tag. No extra checksum is included.
A per-file random key is used, which is generated when creating the file.
GCM is used with a 96-bit (12-byte) nonce/IV. The ciphertext is suffixed by a 16-byte MAC tag.
As data chunks are always appended to the end of the file, we use the offset in the file as a unique value for the nonce. A chunk that appears at the beginning of a block may be read from the block boundary or one byte ahead, as the first byte of a block is the block type which will be skipped over. To ensure that the same nonce is used for encryption and decryption, we need to adjust the nonce such that it is the same in those two cases. If the offset is at the block boundary, we increment the nonce by one.
A file header always appears on a 4096-byte block boundary, and the first byte of the block is 0x01 to signal that it contains a header. A file will contain many headers as it grows, since a new updated one is appended whenever data is saved; the current header is the last one in the file. So the algorithm to find the header when opening the file is to seek to the last block boundary, read one byte, and keep skipping back 4096 bytes until the byte read is nonzero.
The file header is prefixed with a 32 bit length and a checksum, similarly to other data chunks, but the length field does include the length of the hash.
File headers are not encrypted.
The CRC32 variant in header and data chunks depends on the file version. Since version 12, CRC32C (Castagnoli) is used, as some processors provide instructions for accelerating its computation.
Version 14 introduces support for encryption and adds the position of the previous header to the file header. The position is all ones if there is no previous header.
Values in the body of a file header:
length | content |
---|---|
8 bits | File format version (currently 14) |
48 bits | Sequence number of next update. |
48 bits | Purge counter. |
48 bits | Purged documents pointer. (unused) |
16 bits | Size of by-sequence B-tree root |
16 bits | Size of by-ID B-tree root |
16 bits | Size of local documents B-tree root |
64 bits | Timestamp (present from version 13) |
48 bits | Position of previous header (present from version 14) |
- The B-tree roots, in the order of the sizes, are B-tree node pointers as described in the "Node Pointers" section.
When encryption is used, the master key ID and encrypted file key is stored in a metadata header. This appears as the first chunk of the first block. The first block then has type 0x02.
length | content |
---|---|
32 bits | Length of the following (including checksum) |
32 bits | CRC32C checksum of the following |
8 bits | Version (currently 0) |
8 bits | Key ID length |
... | Key ID |
8 bits | Encrypted File Key length |
... | Encrypted File Key |
The encrypted file key is prefixed by a random nonce and suffixed by the MAC tag (all included in length). The key ID is Authenticated as Associated Data.
The B-trees used in CouchDB files are a bit different than in a typical implementation, because the file is append-only. A tree node is never updated in place; instead, a new copy of the updated node is written at the end of the file. Of course this means that the node's parent also has to be updated to point at the updated node, so the effect is that every modification has to rewrite all the nodes from the affected leaf up to the root. In practice this isn't too expensive, especially if multiple writes are batched together into one update.
CouchDB's B-trees also have to forgo the sibling-node chaining that's typically used to speed up sequential access. The reason is that updating a node would invalidate the pointers to it in its siblings, forcing those nodes to be updated as well, resulting in a huge cascade of updates. Instead, the iteration algorithm remembers the path back to the root and periodically backtracks to find the next sibling node.
All B-tree nodes are compressed using the Snappy algorithm. The descriptions following all refer to the uncompressed form.
- First byte -- 1 if a leaf (key/value) node, 0 if an interior (key/pointer) node
- A list of key-value or key-pointer pairs, each of which is:
- 12 bits -- Key size
- 28 bits -- Value size
- Followed by the Key data,
- then the Value or Pointer data
In interior nodes the Value parts of these pairs are pointers to another B-tree node, where keys less than or equal to that pair's Key will be.
In leaf nodes the values are interpreted differently by each index; see the Indexes section below.
- 48 bits -- Node position in file
- 48 bits -- Sub tree size, or disk size of B-tree data below this node. For pointers to leaf nodes this is the size on disk of the pointed to node, otherwise it is the sum of the size of the pointed to node and all the values of this field on the pointers in that node.
- 16 bits -- Reduce value size (NOT present in the root pointers embedded in the file header, as it can be inferred from the lengths in the header)
- Reduce value -- Stores statistics about the subtree beneath this node. The exact data format is different in each type of B-tree, and is detailed below.
A CouchDB file contains three B-trees:
- The by-ID index, which maps document IDs to values.
- The by-sequence index, which maps sequence numbers (database change numbers, which increase monotonically with every change) to document IDs and values.
- The local-documents index, which is conceptually the same as the
by-ID index except that the documents in it do not appear in the
by-sequence index, and by CouchDB convention the document IDs all
begin with the ASCII sequence
_local/
.
This B-tree is an index of documents by their IDs. The keys are simply
the document IDs, ordered lexicographically by raw bytes (as by the
memcmp
function.)
The values are:
length | content |
---|---|
48 bits | The sequence number this document was last modified at. If another revision of this document is later saved, this sequence number should be removed from the by-sequence index. |
32 bits | Size of the document data |
1 bit | Deleted flag. If this is set this document should be ignored in indexing. |
47 bits | Position of the document content on disk |
1 bit | 1 if the value is compressed using the Snappy algorithm |
7 bits | Content type code (q.v.) |
48 bits | Document revision number (rev_seq) |
Followed by
- Revision Metadata (q.v.)
Couchstore does not react to the content type code, document revision number, or revision metadata. These values are used by CouchDB and Couchbase Server.
Couchstore can optionally handle the deleted flag when asked to (users can read the sequence index without deleted records, or run a "purge" compaction that drops deleted items.)
The reduce value in interior nodes of this B-tree is:
length | content |
---|---|
40 bits | Number of documents that are not flagged as deleted. |
40 bits | Number of documents that are flagged as deleted |
48 bits | Total disk size of document content |
CouchDB uses this to identify the type of the document data.
-
0 -- JSON data
-
1 -- BLOB data (Attempted to parse as JSON but failed)
-
2 -- BLOB data (Parsed as syntactically valid JSON, but illegally contains reserved keys) There are no longer reserved keys in Couchbase, so this content-type code should never be seen. It should be okay to repurpose it at some point
-
3 -- BLOB data (No attempt to parse as JSON)
This is identifying information about a document revision that should be small enough to be kept in the indexes. This value is not interpreted by Couchstore.
It's currently used by Couchbase as
length | content |
---|---|
64 bits | Memcached CAS value (unique cookie used by memcache to allow atomic writes) |
32 bits | Expiration time of a value |
32 bits | User flags as specified by Memcache protocol |
The keys in this B-tree are 48-bit numbers representing the sequence number of a change to the database. This number is strictly increasing.
The values are:
length | content |
---|---|
12 bits | Size of the document ID |
28 bits | Size of the document data |
1 bit | Deleted flag (this item in the by-sequence index represents a deletion action if this is set) |
47 bits | Position of the document content on disk |
1 bit | 1 if the value is compressed using the Snappy algorithm |
7 bits | Content type code |
48 bits | Document revision number (rev_seq) |
Followed by
- Document ID
- Revision Metadata (variable size; extends to the end of the value record)
The reduce value in interior nodes of this B-tree is a 40 bit number of records
Local documents are used for file-level configuration and metadata.
The keys in the local document b-tree are the document IDs, including
the _local/
prefix, and as in the ID tree sorted by memcmp
.
The value in the local document b-tree is the raw JSON body of the local document.