Skip to content

Latest commit

 

History

History
168 lines (120 loc) · 11.1 KB

readme.md

File metadata and controls

168 lines (120 loc) · 11.1 KB

TiDB Engine Extensions Library

Author(s): solotzg and CalvinNeo

Abstract

This repository is to introduce a TiKV based c dynamic library for extending storage system in TiDB cluster. It aims to export current multi-raft framework to other engines and make them be able to provide services(read/write) as raftstore directly.

Background

Initially, such framework was designed for Realtime HTAP scenarios. There is already a distributed OLTP storage product TiKV, and we could extend other kind of realtime analytics system based on current multi-raft mechanism to handle more complicated scenarios. For example, assume a strong schema-aware storage node could be accessed as a raftstore with special identification labels. Third-party components can use Placement Rules, provided by PD, to schedule learner/voter replicas into it. If such storage system has supported Multi-raft RSM, Percolator Transaction Model and Transaction Read Protocol, just like TiFlash(a distributed column-based storage) does, it will be appropriate for HTAP cases.

If transaction is not required, like most OLAP cases which only guarantee Eventual Consistency, and what matters more is throughput rather than latency. Then, data(formed by table schema or other pattern) could be R/W from this kind of raftstore directly.

Design

Overview

Generally speaking, there are two storage components in TiKV for maintaining multi-raft RSM: RaftEngine and KvEngine. KvEngine is mainly used for applying raft command and providing key-value services. RaftEngine will parse its own committed raft log into corresponding normal/admin raft commands, which will be handled by the apply process. Multiple modifications about region data/meta/apply-state will be encapsulated into one Write Batch and written into KvEngine atomically. It is an option to replace KvEngine with Engine Traits. But it's not easy to guarantee atomicity while writing/reading dynamic key-value pair(such as meta/apply-state) and patterned data(strong schema) together for other storage systems. Besides, a few modules and components(like importer or lighting) reply on the SST format of KvEngine in TiKV. It may cost a lot to achieve such a replacement.

It's suggested to let the apply process work as usual but only persist meta and state information to bring a few intrusive modifications against the original logic of TiKV. i.e., we must replace everywhere that may write normal region data with related interfaces. Unlike KvEngine, the storage system(called engine-store) under such a framework should be aware of the transition about multi-raft RSM from these interfaces. The engine-store must have the ability to deal with raft commands to handle queries with region epoch.

The region snapshot presents the complete region information(data/meta/apply-state) at a specific apply-state. Anyway, because there are at least two asynchronous runtimes in one program, the best practice of such raft store is to guarantee External Consistency by region snapshot.

The raft logs persisted in RaftEngine are the WAL(Write-ahead Log) of the apply process. Index of raft entry within the same region peer is monotonic increasing. If the process is interrupted at the middle step, it should replay from the last persisted apply-state after the restart. Until a safe point is reached, related modifications are not visible to others.

Idempotency is an essential property for External Consistency, which means such a system could handle outdated raft commands. A practical way is like:

  • Fsync snapshot in engine-store atomically.
  • Fsync region snapshot in raftstore-proxy atomically.
  • Make RaftEngine only GC raft log whose index is smaller than persisted apply-state.
  • engine-store should screen out raft commands with outdated apply-state during apply process.
  • engine-store should recover from the middle step by overwriting and must NOT provide services until caught up with the latest state.

Such architecture can inherit several important features from TiKV, such as distributed fault tolerance/recovery, automatic re-balancing, etc. It's also convenient for PD to maintain this kind of storage system by the existing way as long as it works as raft store.

We've managed to refactor TiKV source code and extract parts of the necessary process into interfaces. Learn more in new-proxy.

Interfaces

The main categories are like:

  • applying commands
    • applying normal-write raft command.
    • applying IngestSst write command.
    • applying admin raft command.
  • region management
    • peer detection: destroy peer.
    • region snapshot: pre-handle/apply region snapshot.
  • replica read: batch read-index
  • utils and support services
    • status services: metrics; CPU profile; config; thread stats; self-defined API.
    • pd-related store stats: key/bytes R/W stats; disk stats; engine-store stats.
    • encryption: get file; new file; delete file; link file; rename file.
    • SST file reader.
    • tools/utils.

Handling Commands

TiKV can split or merge regions to make the partitions more flexible:

  1. When the size of a region exceeds the limit, it will split into two or more regions, and its range would change from [a, c) to [a, b) and [b, c).
  2. When the sizes of two consecutive regions are small enough, TiKV will merge them into one, and their range would change from [a, b) and [b, c) to [a, c).

We must persist the region snapshot when executing admin raft commands about split, merge or change peer because such commands will change the core properties(version, conf version, start/end key) of multi-raft RSM.

Ignorable admin command CompactLog may trigger raft log GC in RaftEngine. Thus, to execute such commands, it's required to persist region snapshot. Since a persistence may be expenseive in some storage engines, we introduce a pre_exec observer in TiKV. If a persistence can't be done, the execution of CompactLog can be filtered by TiKV.

ComputeHash/VerifyHash are used to check consistency. However, they rely on a Rocksdb-based implementatiob storage layer. Those commands can also be filtered by pre_exec observer.

However executing normal-write or some other admin commands won't change region meta, the decision of persisting can be pushed down to engine-store. We introduce a post_exec observer in TiKV, which tells TiKV to do a commit right after we finished executing these commands.

When the region in the current store is illegal or pending removal, TiKV will execute a destroy-peer task to clean useless data, which will also be observed by proxy and reported to engine-store.

When maintaining DWAL, it's practical to batch raft msg before fsync as long as latency is tolerable to reduce IOPS(mainly in RaftEngine) and make it system-friendly with poor performance.

ReadIndex

According to the basic transaction log replication, a leader peer must commit or apply each writing action before returning success ACK to the client. When any peer tries to respond to queries, it should get the latest committed index from the leader and wait until the apply-state caught up to ensure it has enough context. For learners/followers or even leaders, the Read Index is a practical choice to check the latest Lease because it's easy to make any peer of region group provide read service under the same logic as the overhead of read-index itself is insignificant.

Apply Snapshot

When the leader peer has reclaimed related raft log or other peers can not proceed with RSM in the current context, other peers can request a region snapshot from the leader. However, the region snapshot data, whose format is TiKV's SST file, is not usually used by other storage systems directly. The standard process has been divided into several parts to accelerate the speed of applying region snapshot data:

  • SST File Reader to read key-value one by one from SST files.
  • Multi-thread pool to pre-handle SST files into the self-defined structure of engine-store.
  • Delete old data within [start-key, end-key) of the new region strictly.
  • Apply self-defined structure by original sequence.

IngestSst

Interfaces about IngestSst are the core to be compatible with TiDB Lighting and BR for the HTAP scenario. It can substantially speed up data loading/restoring. SST File Reader is also useful when applying the IngestSst raft command.

Encryption

Encryption is essential for DBaaS(database as a service). To be compatible with TiKV, a data key manager with the same logic is indispensable, especially for rotating data encryption keys or using the KMS service.

Status Server

Status services like metrics, CPU/Memory profile(flame graph), or other self-defined stats can effectively support the diagnosis. It's suggested to encapsulate those into one status server and let other external components visit through the status address. We could also reuse most of the original metrics of TiKV, and an optional way is to add a specific prefix for each name.

FFI

Since the program language Rust, which TiKV uses, has zero-cost abstractions, it's straightforward to let different threads interact with each other by FFI(Foreign Function Interface). Such mode brings almost no overhead. However, any caller must be pretty clear about the exact safe/unsafe operations boundary. The structure used by different runtimes through interfaces must have the same memory layout.

Usage

There are two exposed extern "C" functions in raftstore-proxy:

  • print_raftstore_proxy_version: print necessary version information(just like TiKV does) into standard output.
  • run_raftstore_proxy_ffi:
    • the main entry accepts established function pointer interfaces and command arguments.
    • it's suggested to run main entry function in another independent thread because it will block current context.

To use this library, please follow the steps below:

  • Install grpc, protobuf, c++, rust.
  • Include this project as submodule.
  • Modify FFI Source Code under namspace DB if necessary and run make gen_proxy_ffi.
  • Run ENGINE_LABEL_VALUE=xxx make release:
    • label engine:${ENGINE_LABEL_VALUE} will be added to store info automatically.
    • prefix ${ENGINE_LABEL_VALUE}_proxy_ will be added to each metrics name.
  • Include FFI header files and implement related interfaces (mainly struct EngineStoreServerHelper and struct RaftStoreProxyFFIHelper) by c++.
  • Compile and link target library target/release/lib${ENGINE_LABEL_VALUE}_proxy.dylib|so.

Interfaces Description

TODO

  • support R/W as Leader
  • resources control
  • async future framework
  • direct writing

Contact

Rongzhen Luo ([email protected])

License

Apache 2.0 license. See the LICENSE file for details.

Acknowledgments

  • Thanks tikv for providing raftstore.
  • Thanks pd for providing placement rules.