feat: redesign controller-replica communication #1246
Which issue(s) this PR fixes: longhorn/longhorn#6600
What this PR does / why we need it:
This PR implements a new communication structure between the controller and the replicas. The main focus of this restructuring is removing the loop function and replacing it with a more scalable approach.
In the current implementation, a new operation issued by the frontend goes into a request channel and waits for the loop function to serve it. The same approach is followed for the replies from the replicas. The main flaw of the loop function is that only one thread is running, and it serves both requests and replies, because the Messages map can be read or written by only one thread. So even if there is a burst of incoming requests or replies, the system is capped at serving these operations one at a time. This solution is not scalable.
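For context, here is a minimal sketch of the kind of single-goroutine loop this PR removes. The type and field names are illustrative assumptions for the sketch, not the exact longhorn-engine code:

```go
package dataconn

// Message is a stand-in for one inflight I/O operation.
type Message struct {
	Seq      uint32
	Offset   int64
	Data     []byte
	Complete chan struct{} // closed when the reply arrives
}

// oldClient models the pre-PR design: a map of inflight messages that only
// the loop goroutine may touch.
type oldClient struct {
	requests  chan *Message       // frontend pushes new operations here
	responses chan *Message       // wire reader pushes replica replies here
	messages  map[uint32]*Message // inflight operations, keyed by Seq
	nextSeq   uint32
}

// loop is the only goroutine allowed to read or write c.messages, so every
// request and every reply is funneled through this single thread.
func (c *oldClient) loop() {
	for {
		select {
		case req := <-c.requests:
			// Register the request; the real code would then send it on the wire.
			req.Seq = c.nextSeq
			c.nextSeq++
			c.messages[req.Seq] = req
		case resp := <-c.responses:
			// Match the reply to its request and complete it, again one at a time.
			if req, ok := c.messages[resp.Seq]; ok {
				delete(c.messages, resp.Seq)
				req.Data = resp.Data
				close(req.Complete)
			}
		}
	}
}
```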
The new approach scales out operation processing by removing the loop function and replacing it with direct calls to handleRequest and handleResponse. To make this possible, concurrent reads and writes to the Messages map also have to be prevented. I replaced the map with a Messages array and a Seq channel, which together ensure that every index of the array is accessed by only one thread at a time. What the new structs do:
Messages array: stores the same data as the Messages map, but has a fixed length equal to the number of inflight I/Os we want to cap at. Its index is the id/Seq of the message.
Seq channel: stores all the available Seqs that a new operation can acquire. Go channels ensure that only one thread gets the next available Seq. When the Client is initialized, the Seq channel is populated with the numbers from 0 to Size(Messages). When an operation completes, the client puts the now-available index back into the Seq channel.
With this approach, when a reply comes back from the replica, the client can identify the message by using the id as an index into the Messages array, which preserves the integrity of the Messages array.
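A minimal sketch of the Messages array plus Seq channel described above, reusing the Message type from the previous sketch; the array size and exact signatures are assumptions for illustration, not the exact PR code:

```go
package dataconn

const inflightCap = 1024 // fixed cap on inflight I/O, equal to len(messages)

// Client replaces the map with a fixed-size array plus a channel acting as a
// pool of free Seqs/indices.
type Client struct {
	messages [inflightCap]*Message // indexed by Message.Seq
	seqs     chan uint32           // every currently free index lives here
}

func NewClient() *Client {
	c := &Client{seqs: make(chan uint32, inflightCap)}
	// Populate the pool with every valid index on init.
	for i := uint32(0); i < inflightCap; i++ {
		c.seqs <- i
	}
	return c
}

// handleRequest is called directly by the issuing goroutine. Receiving from
// c.seqs hands that goroutine exclusive ownership of one array slot.
func (c *Client) handleRequest(msg *Message) {
	seq := <-c.seqs // blocks when all inflight slots are in use
	msg.Seq = seq
	c.messages[seq] = msg
	// ... write msg to the replica connection ...
}

// handleResponse is called directly by the goroutine reading replies. The
// reply's Seq identifies the slot, so no shared map or lock is needed.
func (c *Client) handleResponse(resp *Message) {
	msg := c.messages[resp.Seq]
	c.messages[resp.Seq] = nil
	msg.Data = resp.Data
	close(msg.Complete)
	c.seqs <- resp.Seq // return the now-free index to the pool
}
```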
Basic data-path flow:
This approach uses multiple concurrent threads, each serving a different request, and it scales up to the frontend's and the network's capabilities.
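To make the flow concrete, here is a hypothetical frontend-side entry point built on the sketch above (the name WriteAt and its signature are assumptions, not the PR's API); each issuing goroutine drives its own I/O end to end:

```go
// WriteAt issues one write and blocks until its own reply arrives. Many such
// goroutines can be inflight at once, bounded only by the Seq pool size.
func (c *Client) WriteAt(buf []byte, offset int64) error {
	msg := &Message{Offset: offset, Data: buf, Complete: make(chan struct{})}
	c.handleRequest(msg) // direct call; no request channel, no loop goroutine
	<-msg.Complete       // handleResponse closes this when the replica replies
	return nil
}
```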
Benchmarks
As mentioned before, this approach is scalable up to the frontend's capabilities. Since tgt is far from an optimal solution, using it as the frontend shows minimal to zero performance boost. In order to see the potential of this approach, I used the ublk frontend I have implemented here.
Specs:
CPU: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
RAM: 128GB
Disk: Samsung 860 Evo SSD 250GB
Network: 10Gbps
One replica, one controller, 2 machines in a local cluster
For the ublk + optimizations runs:
ublk frontend queues: 6
Controller-replica connections: 6
The fio command used for testing IOPS:
sudo fio --name=read_iops --filename=/dev/ublkb0 --size=5G --numjobs=12 --time_based --runtime=30s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=4K --iodepth=64 --rw=randwrite --group_reporting=1
The fio command used for testing bandwidth:
sudo fio --name=bandwidth --filename=/dev/ublkb0 --size=5G --numjobs=12 --time_based --runtime=30s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=read --group_reporting=1
Additional documentation or context
*"Replica R/W disabled" means the replica code is altered to answer with dummy replies instead of doing the actual R/W on the Linux sparse files.
Note that with this approach, once Replica R/W is disabled, we manage to scale the system up to the network's speed (the replica R/W path is another bottleneck that needs investigation, given that the system up to this point can perform at these numbers).
Special notes for your reviewer:
I know I have removed some of the client's functionality regarding the Journal, but I found it to be part of the bottleneck. I am open to a conversation about putting any necessary parts of it back.