feat: redesign controller-replica communication #1246

Open
Conversation

Kampadais

Which issue(s) this PR fixes: longhorn/longhorn#6600

What this PR does / why we need it:

This PR implements a new communication structure between the controller and the replicas. The main focus of this restructure is to remove the loop function and replace it with a more scalable approach.

In the current implementation, every new operation issued by the frontend goes into a request channel and waits for the loop function to serve it. Replies from the replicas follow the same path. The main fault of the loop function is that only one thread is running, and it serves both requests and replies, because the Messages map can be read and written by only one thread. So even when many requests or replies are pending, the system is capped at serving these operations one at a time. This solution is not scalable.
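For context, this is roughly what the current pattern looks like. This is a minimal sketch with illustrative names, not the actual longhorn-engine identifiers: a single loop goroutine owns the Messages map, so every request and every reply is serialized through it.

```go
// Sketch of the current design: one loop goroutine owns the map,
// so requests and replies are handled one at a time.
type message struct {
	Seq      uint32
	Data     []byte
	Complete chan struct{}
}

type loopClient struct {
	requests  chan *message       // operations coming from the frontend
	responses chan *message       // replies read off the wire
	messages  map[uint32]*message // touched only by the loop goroutine
}

func (c *loopClient) loop() {
	for {
		select {
		case req := <-c.requests:
			c.messages[req.Seq] = req // register the in-flight request
			// ... write req to the wire ...
		case resp := <-c.responses:
			if msg, ok := c.messages[resp.Seq]; ok {
				delete(c.messages, resp.Seq)
				msg.Data = resp.Data
				close(msg.Complete) // wake the waiting caller
			}
		}
	}
}
```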

The new approach scales the operation processing by removing the loop function and replacing it with immediate calls to handleRequest and handleResponse. To make this possible, concurrent R/W on the Messages map still has to be prevented, so I replaced it with a Messages array and a Seq channel that ensure every index of the array is accessed by only one thread at a time. What the new structures do:

  • Messages array: it stores the same data as the Messages map, but has a fixed length equal to the number of in-flight I/Os we want to cap at. Its index is the id/Seq of the message.

  • Seq channel: it stores all the available Seqs that a new operation can acquire. Go channels ensure that only one thread gets the next available Seq. When the Client is initialized, the Seq channel is populated with numbers from 0 to Size(Messages). When an operation completes, the client puts the now-available index back into the Seq channel.

With this approach, when a reply comes from the replica, the client can identify the message by using the id as an index into the Messages array, while the Seq channel preserves the integrity of the array.
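Here is a minimal sketch of the two new structures, reusing the message type from the sketch above; the identifiers and the in-flight cap are illustrative, not the exact names used in this PR:

```go
const maxInflight = 1024 // cap on in-flight I/O; illustrative value

type client struct {
	messages [maxInflight]*message // indexed by Seq; each slot has exactly one owner at a time
	seq      chan uint32           // pool of free Seqs, i.e. free array indices
	send     chan *message         // wire-writer input, unchanged from the current version
}

func newClient() *client {
	c := &client{
		seq:  make(chan uint32, maxInflight),
		send: make(chan *message, maxInflight),
	}
	// Populate the pool with every Seq from 0 to Size(Messages)-1.
	for i := uint32(0); i < maxInflight; i++ {
		c.seq <- i
	}
	return c
}
```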

Basic data-path flow :

  1. Frontend issues a new operation
  2. Client accepts it with one of the WriteAt/UnmapAt/ReadAt
  3. The operation function is called, formats the message and calls handleRequest()
  4. handleRequest gives the message a Seq from the Seq channel and forwards it to the send channel
  5. This part is unchanged and follows the flow of the current version
  6. When the read function reads a reply off the wire, it immediately calls handleResponse()
  7. handleResponse identifies the operation being replied to, using the Seq as an index into the Messages array, and, after manipulating it, notifies the message's Complete channel to complete the request
  8. After the request is completed, the operation function reinserts the used Seq into the Seq channel so it can be used by the next operation

This approach uses multiple concurrent threads, each serving a different request, and scales up to the frontend's and the network's capabilities.
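Continuing the same sketch, the request/response path without the loop goroutine could look roughly like this (illustrative names again; error handling and the actual wire read/write are omitted):

```go
// Called from WriteAt/ReadAt/UnmapAt after the message has been formatted.
func (c *client) operation(msg *message) error {
	msg.Complete = make(chan struct{})
	c.handleRequest(msg)
	<-msg.Complete   // wait until handleResponse signals completion
	c.seq <- msg.Seq // return the Seq to the pool for the next operation
	return nil
}

func (c *client) handleRequest(msg *message) {
	msg.Seq = <-c.seq         // blocks once maxInflight requests are outstanding
	c.messages[msg.Seq] = msg // safe: the Seq pool guarantees exclusive ownership of this slot
	c.send <- msg             // hand off to the unchanged wire-writer path
}

// Called directly by the read function for every reply taken off the wire.
func (c *client) handleResponse(seq uint32, data []byte) {
	msg := c.messages[seq] // direct index lookup, no shared map or lock
	msg.Data = data
	close(msg.Complete) // complete the request
}
```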

Benchmarks

As mentioned before, this approach scales up to the frontend's capabilities. Since tgt is far from an optimal solution, using it as the frontend shows minimal to zero performance boost. In order to see the potential of this approach, I used the ublk frontend I have implemented here.

Specs :
CPU : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
RAM: 128GB
Disk: Samsung 860 Evo SSD 250GB
Network : 10Gbps

One replica , one controller , 2 machines in a local cluster
For the ublk + optimizations setup:
      Ublk frontend queues: 6
      Controller-Replica connections: 6

The fio command used for testing IOPS:
sudo fio --name=read_iops --filename=/dev/ublkb0 --size=5G --numjobs=12 --time_based --runtime=30s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=4K --iodepth=64 --rw=randwrite --group_reporting=1
The fio command used for testing bandwidth:

sudo fio --name=bandwidth --filename=/dev/ublkb0 --size=5G --numjobs=12 --time_based --runtime=30s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=read --group_reporting=1

Additional documentation or context

(image: benchmark results)

*"Replica R/W disabled" means that the replica code is altered to answer with dummy replies instead of doing the actual R/W on the Linux sparse files.

Note that with this approach, once replica R/W is disabled, we manage to scale the system up to the network's speed (the replica R/W is another bottleneck that needs investigation, given that the system up to this point can perform at these numbers).

Special notes for your reviewer:

I know I have removed some of the client's functionality regarding the Journal, but I found it to be part of the bottleneck. I am open to a conversation about putting any necessary parts of it back.

@Kampadais Kampadais changed the title Improved controller-replica communication feat: redesign controller-replica communication Oct 9, 2024
@liyimeng

any cpu usage number available?

@Kampadais
Author

any cpu usage number available?

Sorry, I forgot to include this. When actual R/W is disabled I get around 25-30% using tgt-blockdev and 75-80% with ublk as the frontend.

@liyimeng

any cpu usage number available?

Sorry, I forgot to include this. When actual R/W is disabled I get around 25-30% using tgt-blockdev and 75-80% with ublk as the frontend.

Thanks @Kampadais! So it is higher than tgt, but lower than v2 engine?

@Kampadais
Author

Sorry for the late reply. Yes, it is lower, but given the complexity and requirements of the V2 engine and longhorn/longhorn#6600, there is also a need to improve V1 performance.
