
Kubeshare prototyping and compute sharing deep dive #88

Open · Fizzbb opened this issue Jan 26, 2022 · 4 comments

Fizzbb (Collaborator) commented Jan 26, 2022

No description provided.

Fizzbb moved this to In Progress in Alnair 20220130 on Jan 26, 2022
pint1022 (Collaborator) commented Feb 2, 2022

Compared with the GaiaGPU solution, I think KubeShare's architecture has some good designs for monitoring and managing compute resources.

Fizzbb (Collaborator, Author) commented Feb 2, 2022

To do: run a pod with a 50% utilization request through KubeShare, and monitor how the GPU utilization fluctuates.
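
For that experiment, a minimal monitoring sketch (not from the KubeShare repo) could sample GPU utilization on the node with `nvidia-smi` and summarize the fluctuation. The sampling interval and sample count below are arbitrary choices.

```python
#!/usr/bin/env python3
"""Sample GPU utilization periodically and summarize its fluctuation.

Assumes nvidia-smi is on PATH and is run on the node hosting the
KubeShare pod while the 50%-utilization workload is active.
"""
import statistics
import subprocess
import time

SAMPLE_INTERVAL_S = 1.0   # seconds between samples (assumed)
NUM_SAMPLES = 120         # ~2 minutes of data (assumed)

def gpu_utilization(gpu_index: int = 0) -> float:
    """Return the instantaneous GPU utilization (%) reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        text=True,
    )
    return float(out.strip())

def main() -> None:
    samples = []
    for _ in range(NUM_SAMPLES):
        samples.append(gpu_utilization())
        time.sleep(SAMPLE_INTERVAL_S)
    # A large stdev or a wide min/max spread indicates heavy fluctuation
    # around the requested 50% share.
    print(f"samples={len(samples)} "
          f"mean={statistics.mean(samples):.1f}% "
          f"stdev={statistics.pstdev(samples):.1f}% "
          f"min={min(samples):.0f}% max={max(samples):.0f}%")

if __name__ == "__main__":
    main()
```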

pint1022 (Collaborator) commented

Gemini algorithm major features:

  1. The vGPU hook is assigned per container. It is event-driven and communicates with the GPU isolation module over IPC. Its main jobs: kernel watchdog, heartbeat, and interrupting overuse.
  2. The GPU isolation module is a daemon process. It is event-driven, handles IPC requests, creates new SharePods, and spawns monitor threads per pod. Its main algorithmic pieces: usage monitoring, a sliding window, a request queue, and priority management (a sketch of the sliding-window idea follows this list).
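
To make the usage-monitoring / sliding-window piece concrete, here is a minimal sketch of how an isolation module could track a pod's recent GPU busy time and decide whether it is still within its share. The class name, window length, and accounting granularity are illustrative assumptions, not Gemini's actual code.

```python
import collections
import time

class SlidingWindowQuota:
    """Track a pod's GPU busy time over a sliding window and enforce a share."""

    def __init__(self, share: float, window_s: float = 10.0):
        self.share = share                  # e.g. 0.5 for a 50% compute request
        self.window_s = window_s            # sliding window length in seconds
        self.records = collections.deque()  # (timestamp, busy_seconds) entries

    def record(self, busy_seconds: float) -> None:
        """Called by the per-pod monitor thread when a kernel burst finishes."""
        self.records.append((time.monotonic(), busy_seconds))

    def used_in_window(self) -> float:
        """GPU busy time accumulated inside the current window."""
        cutoff = time.monotonic() - self.window_s
        while self.records and self.records[0][0] < cutoff:
            self.records.popleft()           # slide the window forward
        return sum(busy for _, busy in self.records)

    def within_quota(self) -> bool:
        """True while the pod's usage is below its share of the window."""
        return self.used_in_window() < self.share * self.window_s


# A pod with a 50% share of a 10 s window: after 3 s of busy time it is
# still within quota; after 6 s it is over quota and its requests would
# be queued behind other pods.
quota = SlidingWindowQuota(share=0.5, window_s=10.0)
quota.record(3.0)
print(quota.within_quota())   # True  (3 s used < 5 s allowed)
quota.record(3.0)
print(quota.within_quota())   # False (6 s used > 5 s allowed)
```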

pint1022 (Collaborator) commented

The Gemini algorithm deserves to be explored and tested further; I like it very much. The next analysis step involves a lot of coding and testing. Two main questions:

  1. Will the GPU sit idle if one pod runs out of quota and no other pod/thread is asking for the GPU? Answer: normally it does not run into that situation, and even if it hits that unusual case, it should be avoidable by tuning parameters (see the sketch after this list).
  2. Does CUDA do a context switch when the active pod changes? That is a question for the vGPU driver, not for KubeShare. In my research, the vGPU driver should use one context shared among all the sharing threads.
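
On question 1: one common way to avoid the idle case is a work-conserving policy, i.e. an over-quota pod is still granted the GPU when nothing else is waiting, and the extra usage is simply charged against its window. A minimal sketch of that policy, with illustrative names rather than Gemini's code:

```python
import collections

def next_grant(request_queue: collections.deque, remaining_budget: dict):
    """Pick the next pod to run on the GPU.

    remaining_budget maps pod name -> seconds of GPU time left in the
    current sliding window. Pods with budget left are preferred; if every
    waiting pod is over quota, the head of the queue still runs so the
    GPU never sits idle (work-conserving).
    """
    if not request_queue:
        return None                       # nothing to run at all
    for pod in request_queue:
        if remaining_budget.get(pod, 0.0) > 0.0:
            request_queue.remove(pod)
            return pod                    # in-quota pod goes first
    return request_queue.popleft()        # work-conserving fallback

# pod-a has exhausted its budget but is the only requester, so it still
# gets the GPU instead of leaving the device idle.
queue = collections.deque(["pod-a"])
print(next_grant(queue, {"pod-a": 0.0}))   # -> pod-a
```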
