CentaurusInfra / alnair Public

Issues: CentaurusInfra/alnair

63 Open 18 Closed

Issues list

GDS traffic monitor

#147 opened Oct 31, 2022 by pint1022

[alnair device plugin] feature request -- support GPU selection

#143 opened Oct 24, 2022 by Fizzbb

GPUDirect to local SSD

#136 opened Jul 27, 2022 by Fizzbb

intercept-lib test instruction doesn't work.

#132 opened Jul 15, 2022 by awang088

Add prometheus export to report process-level GPU utilization and memory used size

#131 opened Jun 20, 2022 by Fizzbb

scheduling needs

#128 opened Apr 14, 2022 by Fizzbb

vGPU scheduler assumes all nodes have the GPU information annotation; it cannot handle CPU nodes or the period before the annotation is patched

#125 opened Apr 7, 2022 by Fizzbb

device-plugin installation error, Init:CrashLoopBackOff

#122 opened Apr 6, 2022 by Fizzbb

vgpu-server container failed to start, "run/nvidia-persistenced/socket" no such device or address

#119 opened Apr 3, 2022 by Fizzbb

intercept lib launched through LD_PRELOAD cannot intercept CUDA driver API calls with PyTorch version >=1.10

#114 opened Mar 30, 2022 by Fizzbb

use nsight system inside containers

#112 opened Mar 23, 2022 by Fizzbb

Add pre-start hook to all containers in container runtime to support GPU access

#110 opened Mar 15, 2022 by Fizzbb

same node pods communication through unix socket

#109 opened Mar 15, 2022 by Fizzbb

create an exporter to export burst, overuse and window-size metrics to prometheus.

#108 opened Mar 8, 2022 by pint1022

setup multiple nodes cluster for kubeshare performance testing

#107 opened Mar 8, 2022 by pint1022

setup tf-serving testing environment for kubeshare throughput testing

#106 opened Mar 8, 2022 by pint1022

horovod mnist.py has higher utilization number. what does it do?

#105 opened Mar 8, 2022 by pint1022

GPU sharing corner case: vGPUs spread to two or more physical GPUs

#98 opened Feb 18, 2022 by Fizzbb

Design and implement a good GPU utilization metric

#93 opened Feb 2, 2022 by Fizzbb

revise alnair devicepluginserver to connect the running pod/container info with the device

#92 opened Feb 2, 2022 by YHDING23

Add GPU metrics to Pod metrics for Job metadata

#91 opened Feb 2, 2022 by Fizzbb

Patch Pod Spec Annotations

#89 opened Jan 26, 2022 by YHDING23

Kubeshare prototyping and compute sharing deep dive

#88 opened Jan 26, 2022 by Fizzbb

how to run the throughput testing on Kubeshare?

#86 opened Jan 26, 2022 by pint1022

fairseq multihead_attention, torch.cat causes RuntimeError: CUDA out of memory

#83 opened Jan 25, 2022 by Fizzbb
