Deep Learning Training Service v1.5.0

Anbang-Hu released this 18 Apr 05:08

Job Manager

Remove NFS dependency during job initialization
Use FrameworkLauncher
Generalize VC resource quota (including CPU, GPU, memory)
Mount NFS directly into job pods using the k8s NFS plugin
Accelerate job endpoint bring-up
Fix double-counting bug in unschedulable GPU accounting
Retry failed jobs up to 3 times

RESTful API

Force entire node allocation for distributed jobs

Dashboard

UX improvements to View and Manage Jobs V2
Cluster Status V2 with heterogeneous SKU support and assigned GPU utilization display

Logging

Switch from fluentd to fluent-bit in log collection
Elasticsearch-based logging for job logs
Azure Blob-based logging for job logs

RESTful API and MySQL DB

Batch kill/pause/resume/approve jobs
Fix the semantics of job list retrieval and further optimize it
Simplify DB by merging job_priorities table into jobs table

Monitoring

Non-idle GPU utilization
Rely on Docker state to report network metrics when not using the host network

Storage Manager

Auto-expiration of paths under specified paths
Remove double usage count for hard links
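
The release notes do not show how Storage Manager avoids counting hard-linked files twice. As a minimal sketch of the general technique, assuming usage is computed by walking a directory tree, each file can be keyed by its (device, inode) pair so that additional hard links to the same data are skipped. The function name disk_usage is hypothetical and is not taken from the project.

```python
import os


def disk_usage(root):
    """Return total bytes of files under root, counting each inode once.

    Hypothetical helper for illustration; not the Storage Manager's code.
    """
    seen = set()  # (device, inode) pairs already counted
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if st.st_nlink > 1:
                # More than one directory entry points at this inode,
                # so count it only the first time it is seen.
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue
                seen.add(key)
            total += st.st_size
    return total
```

With this approach, a 1000-byte file that has two names under the same root contributes 1000 bytes to the total rather than 2000.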

Repair Manager

Migrate jobs off VMs with uncorrectable ECC errors

Fundamental

Cloud init deployment of the whole cluster
Scaling up/down worker VMs
Parallel integration tests
Nvidia persistence daemon by default
Reorder GPUs to avoid NVLink break in VM
