Releases
v1.5.0
Deep Learning Training Service v1.5.0
Job Manager
Remove NFS dependency during job initialization
Use FrameworkLauncher
Generalize VC resource quota (including CPU, GPU, memory)
Mount NFS directly into job pods using the k8s NFS plugin
Accelerate job endpoint bring-up
Fix double-counting bug in unschedulable GPU accounting
Retry failed jobs up to 3 times
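The retry item above can be illustrated with a minimal sketch. The names `run_with_retries`, `run_job`, and `MAX_RETRIES` are hypothetical and do not come from the release. The sketch also assumes that "3 times" means three retries after the initial attempt; the actual Job Manager logic may differ.

```python
from typing import Callable

MAX_RETRIES = 3  # assumption: three retries after the first attempt


def run_with_retries(run_job: Callable[[], bool],
                     max_retries: int = MAX_RETRIES) -> int:
    """Call run_job until it returns True or the retry budget is spent.

    Returns the number of attempts made on success and raises
    RuntimeError if every attempt fails.
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        if run_job():
            return attempts
    raise RuntimeError(f"job failed after {attempts} attempts")
```

A caller would pass any zero-argument callable that reports success as a boolean, for example a function that submits the job and waits for its final state.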
RESTful API
Force entire node allocation for distributed jobs
Dashboard
UX improvements to View and Manage Jobs V2
Cluster Status V2 with heterogeneous SKU support and assigned GPU utilization display
Logging
Switch from Fluentd to Fluent Bit for log collection
Elasticsearch-based logging for job logs
Azure Blob-based logging for job logs
RESTful API and MySQL DB
Batch kill/pause/resume/approve jobs
Fix the semantics of job list retrieval and optimize it further
Simplify DB by merging the job_priorities table into the jobs table
Monitoring
Non-idle GPU utilization
Rely on Docker state to report network metrics when not using the host network
Storage Manager
Auto-expiration of paths under specified paths
Remove double-counting of usage for hard links
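The hard-link fix above can be sketched as follows. This is an illustration, not the Storage Manager's actual code: the function name `disk_usage` is hypothetical, and the approach (remembering each `(device, inode)` pair and counting it once) is one common way to avoid double-counting hard links.

```python
import os
import stat


def disk_usage(root: str) -> int:
    """Sum the sizes of regular files under root, counting each inode once."""
    seen = set()  # (device, inode) pairs already counted
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, and other special files
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # another hard link to a file already counted
            seen.add(key)
            total += st.st_size
    return total
```

With this approach, a 10-byte file that has two hard links under the same root contributes 10 bytes to the total rather than 20.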
Repair Manager
Migrate jobs off VMs with uncorrectable ECC errors
Fundamental
Cloud init deployment of the whole cluster
Scaling up/down worker VMs
Parallel integration tests
Enable the NVIDIA persistence daemon by default
Reorder GPUs to avoid NVLink break in VM