Deep Learning Training Service v1.5.0

@Anbang-Hu Anbang-Hu released this 18 Apr 05:08
· 2 commits to v1.5 since this release

Job Manager

  • Remove NFS dependency during job initialization
  • Use FrameworkLauncher
  • Generalize VC resource quota (including CPU, GPU, memory)
  • Mount NFS directly into job pods using the k8s NFS plugin
  • Accelerate job endpoint bring-up
  • Fix double-counting bug in unschedulable GPU accounting
  • Retry failed jobs up to 3 times

RESTful API

  • Force entire node allocation for distributed jobs

Dashboard

  • UX improvements to View and Manage Jobs V2
  • Cluster Status V2 with heterogeneous SKU support and assigned GPU utilization display

Logging

  • Switch from fluentd to fluent-bit for log collection
  • Elasticsearch-based logging for job logs
  • Azure blob-based logging for job logs

RESTful API and MySQL DB

  • Batch kill/pause/resume/approve jobs
  • Fix the semantics of job list retrieval and further optimize it
  • Simplify DB by merging job_priorities table into jobs table

Monitoring

  • Non-idle GPU utilization
  • Rely on docker state to report network metrics when not using host network

Storage Manager

  • Auto-expiration of paths under specified parent paths
  • Remove double usage count for hard-links
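The hard-link item above can be illustrated with a minimal sketch. This is not the Storage Manager's actual code: the function name `disk_usage` is hypothetical, and the approach shown (remembering each file's device and inode pair so that extra hard links are skipped) is simply one common way to avoid counting the same data twice.

```python
import os


def disk_usage(root):
    """Return total bytes under root, counting each hard-linked file once.

    Hypothetical helper for illustration only.
    """
    seen = set()  # (device, inode) pairs that have already been counted
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # lstat so that symlinks are measured themselves, not their targets
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if key in seen:
                # Another name for a file we already counted (a hard link)
                continue
            seen.add(key)
            total += st.st_size
    return total
```

With this approach, a file that has two names in the scanned tree contributes its size once, so per-path usage reports no longer inflate when users hard-link large files.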

Repair Manager

  • Migrate jobs off VMs with uncorrectable ECC errors

Fundamental

  • Cloud init deployment of the whole cluster
  • Scaling up/down worker VMs
  • Parallel integration tests
  • NVIDIA persistence daemon enabled by default
  • Reorder GPUs to avoid NVLink break in VM