Releases: microsoft/DLWorkspace

Deep Learning Training Service v1.7.0

15 Jul 01:00
97268d2

Job Manager

  • Support setting different scheduling policies per VC.
    • RF: Runnable first. Large jobs waiting for resources do not block later small jobs.
    • FIFO: First-in first-out based on job queue time. A large job waiting for resources can block later small jobs.
  • Support setting max job running time (wall time) per VC. VC admins can adjust the setting for jobs.
  • Support limiting the number of interactive GPUs per VC.
  • Support user global public keys, enabling users to access jobs in any cluster using their own private key.
  • Requeue preempted jobs at the head of the job queue.
  • Add an INIT process in jobs to manage signal broadcast and zombie process reaping, propagating SIGTERM to the user process.
  • Delete very old jobs in small batches to avoid locking DB.
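
The difference between the RF and FIFO policies described above can be illustrated with a minimal sketch. This is not the Job Manager's actual scheduler: the `Job` class, the `schedule` function, and the single GPU counter are hypothetical simplifications chosen only to show how a large waiting job behaves under each policy.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    queue_time: int  # lower value means the job was queued earlier
    gpus: int        # number of GPUs the job requests


def schedule(jobs, free_gpus, policy):
    """Return the names of jobs that would start under `policy` ("RF" or "FIFO")."""
    if policy not in ("RF", "FIFO"):
        raise ValueError(f"unknown policy: {policy}")
    started = []
    for job in sorted(jobs, key=lambda j: j.queue_time):
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            started.append(job.name)
        elif policy == "FIFO":
            # FIFO: the first job that does not fit blocks everything behind it.
            break
        # RF: skip the job that does not fit and keep looking for runnable jobs.
    return started


queue = [Job("large", 1, 8), Job("small-a", 2, 1), Job("small-b", 3, 2)]
print(schedule(queue, free_gpus=4, policy="RF"))    # ['small-a', 'small-b']
print(schedule(queue, free_gpus=4, policy="FIFO"))  # []
```

With 4 free GPUs, the 8-GPU job cannot start. Under RF the two small jobs run anyway; under FIFO they wait behind it.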

Restful API

  • Allow specifying max retry count for each job.
  • Support changing parameters per VC:
    • Max job time
    • Max number of interactive GPUs
    • Scheduling policy
  • Allow adding user IPs to the allowlist.
  • VC quota management proportional to GPU/CPU.

Dashboard

  • VC notification
  • Show worker node count for pure CPU cluster.
  • Add timeout column for jobs in View and Manage Jobs.
  • Show insight message(s) on job details page for running jobs.
  • Show repair message(s) on job details page for running jobs.
  • Add Visual Studio Code (alpha) as an endpoint on job details page.
  • Allow downloading full job logs.
  • Allow specifying max retry count on job submission page.
  • Show repair status for worker nodes.
  • Show snapshot time on STORAGE tab.
  • Support exporting STORAGE tab as CSV.
  • Add SETTINGS tab for VC admins to manage VC parameters.
  • Add a hidden page for cluster admins to manage VC quota.
  • Add My SSH Keys page for users to upload global public keys.
  • Add My Allowed IP page for users to self-serve allowing their IP.

Monitoring and RepairManager

  • Fix incorrect mapping for DCGM GPU metrics.
  • Auto-manage the repair cycle of nodes according to a predefined set of rules.
  • Add a Node Repair State dashboard for repair monitoring.

Storage Manager

  • Delete an expired directory file-by-file to avoid locking NFS.
  • Take ctime into consideration when expiring files.
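
The two Storage Manager changes above can be sketched as follows. This is an illustration under stated assumptions, not the Storage Manager's code: the function names are hypothetical, the expiry rule (use the more recent of mtime and ctime) is one plausible reading of "take ctime into consideration", and the optional pause between deletions is an assumed way to spread load on the file server.

```python
import os
import time


def is_expired(path, max_age_seconds, now=None):
    """Treat a file as expired only if both mtime and ctime are older than the limit."""
    now = time.time() if now is None else now
    st = os.lstat(path)
    # Assumption: the more recent of mtime and ctime is the "last touched" time.
    return now - max(st.st_mtime, st.st_ctime) > max_age_seconds


def delete_tree_file_by_file(root, pause_seconds=0.0):
    """Remove `root` one entry at a time, bottom-up; return the number of files removed."""
    removed = 0
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            os.remove(os.path.join(dirpath, name))
            removed += 1
            if pause_seconds:
                time.sleep(pause_seconds)  # hypothetical throttle between deletions
        for name in dirnames:
            full = os.path.join(dirpath, name)
            # os.walk lists symlinks to directories here without descending into them.
            if os.path.islink(full):
                os.remove(full)
            else:
                os.rmdir(full)
    os.rmdir(root)
    return removed
```

Deleting entry by entry keeps each file system operation small, in contrast to a single recursive removal issued against the whole directory.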

Lustre

  • Support default storage quota per person (with configurable hard/soft limit and grace period).
  • Support multi-MDT in auto-deployment pipeline.
  • Support grouping OSTs into pools and mapping pools to VCs to achieve performance isolation.

(Azure) AllowList Manager

  • Periodically compare the current allowed user IPs in DB and in Azure NSG rule, and make changes accordingly.
  • Expire user IPs after a specified number of days.
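
The periodic comparison and expiry described above amounts to a set reconciliation, sketched below. The `reconcile` function and its data shapes are hypothetical; the sketch makes no Azure calls and only computes which IPs would need to be added to or removed from the NSG rule.

```python
from datetime import datetime, timedelta


def reconcile(db_entries, nsg_ips, now, max_age_days):
    """Return (to_add, to_remove) so the NSG rule matches the unexpired DB entries.

    db_entries maps an IP string to the datetime it was added (assumed schema).
    nsg_ips is the collection of IPs currently present in the NSG rule.
    """
    cutoff = now - timedelta(days=max_age_days)
    # Entries older than the cutoff are treated as expired and dropped.
    desired = {ip for ip, added in db_entries.items() if added >= cutoff}
    current = set(nsg_ips)
    return desired - current, current - desired


now = datetime(2020, 7, 15)
db = {"10.0.0.1": datetime(2020, 7, 10), "10.0.0.2": datetime(2020, 5, 1)}
to_add, to_remove = reconcile(db, {"10.0.0.2", "10.0.0.3"}, now, max_age_days=30)
print(sorted(to_add))     # ['10.0.0.1']
print(sorted(to_remove))  # ['10.0.0.2', '10.0.0.3']
```

Here `10.0.0.2` is removed because it has expired, and `10.0.0.3` is removed because it is not in the DB at all.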

Deep Learning Training Service v1.6.0

27 May 21:42
7556f34

Job Manager

  • Use username to run inference worker command
  • Support preempting inference jobs
  • Support preempting running preemptible jobs

Restful API

  • Override GPU type in job submission to avoid incorrect resource accounting by GPU type

Monitoring

  • NVSM health metrics for DGX-2
  • Add Prometheus aggregate rules for federation scrape
  • Expose health and performance metrics in Lustre
  • Remove data retirement in job-exporter metrics collection to avoid missing data
  • Expose Infiniband metrics
  • Distinguish metrics from preemptible jobs
  • GPU hours at cluster, VC, user, and job level
  • NFS storage usage by user
  • Monitor job pod phase
  • Add a centralized email sender

Dashboard

  • Storage tab in cluster status
  • End-to-end test in browser

Deployment

  • Lustre integration in cloud init deployment pipeline
  • Map old configs to cloud init format

Insight

  • Provide insight for running GPU jobs at backend

Deep Learning Training Service v1.5.0

18 Apr 05:08

Job Manager

  • Remove NFS dependency during job initialization
  • Use FrameworkLauncher
  • Generalize VC resource quota (including CPU, GPU, memory)
  • Direct NFS client mount into job pods using k8s nfs plugin
  • Accelerate job endpoint bring-up
  • Fix double count bug in unschedulable GPU accounting
  • Retry failed jobs 3 times

Restful API

  • Force entire node allocation for distributed jobs

Dashboard

  • UX improvement on View and Manage Jobs V2
  • Cluster Status V2 with heterogeneous SKU support and assigned GPU utilization display

Logging

  • Switch from fluentd to fluent-bit in log collection
  • Elasticsearch based logging for job logs
  • Azure blob based logging for job logs

Restful API and MySQL DB

  • Batch kill/pause/resume/approve jobs
  • Fix the semantics of job list retrieval and optimize it further
  • Simplify DB by merging job_priorities table into jobs table

Monitoring

  • Non-idle GPU utilization
  • Rely on docker state to report network metrics when not using host network

Storage Manager

  • Auto-expiration of paths under specified paths
  • Remove double usage count for hard-links
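
The hard-link fix above can be illustrated with a short sketch that counts each inode once. It assumes a POSIX-style file system where `st_dev` and `st_ino` together identify a file; the `disk_usage` function is hypothetical and is not the Storage Manager's implementation.

```python
import os


def disk_usage(root):
    """Sum file sizes under `root`, counting each inode once even if hard-linked."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)  # identifies the underlying file
            if key in seen:
                continue  # another hard link to a file already counted
            seen.add(key)
            total += st.st_size
    return total
```

Without the `seen` set, a 1 GB file with two hard links under the same path would be reported as 2 GB of usage.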

Repair Manager

  • Job migration out of VM with uncorrectable ECC error

Fundamental

  • Cloud init deployment of the whole cluster
  • Scaling up/down worker VMs
  • Parallel integration tests
  • Nvidia persistence daemon by default
  • Reorder GPUs to avoid NVLink break in VM

Deep Learning Training Service v1.4.0

04 Feb 05:17

Job Manager

  • Improve 95th percentile job creation (from job submission to "scheduling") time from 400s to 46s.
  • Speed up job initialization by prebuilding and copying required apt packages from an init container
  • Per-user password for ssh login for user jobs
  • Azure blobfuse plugin(s) for a job
  • Custom docker registry secret(s) for a job
  • Scheduling jobs on pure CPU machines
  • VC machine hard assignment
  • Provide consistent environment variables for training in both interactive and non-interactive SSH

Restful API

  • Improve 95th percentile latency for job info and permission related Restful APIs from 2000ms to <500ms.

Web Portal (Dashboard)

  • Speed up page loading for "View and Manage Jobs" (as "View and Manage Jobs V2")
  • Dashboard as a Kubernetes service

User Synchronization

  • Automate the user/group permission update process

Storage Manager

  • Scan NFS and send alert email for over-sized (boundary) paths when NFS storage usage exceeds threshold.

Repair Manager

  • Detect and send alert email for uncorrectable ECC errors

Fundamental

  • Fix occasionally failed NFS mounting upon machine restart

Deep Learning Training Service v1.2.0

07 Oct 20:15
8767c65

Job Manager

  • Priority-adjustable job scheduling (by VC admin)
  • Job pausing and resuming
  • VC level user quota control
  • Infiniband topology aware scheduling
  • Support inference job type

Web Portal

  • New webportal in ReactJS and Koa
  • GPU fragmentation histogram on job submission page
  • Idle GPU count, monthly booked GPU hours and monthly idle GPU hours per user
  • Per VC GPU usage dashboard

Fundamental

  • Linux kernel upgrade from 4.x to 5.x
  • K8s upgrade from v1.9.0 to v1.15.2
  • NVIDIA driver upgrade to 430 series

Deep Learning Training Service v1.1.0

07 Aug 18:50
1375359

Job Manager

  • Refactor jobmanager code.
  • Support distributed non-interactive (batch) jobs.
  • Restart jobs that failed due to machine failure.
  • Restart preemptible jobs.
  • Support GPU packing and master-worker collocation (for distributed jobs) with k8s pod affinity.

Monitoring

  • GPU aggregated ECC error detection and alerting.
  • Cluster-wide GPU utilization dashboard.
  • Performance metrics on database connection latency for jobmanager and restfulapi server.
  • Configurable auto-kill (with email) for idle jobs in each virtual cluster.

Web Portal

  • Enable preemptible jobs.
  • Clean up advanced tab in job submission page.