15 Jul 01:00

Anbang-Hu

97268d2

Deep Learning Training Service v1.7.0 Latest

Job Manager

Support setting different scheduling policies per VC.
- RF: Runnable first. Large jobs waiting for resources do not block later small jobs.
- FIFO: First-in first-out based on job queue time. A large job waiting for resources can block later small jobs.
Support setting max job running time (wall time) per VC. VC admins can adjust the setting for jobs.
Support limiting number of interactive GPUs per VC.
Support user global public keys, enabling users to access jobs in any cluster using their own private key.
Requeue preempted jobs at the head of the job queue.
Add an INIT process in jobs to manage signal broadcast and zombie process reaping, propagating SIGTERM to the user process.
Delete very old jobs in small batches to avoid locking DB.
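
The difference between the RF and FIFO policies described above can be illustrated with a minimal sketch. This is not the Job Manager's actual scheduler; the `Job` fields, the `schedule` function, and the single GPU counter are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    job_id: str
    queue_time: int  # smaller value means the job was queued earlier
    gpus: int        # number of GPUs the job requests


def schedule(queue: List[Job], free_gpus: int, policy: str) -> List[str]:
    """Return the ids of the jobs that can start now under the given policy."""
    if policy not in ("RF", "FIFO"):
        raise ValueError(f"unknown policy: {policy}")
    started = []
    for job in sorted(queue, key=lambda j: j.queue_time):
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            started.append(job.job_id)
        elif policy == "FIFO":
            # Head-of-line blocking: nothing queued behind a waiting job starts.
            break
        # Under RF the waiting job is skipped and later jobs are still tried.
    return started


queue = [Job("large", 1, 8), Job("small-a", 2, 1), Job("small-b", 3, 2)]
print(schedule(queue, free_gpus=4, policy="FIFO"))  # []
print(schedule(queue, free_gpus=4, policy="RF"))    # ['small-a', 'small-b']
```

With 4 free GPUs, the 8-GPU job at the head of the queue blocks everything under FIFO, while RF lets the two small jobs run.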

Restful API

Allow specifying max retry count for each job.
Support changing parameters per VC:
- Max job time
- Max number of interactive GPUs
- Scheduling policy
Allow adding user IP for allowlist.
VC quota management proportional to GPU/CPU.

Dashboard

VC notification
Show worker node count for pure CPU cluster.
Add timeout column for jobs in View and Manage Jobs.
Show insight message(s) on job details page for running jobs.
Show repair message(s) on job details page for running jobs.
Add Visual Studio Code (alpha) as an endpoint on job details page.
Allow downloading full job logs.
Allow specifying max retry count on job submission page.
Show repair status for worker nodes.
Show snapshot time on STORAGE tab.
Support exporting STORAGE tab as csv.
Add SETTINGS tab for VC admins to manage VC parameters.
Add a hidden page for cluster admins to manage VC quota.
Add My SSH Keys page for users to upload global public keys.
Add My Allowed IP page for users to self-serve allowing their IP.

Monitoring and RepairManager

Fix incorrect mapping for DCGM GPU metrics.
Auto-manage repair cycle of nodes according to a predefined set of rules.
Add a Node Repair State dashboard for repair monitoring.

Storage Manager

Delete an expired directory file-by-file to avoid locking NFS.
Take ctime into consideration when expiring files.
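
As a rough sketch of the two Storage Manager items above: the release notes do not say how ctime enters the expiry decision, so the rule below (a file counts as expired only when both its mtime and its ctime are older than the cutoff) is an assumption, as are the function names and the optional pause between deletions. It assumes a POSIX filesystem, where `st_ctime` is the inode change time.

```python
import os
import time
from typing import List, Optional


def is_expired(path: str, max_age_seconds: float, now: float) -> bool:
    """Assumed rule: use the more recent of mtime and ctime as 'last touched'."""
    st = os.lstat(path)
    last_touched = max(st.st_mtime, st.st_ctime)
    return now - last_touched > max_age_seconds


def expire_directory(root: str, max_age_seconds: float,
                     now: Optional[float] = None,
                     pause_seconds: float = 0.0) -> List[str]:
    """Remove expired files one at a time, then prune emptied subdirectories."""
    now = time.time() if now is None else now
    removed = []
    # Walk bottom-up so a subdirectory is already empty when we try to remove it.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if is_expired(path, max_age_seconds, now):
                os.remove(path)
                removed.append(path)
                time.sleep(pause_seconds)  # optional throttle between deletions
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
    return removed
```

Under this rule, a file whose mtime was set far in the past but whose metadata changed recently (for example, a file that was just copied or moved in) is not expired immediately.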

Lustre

Support default storage quota per person (with configurable hard/soft limit and grace period).
Support multi-MDT in auto-deployment pipeline.
Support grouping OSTs into pools and mapping pools to VCs to achieve performance isolation.

(Azure) AllowList Manager

Periodically compare the current allowed user IPs in DB and in Azure NSG rule, and make changes accordingly.
Expire user IPs after a specified number of days.
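
The periodic comparison and expiry described above amount to a set reconciliation. The sketch below only computes the plan and makes no Azure calls. The function name, the shape of the inputs (a mapping from IP to the time it was added, and a set of IPs currently in the NSG rule), and the example addresses are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Set, Tuple


def plan_allowlist_sync(db_ips: Dict[str, datetime], nsg_ips: Set[str],
                        now: datetime, max_age_days: int
                        ) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (to_add, to_remove, expired) so the NSG rule matches the DB."""
    max_age = timedelta(days=max_age_days)
    expired = {ip for ip, added in db_ips.items() if now - added > max_age}
    desired = set(db_ips) - expired
    to_add = desired - nsg_ips      # in DB but missing from the NSG rule
    to_remove = nsg_ips - desired   # in the NSG rule but expired or not in DB
    return to_add, to_remove, expired


now = datetime(2020, 7, 15)
db = {
    "203.0.113.10": datetime(2020, 7, 10),  # recent, not yet in the NSG rule
    "203.0.113.20": datetime(2020, 5, 1),   # older than 30 days
}
nsg = {"203.0.113.20", "198.51.100.5"}
to_add, to_remove, expired = plan_allowlist_sync(db, nsg, now, max_age_days=30)
print(sorted(to_add))     # ['203.0.113.10']
print(sorted(to_remove))  # ['198.51.100.5', '203.0.113.20']
print(sorted(expired))    # ['203.0.113.20']
```

Treating the DB as the source of truth keeps the sync idempotent: running it again after applying the plan yields empty `to_add` and `to_remove` sets.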

27 May 21:42

Anbang-Hu

v1.6.0

7556f34

Deep Learning Training Service v1.6.0

Job Manager

Use username to run inference worker command
Support preempting inference jobs
Support preempting running preemptible jobs

Restful API

Override GPU type in job submission to avoid incorrect resource accounting by GPU type

Monitoring

NVSM health metrics for DGX-2
Add Prometheus aggregate rules for federation scrape
Expose health and performance metrics in Lustre
Remove data retirement in job-exporter metrics collection to avoid missing data
Expose Infiniband metrics
Distinguish metrics from preemptible jobs
GPU hours at cluster, VC, user, and job level
NFS storage usage by user
Monitor job pod phase
Add a centralized email sender

Dashboard

Storage tab in cluster status
End-to-end test in browser

Deployment

Lustre integration in cloud init deployment pipeline
Map old configs to cloud init format

Insight

Provide insight for running GPU jobs at backend

18 Apr 05:08

Anbang-Hu

v1.5.0

769187d

Deep Learning Training Service v1.5.0

Job Manager

Remove NFS dependency during job initialization
Use FrameworkLauncher
Generalize VC resource quota (including CPU, GPU, memory)
Direct NFS client mount into job pods using k8s nfs plugin
Accelerate job endpoint bring-up
Fix double count bug in unschedulable GPU accounting
Retry failed jobs 3 times

Restful API

Force entire node allocation for distributed jobs

Dashboard

UX improvement on View and Manage Jobs V2
Cluster Status V2 with heterogeneous SKU support and assigned GPU utilization display

Logging

Switch from fluentd to fluent-bit in log collection
Elasticsearch based logging for job logs
Azure blob based logging for job logs

Restful API and MySQL DB

Batch kill/pause/resume/approve jobs
Fix the semantics of job list retrieval and further optimize it
Simplify DB by merging job_priorities table into jobs table

Monitoring

Non-idle GPU utilization
Rely on docker state to report network metrics when not using host network

Storage Manager

Auto-expiration of paths under specified paths
Remove double usage count for hard-links
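
One common way to avoid counting hard-linked files twice is to track each inode once by its `(st_dev, st_ino)` pair. The sketch below illustrates that general technique and is not necessarily how the Storage Manager implements it. The `disk_usage` name is hypothetical, and it sums apparent file sizes rather than allocated blocks.

```python
import os


def disk_usage(root: str) -> int:
    """Sum file sizes under root, counting each hard-linked inode only once."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue  # another name for an inode already counted
                seen.add(key)
            total += st.st_size
    return total
```

Only files with a link count above one are tracked, so the `seen` set stays small on trees where hard links are rare.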

Repair Manager

Job migration out of VM with uncorrectable ECC error

Fundamental

Cloud init deployment of the whole cluster
Scaling up/down worker VMs
Parallel integration tests
Nvidia persistence daemon by default
Reorder GPUs to avoid NVLink break in VM

04 Feb 05:17

Anbang-Hu

v1.4.0

c52a2da

Deep Learning Training Service v1.4.0

Job Manager

Improve 95th percentile job creation (from job submission to "scheduling") time from 400s to 46s.
Speed up job initialization by prebuilding and copying required apt packages from an init container
Per-user password for ssh login for user jobs
Azure blobfuse plugin(s) for a job
Custom docker registry secret(s) for a job
Scheduling jobs on pure CPU machines
VC machine hard assignment
Provide consistent environment variables for training in both interactive and non-interactive SSH

Restful API

Improve 95th percentile latency for job info and permission related Restful APIs from 2000ms to <500ms.

Web Portal (Dashboard)

Speed up page loading for "View and Manage Jobs" - "View and Manage Jobs V2"
Dashboard as a Kubernetes service

User Synchronization

Automate the user/group permission update process

Storage Manager

Scan NFS and send alert email for over-sized (boundary) paths when NFS storage usage exceeds threshold.

Repair Manager

Detect and send alert email for uncorrectable ECC errors

Fundamental

Fix occasional NFS mount failures upon machine restart

07 Oct 20:15

Anbang-Hu

v1.2.0

8767c65

Deep Learning Training Service v1.2.0

JobManager

Priority-adjustable job scheduling (by VC admin)
Job pausing and resuming
VC level user quota control
Infiniband topology aware scheduling
Support inference job type

WebPortal

New webportal in ReactJS and Koa
GPU fragmentation histogram on job submission page
Idle GPU count, monthly booked GPU hours and monthly idle GPU hours per user
Per VC GPU usage dashboard

Fundamental

Linux kernel upgrade from 4.x to 5.x
K8s upgrade from v1.9.0 to v1.15.2
NVIDIA driver upgrade to 430 series

07 Aug 18:50

Anbang-Hu

v1.1.0

1375359

Deep Learning Training Service v1.1.0

Jobmanager

Refactor jobmanager code.
Support distributed non-interactive (batch) job.
Restart failed jobs due to machine failure.
Restart preemptible jobs.
Support GPU packing and master-worker collocation (for distributed jobs) with k8s pod affinity.
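
For readers unfamiliar with k8s pod affinity, the sketch below builds the kind of `affinity` stanza that asks the scheduler to prefer placing a pod on a node already running pods of the same job. The release notes do not show the manifests the jobmanager generates, so the `jobId` label key, the weight, and the helper name are assumptions. The field names and the `kubernetes.io/hostname` topology key are standard Kubernetes.

```python
import json


def packing_affinity(job_id: str, weight: int = 100) -> dict:
    """Build a soft pod-affinity stanza that co-locates pods of one job."""
    return {
        "podAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,  # Kubernetes accepts 1 to 100
                    "podAffinityTerm": {
                        "labelSelector": {
                            "matchExpressions": [
                                # "jobId" is a hypothetical label key.
                                {"key": "jobId", "operator": "In",
                                 "values": [job_id]}
                            ]
                        },
                        # Co-locate at node granularity.
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }


print(json.dumps(packing_affinity("job-123"), indent=2))
```

The returned dictionary would go under `spec.affinity` of a pod spec. A preferred (soft) rule is used here so that a pod can still be scheduled when no node can host the whole job.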

Monitoring

GPU aggregated ECC error detection and alerting.
Cluster-wide GPU utilization dashboard.
Performance metrics on database connection latency for jobmanager and restfulapi server.
Configurable auto-kill (with email) for idle jobs in each virtual cluster.

Web Portal

Enable preemptible jobs.
Clean up advanced tab in job submission page.

Releases: microsoft/DLWorkspace
