Releases: microsoft/DLWorkspace
Deep Learning Training Service v1.7.0
Job Manager
- Support setting different scheduling policies per VC:
  - RF: Runnable first. Large jobs waiting for resources do not block later small jobs.
  - FIFO: First-in first-out based on job queue time. A large job waiting for resources can block later small jobs.
- Support setting max job running time (wall time) per VC. VC admins can adjust the setting for jobs.
- Support limiting number of interactive GPUs per VC.
- Support user global public keys, enabling users to access jobs in any cluster using their own private key.
- Requeue preempted jobs at the head of the job queue.
- Add an INIT process in jobs to manage signal broadcasting and zombie process reaping, propagating SIGTERM to the user process.
- Delete very old jobs in small batches to avoid locking DB.
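The difference between the RF and FIFO policies listed above can be illustrated with a minimal sketch. It is not the Job Manager's actual scheduler: the `Job` class, the `pick_jobs` function, and the single GPU counter are hypothetical names and simplifications chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    gpus: int        # GPUs requested (hypothetical field)
    queue_time: int  # lower value means the job was queued earlier


def pick_jobs(queue, free_gpus, policy):
    """Return the ids of jobs to start under the "FIFO" or "RF" policy."""
    if policy not in ("FIFO", "RF"):
        raise ValueError(f"unknown policy: {policy}")
    started = []
    for job in sorted(queue, key=lambda j: j.queue_time):
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            started.append(job.job_id)
        elif policy == "FIFO":
            # The earliest waiting job cannot run, so later jobs wait too.
            break
        # Under RF, skip the blocked job and keep looking for runnable ones.
    return started


queue = [Job("large", gpus=8, queue_time=1), Job("small", gpus=1, queue_time=2)]
print(pick_jobs(queue, free_gpus=4, policy="FIFO"))  # []
print(pick_jobs(queue, free_gpus=4, policy="RF"))    # ['small']
```

With 4 free GPUs, the 8-GPU job blocks the queue under FIFO, while RF lets the 1-GPU job start.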
Restful API
- Allow specifying max retry count for each job.
- Support changing parameters per VC:
  - Max job time
  - Max number of interactive GPUs
  - Scheduling policy
- Allow adding user IPs to the allowlist.
- VC quota management proportional to GPU/CPU.
Dashboard
- VC notification
- Show worker node count for pure CPU cluster.
- Add timeout column for jobs in View and Manage Jobs.
- Show insight message(s) on job details page for running jobs.
- Show repair message(s) on job details page for running jobs.
- Add Visual Studio Code (alpha) as an endpoint on job details page.
- Allow downloading full job logs.
- Allow specifying max retry count on job submission page.
- Show repair status for worker nodes.
- Show snapshot time on STORAGE tab.
- Support exporting STORAGE tab as CSV.
- Add SETTINGS tab for VC admins to manage VC parameters.
- Add a hidden page for cluster admins to manage VC quota.
- Add My SSH Keys page for users to upload global public keys.
- Add My Allowed IP page for users to self-serve allowing their IP.
Monitoring and RepairManager
- Fix incorrect mapping for DCGM GPU metrics.
- Auto-manage repair cycle of nodes according to predefined set of rules.
- Add a Node Repair State dashboard for repair monitoring.
Storage Manager
- Delete an expired directory file-by-file to avoid locking NFS.
- Take ctime into consideration when expiring files.
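The two Storage Manager items above can be sketched as follows. This is an illustration under assumptions, not the Storage Manager's code: the function name `expire_files` and the rule "a file expires only when both mtime and ctime are older than the cutoff" are hypothetical, since the notes do not say exactly how ctime is used.

```python
import os
import time


def expire_files(root, max_age_days, now=None):
    """Delete expired files under root one at a time; return the removed paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    # Walk bottom-up so emptied subdirectories can be removed as we go.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Assumption: use the newer of mtime and ctime as the file's age.
            if max(st.st_mtime, st.st_ctime) < cutoff:
                os.remove(path)  # one small operation per file
                removed.append(path)
        # Drop a subdirectory once it is empty, but never the root itself.
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
    return removed
```

Removing files individually keeps each filesystem operation short, which is the motivation the notes give for avoiding one large recursive delete on NFS.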
Lustre
- Support default storage quota per person (with configurable hard/soft limit and grace period).
- Support multi-MDT in auto-deployment pipeline.
- Support grouping OSTs into pools and mapping pools to VCs to achieve performance isolation.
(Azure) AllowList Manager
- Periodically compare the current allowed user IPs in DB and in Azure NSG rule, and make changes accordingly.
- Expire user IPs after a specified number of days.
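A periodic comparison like the one described above reduces to set differences. The sketch below assumes a hypothetical `reconcile` function that receives the DB entries and the IPs currently in the NSG rule as plain Python values; it makes no Azure or database calls, and the example addresses and dates are made up.

```python
from datetime import datetime, timedelta


def reconcile(db_entries, nsg_ips, now, ttl_days):
    """Compare allowed IPs in the DB with those in the NSG rule.

    db_entries: dict mapping IP -> datetime when it was added (hypothetical shape).
    nsg_ips: set of IPs currently present in the NSG rule.
    Returns (to_add, to_remove, expired) as sets of IPs.
    """
    cutoff = now - timedelta(days=ttl_days)
    expired = {ip for ip, added in db_entries.items() if added < cutoff}
    desired = set(db_entries) - expired
    to_add = desired - nsg_ips     # allowed in DB but missing from the rule
    to_remove = nsg_ips - desired  # in the rule but expired or not in DB
    return to_add, to_remove, expired


now = datetime(2020, 6, 30)
db = {"10.0.0.1": datetime(2020, 6, 29), "10.0.0.2": datetime(2020, 5, 1)}
nsg = {"10.0.0.2", "10.0.0.3"}
to_add, to_remove, expired = reconcile(db, nsg, now, ttl_days=30)
print(sorted(to_add), sorted(to_remove), sorted(expired))
```

Treating the DB as the source of truth means a stale or manually added rule entry is removed on the next pass.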
Deep Learning Training Service v1.6.0
Job Manager
- Use username to run inference worker command
- Support preempting inference jobs
- Support preempting running preemptible jobs
Restful API
- Override GPU type in job submission to avoid incorrect resource accounting by GPU type
Monitoring
- NVSM health metrics for DGX-2
- Add Prometheus aggregate rules for federation scrape
- Expose health and performance metrics in Lustre
- Remove data retirement in job-exporter metrics collection to avoid missing data
- Expose Infiniband metrics
- Distinguish metrics from preemptible jobs
- GPU hours at cluster, VC, user, and job level
- NFS storage usage by user
- Monitor job pod phase
- Add a centralized email sender
Dashboard
- Storage tab in cluster status
- End-to-end test in browser
Deployment
- Lustre integration in cloud init deployment pipeline
- Map old configs to cloud init format
Insight
- Provide insight for running GPU jobs at backend
Deep Learning Training Service v1.5.0
Job Manager
- Remove NFS dependency during job initialization
- Use FrameworkLauncher
- Generalize VC resource quota (including CPU, GPU, memory)
- Direct NFS client mount into job pods using k8s nfs plugin
- Accelerate job endpoint bring-up
- Fix double count bug in unschedulable GPU accounting
- Retry failed jobs 3 times
Restful API
- Force entire node allocation for distributed jobs
Dashboard
- UX improvement on View and Manage Jobs V2
- Cluster Status V2 with heterogeneous SKU support and assigned GPU utilization display
Logging
- Switch from fluentd to fluent-bit in log collection
- Elasticsearch based logging for job logs
- Azure blob based logging for job logs
Restful API and MySQL DB
- Batch kill/pause/resume/approve jobs
- Fix the semantics of and further optimize job list retrieval
- Simplify DB by merging job_priorities table into jobs table
Monitoring
- Non-idle GPU utilization
- Rely on docker state to report network metrics when not using host network
Storage Manager
- Auto-expiration of paths under specified paths
- Remove double usage count for hard-links
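Counting a hard-linked file once is usually done by tracking device and inode numbers. The sketch below shows that general technique; `disk_usage` is a hypothetical name and this is not the Storage Manager's implementation.

```python
import os


def disk_usage(root):
    """Sum file sizes under root, counting each hard-linked inode only once."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if st.st_nlink > 1:
                # Several names point at this inode; count it the first time only.
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue
                seen.add(key)
            total += st.st_size
    return total
```

Only files with more than one link are recorded in `seen`, which keeps memory use small on trees where hard links are rare.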
Repair Manager
- Job migration out of VM with uncorrectable ECC error
Fundamental
- Cloud init deployment of the whole cluster
- Scaling up/down worker VMs
- Parallel integration tests
- Nvidia persistence daemon by default
- Reorder GPUs to avoid NVLink break in VM
Deep Learning Training Service v1.4.0
Job Manager
- Improve 95th percentile job creation (from job submission to "scheduling") time from 400s to 46s.
- Speed up job initialization by prebuilding and copying required apt packages from an init container
- Per-user password for ssh login for user jobs
- Azure blobfuse plugin(s) for a job
- Custom docker registry secret(s) for a job
- Scheduling jobs on pure CPU machines
- VC machine hard assignment
- Provide consistent environment variables for training in both interactive and non-interactive SSH
Restful API
- Improve 95th percentile latency for job info and permission related Restful APIs from 2000ms to <500ms.
Web Portal (Dashboard)
- Speed up page loading for "View and Manage Jobs" ("View and Manage Jobs V2")
- Dashboard as a Kubernetes service
User Synchronization
- Automate the user/group permission update process
Storage Manager
- Scan NFS and send alert email for over-sized (boundary) paths when NFS storage usage exceeds threshold.
Repair Manager
- Detect and send alert email for uncorrectable ECC errors
Fundamental
- Fix occasionally failed NFS mounting upon machine restart
Deep Learning Training Service v1.2.0
Job Manager
- Priority-adjustable job scheduling (by VC admin)
- Job pausing and resuming
- VC level user quota control
- Infiniband topology aware scheduling
- Support inference job type
Web Portal
- New webportal in ReactJS and Koa
- GPU fragmentation histogram on job submission page
- Idle GPU count, monthly booked GPU hours and monthly idle GPU hours per user
- Per VC GPU usage dashboard
Fundamental
- Linux kernel upgrade from 4.x to 5.x
- K8s upgrade from v1.9.0 to v1.15.2
- NVIDIA driver upgrade to 430 series
Deep Learning Training Service v1.1.0
Job Manager
- Refactor jobmanager code.
- Support distributed non-interactive (batch) job.
- Restart jobs that failed due to machine failure.
- Restart preemptible jobs.
- Support GPU packing and master-worker collocation (for distributed jobs) with k8s pod affinity.
Monitoring
- GPU aggregated ECC error detection and alerting.
- Cluster-wide GPU utilization dashboard.
- Performance metrics on database connection latency for jobmanager and restfulapi server.
- Configurable auto-kill (with email) for idle jobs in each virtual cluster.
Web Portal
- Enable preemptible jobs.
- Clean up advanced tab in job submission page.