AI workloads are becoming the dominant workloads in cloud and edge computing. Current cloud/edge systems rely on existing hardware/software architectures to support these new AI workloads, which limits the capability of AI training/inference and increases serving cost. More intelligent, tailored hardware/software frameworks and architectures are needed to support AI workloads.
We (the Centaurus AI SIG) conduct research across the whole stack, from hardware accelerators and smart NICs to the container platform and ML frameworks. In the short term, we focus on resource management and acceleration with the latest hardware/software features, analyzing and scheduling AI workloads on existing systems with intelligent methods. In the long term, we explore new architectures that enable emerging applications on the cloud, orchestrate heterogeneous resources, and support new service models.
- Elastic platform with self-learning capability
  - Elastic training with dynamic GPU allocation
  - GPU utilization profiling for precise resource management
  - Fine-grained GPU sharing for optimized resource utilization
  - Autonomous scheduler that continuously learns from scheduling decisions and improves its policies
  - Intelligent data orchestration: locality first, prefetching and caching
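GPU utilization profiling typically starts from the driver's telemetry. The sketch below is a minimal, hypothetical profiler built on `nvidia-smi`'s CSV query interface; the query fields are standard `nvidia-smi` options, but the function names and thresholds are illustrative assumptions, not the Alnair profiler's actual code.

```python
# Minimal GPU utilization profiling sketch (illustrative, not Alnair's code).
import subprocess

# Real nvidia-smi query flags; require an NVIDIA driver to run live.
QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV rows into per-GPU stat dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        idx, util, mem_used, mem_total = (f.strip() for f in line.split(","))
        stats.append({
            "index": int(idx),
            "util_pct": int(util),                       # SM utilization, %
            "mem_frac": int(mem_used) / int(mem_total),  # memory pressure
        })
    return stats

def sample_gpu_stats():
    """Run nvidia-smi once and parse its output (needs an NVIDIA GPU)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(out.stdout)

# Offline example using captured nvidia-smi output:
sample = "0, 87, 10240, 16384\n1, 3, 512, 16384"
stats = parse_gpu_stats(sample)
idle = [g["index"] for g in stats if g["util_pct"] < 10]
print(idle)  # -> [1]: GPU 1 is nearly idle, a candidate for sharing
```

Per-GPU utilization snapshots like these are what a scheduler can feed into sharing and packing decisions.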
- Optimized ML framework
  - DataSet and DataLoader optimization to accelerate data ingestion
  - Parallelism optimization (data/model/pipeline)
  - Hyperparameter auto-tuning and architecture search
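One common DataLoader optimization is prefetching: stage upcoming batches in the background so the training loop never stalls on I/O. The sketch below shows the pattern with a background thread and a bounded queue; the `PrefetchLoader` class is a hypothetical illustration, not the project's actual loader code.

```python
# Minimal prefetching loader sketch (illustrative, assumes nothing beyond stdlib).
import queue
import threading

class PrefetchLoader:
    """Wrap any iterable and prefetch up to `depth` items ahead of the consumer."""
    def __init__(self, source, depth=2):
        self.source = source
        self.queue = queue.Queue(maxsize=depth)
        self._done = object()  # sentinel marking end of stream

    def _produce(self):
        for item in self.source:
            self.queue.put(item)    # blocks once `depth` items are staged
        self.queue.put(self._done)

    def __iter__(self):
        threading.Thread(target=self._produce, daemon=True).start()
        while True:
            item = self.queue.get()
            if item is self._done:
                return
            yield item

# Usage: wrap a (possibly slow) batch generator.
batches = PrefetchLoader(range(5), depth=2)
print(list(batches))  # -> [0, 1, 2, 3, 4]
```

Production frameworks get the same effect with worker pools (e.g. PyTorch DataLoader's `num_workers`), but the staging-queue idea is the same.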
- Cloud-native redesign for emerging applications
  - Application-specific interface and engine optimization
  - Acceleration through distributed computing
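The distributed-computing pattern behind that last item is shard-compute-aggregate: split the input across workers, let each reduce its own shard, then combine the partial results. The toy sketch below uses a thread pool to stand in for a cluster; in a real deployment the shards live on separate processes or nodes and the aggregation is an all-reduce. All names here are illustrative assumptions.

```python
# Toy shard-compute-aggregate sketch (a thread pool stands in for a cluster).
from concurrent.futures import ThreadPoolExecutor

def partial_sum(shard):
    """Each worker reduces its own shard independently."""
    return sum(x * x for x in shard)

def distributed_sum_of_squares(data, workers=4):
    shards = [data[i::workers] for i in range(workers)]  # round-robin sharding
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_sum, shards)
    return sum(partials)  # aggregation step (an all-reduce in a real system)

print(distributed_sum_of_squares(list(range(10))))  # -> 285
```

The speedup comes from the shard step running concurrently; the aggregation cost is what frameworks optimize with reduction trees and ring all-reduce.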
AI SIG Meeting: Wednesday 1:30 PM - 2:30 PM Pacific Time (weekly) | Join Meeting | Meeting Summary
Alnair Open Source Community Meeting: Wednesday 6:00 PM Pacific Time (biweekly), starting 7/21/2021 | Join Meeting | Meeting Summary