Update release information and version number. (#319)
JamesTheZ authored May 11, 2022
1 parent e371ff7 commit 524a60c
Showing 5 changed files with 96 additions and 4 deletions.
92 changes: 92 additions & 0 deletions RELEASE.md
@@ -0,0 +1,92 @@
# Release 0.2.0

# Performance Optimization

## GPU stitch fusion

Make use of GPU shared memory to fuse a reduce operator with its consumers into one kernel.
This helps to accommodate complex memory-intensive computations (e.g., LayerNorm, Softmax) in a single kernel,
reducing off-chip memory traffic as well as kernel scheduling and launch overhead.
It implements part of the functionality described in the paper [AStitch](https://dl.acm.org/doi/abs/10.1145/3503222.3507723).
It is currently being refactored to improve robustness and is therefore not enabled by default.
Users of BladeDISC can enable it by setting the environment variable `DISC_ENABLE_STITCH=true`.
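
For example, a minimal sketch of turning the feature on from Python (the environment variable name comes from the note above; setting it in the driving process before the optimized model runs is an assumption):

```python
import os

# Enable the off-by-default GPU stitch fusion for this process and its children.
os.environ["DISC_ENABLE_STITCH"] = "true"
```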

Note that the CPU stitch optimization was already released when we open-sourced the BladeDISC project; it is enabled by default.
Refer to these [materials](https://bladedisc.oss-cn-hangzhou.aliyuncs.com/docs/performance-optimization-practice.pdf) for more details on the CPU stitch technique.

## GEMM merging

Support two types of GEMM merging optimizations.
One merges two GEMMs that share an operand into a single GEMM.
The other merges two GEMMs with the same shape into a batched GEMM.
GEMM merging helps to increase hardware utilization and reduce kernel launch overhead.
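
The two merge patterns can be illustrated with a small NumPy sketch (purely illustrative, not BladeDISC code):

```python
import numpy as np

A = np.random.rand(4, 8)
B0, B1 = np.random.rand(8, 16), np.random.rand(8, 16)

# Pattern 1: two GEMMs sharing operand A -> one GEMM on concatenated operands.
merged = A @ np.concatenate([B0, B1], axis=1)                     # (4, 32)
assert np.allclose(merged, np.concatenate([A @ B0, A @ B1], axis=1))

# Pattern 2: two GEMMs with the same shape -> one batched GEMM.
A0, A1 = np.random.rand(4, 8), np.random.rand(4, 8)
batched = np.matmul(np.stack([A0, A1]), np.stack([B0, B1]))       # (2, 4, 16)
assert np.allclose(batched[0], A0 @ B0) and np.allclose(batched[1], A1 @ B1)
```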

## CPU GEMM/Convolution weight pre-packing optimization

Support weight pre-packing optimization for convolution (via the oneDNN library) and GEMM (via the MKL, oneDNN, and ACL libraries) operations.

## Convolution layout optimization and transpose elimination

Support transforming the layout of convolution operators into the most efficient format for the target device (CPU or GPU).
Most of the introduced transpose operators can then be eliminated by a subsequent transpose-simplifier pass.
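
A small NumPy sketch of the idea (illustrative only; NHWC as the device-preferred layout is an assumption for this example):

```python
import numpy as np

x_nchw = np.random.rand(1, 3, 32, 32)            # original NCHW convolution input

# Layout transform inserted so the convolution runs in the preferred format.
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))      # NCHW -> NHWC

# A back-to-back pair of inverse transposes is a no-op; the transpose-simplifier
# pass can remove such redundant pairs.
roundtrip = np.transpose(x_nhwc, (0, 3, 1, 2))   # NHWC -> NCHW
assert np.allclose(roundtrip, x_nchw)
```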

## Other optimizations
* Optimize the schedule selection strategy for the reduce operator on GPU to enhance thread-level parallelism.
* Algebraic simplification for operators like power (see the sketch after this list).
* Support fusing a splat constant operator with its consumers, reducing memory access overhead (also sketched below).
Refer to this [issue](https://github.com/alibaba/BladeDISC/issues/113).
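
The following NumPy sketch is purely illustrative (it is not the compiler pass itself); it shows the intent of the last two items:

```python
import numpy as np

x = np.random.rand(1024)

# Algebraic simplification: a power with a small integer exponent can be
# rewritten as repeated multiplication, avoiding a generic pow call.
assert np.allclose(np.power(x, 2), x * x)

# Splat-constant fusion: instead of materializing a full constant tensor and
# reading it back from memory, the scalar is folded into the consumer.
splat = np.full_like(x, 0.5)             # what would be materialized without fusion
assert np.allclose(x * splat, x * 0.5)   # fused form only reads `x`
```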

# Function Enhancement

## CPU end-to-end optimization

Support end-to-end optimization for X86 and AArch64 CPUs.

## TorchBlade/TensorFlowBlade clustering and optimizing with TensorRT

Cluster sub-graphs according to the operators supported by TensorRT and apply TensorRT optimization to both TensorFlow and PyTorch models.

## Accelerating PyTorch Training

Release a PoC version for accelerating PyTorch training via Disc + Lazy Tensor Core;
see the related [issue](https://github.com/alibaba/BladeDISC/issues/156) and [design doc](https://github.com/alibaba/BladeDISC/blob/main/docs/design/ltc_disc.md).

## Shape analysis and simplifier enhancement

Enhance shape equality analysis based on dimension values.
Add analysis of the collapse and expand relationships between dimensions,
which helps to identify the dimension mapping between the input and output of a reshape operator.
This is the foundation for supporting GPU stitch fusion.
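
A small sketch of what the collapse/expand analysis infers for a simple reshape (illustrative, not the analysis code):

```python
import numpy as np

x = np.random.rand(2, 3, 4)     # input dims:  d0=2, d1=3, d2=4
y = x.reshape(6, 4)             # output dims: e0=6, e1=4

# Inferred dimension mapping:
#   e0 collapses (d0, d1), since 6 == 2 * 3
#   e1 maps one-to-one to d2, since 4 == 4
assert y.shape == (x.shape[0] * x.shape[1], x.shape[2])
```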

## Codegen support for int8 datatype

Support the int8 data type in code generation for memory-intensive operators (e.g., element-wise and reduce operators).

# Toolchain Support and Process Optimization

## Replay tool
Support dumping clusters together with their corresponding input data, so that developers can replay the execution.
This is effective for debugging and tuning.
Refer to this [issue](https://github.com/alibaba/BladeDISC/issues/76).

## CI optimization
Enhance the CI process of the BladeDISC repository, making it easier and more efficient for community members to contribute to BladeDISC.

## TorchBlade bazel build
Migrate TorchBlade's build toolchain from CMake to Bazel, improving maintainability.

# Other

## Example preparation

Prepare a set of commonly used models as examples for BladeDISC.
Compare the performance of BladeDISC against TensorRT, XLA, and ONNX Runtime (ORT) on these examples.

## Community TF rebase

Rebase BladeDISC onto the latest community TensorFlow codebase.

## Code maintenance

Continuous bug fixing and code refactoring.

2 changes: 1 addition & 1 deletion pytorch_blade/release_version.py
@@ -9,4 +9,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.

-__version__ = '0.0.0'
+__version__ = '0.2.0'
2 changes: 1 addition & 1 deletion tao/python/blade_disc_tf/_version.py
@@ -9,4 +9,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.

-__version__ = '0.1.0'
+__version__ = '0.2.0'
2 changes: 1 addition & 1 deletion tao/setup.py
@@ -26,7 +26,7 @@
# Package meta-data.
NAME_PREFIX = 'blade-disc'
DESCRIPTION = 'TensorFlow wrapper for Blade DISC compiler.'
-URL = 'https://github.com/pai_disc/aicompiler'
+URL = 'https://github.com/alibaba/BladeDISC'
EMAIL = '[email protected]'
AUTHOR = 'Zhu Kai'
REQUIRES_PYTHON = '>=3.6.0'
2 changes: 1 addition & 1 deletion tensorflow_blade/version.py
@@ -9,4 +9,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.

__version__ = "0.0.0"
__version__ = "0.2.0"
