Version 0.9.0

@jackalcooper released this 04 Jan 01:58

OneFlow v0.9.0 release note

OneFlow v0.9.0 is now available. Welcome to install the new version for a better experience.

  • Highlights
  • Backwards Incompatible Changes
  • New Features
  • Performance
  • Improvements
  • Bug fixes
  • Documentation
  • Edge Tools

Highlights

This update contains 640 commits and the following highlights:

  • With the addition of 86 new API interfaces and operators aligned with PyTorch and the fix of 104 bugs related to operator compatibility, OneFlow v0.9.0 provides better PyTorch API and model compatibility. In v0.9.0, users can migrate more PyTorch models to OneFlow with one click and gain faster performance.

    • Allows one-click migration of Stable Diffusion, GLM, YOLOv5, etc. to OneFlow.

    • More convenient model migration: oneflow.load supports directly loading models saved with torch.save.

    • With the newly added oneflow.mock_torch module and mock method, OneFlow can migrate complex PyTorch models that span multiple scripts with one click, without changing the original PyTorch scripts.

  • Global Tensor adds a series of interfaces and methods that are convenient for distributed programming, and fixes related known bugs.

  • Graph releases a new feature, automatic parallelism (version 1), which automatically searches for the fastest SBP combination under a specified Placement. When writing distributed models with Global Tensor, users no longer need to work out the parallel strategy themselves.

  • Graph adds a series of optimizations related to memory, execution speed, pipeline latency hiding, and compilation speed, improving performance and reducing memory overhead.

  • Graph provides a series of functions to aid debugging, including memory analysis logs, a progress display during the compilation stage, and visualization of the computation graph.

  • OneFlow IR provides more compilation optimization functions.

  • OneFlow's error messages are more user-friendly: the error content is highlighted and unnecessary internal details are simplified away, so you can see the location and type of an error at a glance.

  • A series of operator and system optimizations have been added, including Eager instruction scheduling, high-performance CUDA kernels, interconnected memory pools, etc.

Backwards Incompatible Changes

  • To solve a possible name conflict between Graph.Block.config and a user-defined module attribute module.config, OneFlow redesigned the abstraction of the Graph proxy Module/Tensor, introducing a breaking change: (#9351, https://github.com/Oneflow-Inc/oneflow/pull/9437, https://github.com/Oneflow-Inc/oneflow/pull/9607)

    • The attr and config attributes on Block are removed, and Block is renamed to Proxy;

    • Implementation: when added as members of nn.Graph, the original Eager Module and Tensor types are wrapped into the Proxy class, and the corresponding GraphModule and GraphTensor are generated. nn.Graph then uses the Proxy for graph composition and proxy execution; when the proxy executes, both the original eager type and the graph type can be obtained from it. The naming follows that of torch.fx.

    |  | Eager primitive type | Graph type (base class GraphBlock) | Proxy execution type (base class Proxy) |
    | --- | --- | --- | --- |
    | Function | Supports getting the original eager type | A GraphBlock corresponds to a code block of the graph and stores the information required for graph execution, such as name/scope, lazy op or tensor, and optimization switches of some sub-modules on the graph | Proxy execution capability, using the same execution interface as Module and Tensor, but with changed behavior (e.g. lazy execution); the ops actually executed may also be rewritten |
    | Module type | Module | GraphModule | ProxyModule, containing a Module member and a GraphModule member |
    | Tensor type | Tensor | GraphTensor | ProxyTensor, containing a Tensor member and a GraphTensor member |
    • Here is an example:
    import oneflow as flow
    import oneflow.nn as nn
    from oneflow.nn.graph import GraphModule
    linear = flow.nn.Linear(3, 8, False)
    class LinearGraph(nn.Graph):
        def __init__(self):
            super().__init__()
            # The type of linear is nn.Module. When added as an attribute of nn.Graph, it is registered with nn.Graph.
            # self.linear is wrapped as a ProxyModule.
            # self.linear.weight is wrapped as a ProxyTensor.
            # nn.Graph uses the ProxyModule to perform graph composition.
            self.linear = linear
            # A ProxyModule has two parts: the original Module and a GraphModule.
            self.linear.to(GraphModule)  # Get the corresponding GraphModule, on which you can do configuration related to graph optimization,
            # such as setting a pipeline stage for a module and enabling pipeline parallelism.
            self.linear.to(GraphModule).set_stage(id, placement)
            self.linear.to(nn.Module)  # Get the corresponding original nn.Module.
            self.linear.weight.to(flow.Tensor)  # Get the corresponding original Tensor.

Outdated interface in OneFlow v0.8.0:

import oneflow as flow
import oneflow.nn as nn
linear = flow.nn.Linear(3, 8, False)
class LinearGraph(nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = linear
        self.linear.config.set_stage(id, placement)  # set stage
        self.linear.config.activation_checkpointing = True  # set activation checkpointing
        self.linear.origin  # get the corresponding original nn.Module
        self.linear.weight.origin # get the corresponding original Tensor

New interface in OneFlow v0.9.0:

import oneflow as flow
import oneflow.nn as nn
from oneflow.nn.graph import GraphModule
linear = flow.nn.Linear(3, 8, False)
class LinearGraph(nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = linear
        self.linear.to(GraphModule).set_stage(id, placement)  # set stage
        self.linear.to(GraphModule).activation_checkpointing = True  # set activation checkpointing
        self.linear.to(nn.Module)  # get the corresponding original nn.Module
        self.linear.weight.to(flow.Tensor)  # get the corresponding original Tensor

New Features

Graph

  • Adds the automatic parallelization feature (version 1) in Graph: (#8891, #9172, #9288)

    • Automatic parallelism can be enabled by calling self.config.enable_auto_parallel(True) in a Graph. Once enabled, you don't have to configure sbp manually; the Graph automatically finds the optimal sbp combination.

    • Here is an example:

    import oneflow as flow
    class SubclassGraph(flow.nn.Graph):
        def __init__(self):
            super().__init__() # MUST be called
            # auto parallelism configuration
            self.config.enable_auto_parallel(True)
            # other configurations about auto parallelism
            # ......
    
        def build(self):
            pass
  • Graph supports straightening-algorithm optimization with memory priority: by adjusting the execution order, it shortens each Tensor's lifetime in memory to reduce peak memory usage. (#9094)

    • With self.config.enable_straighten_algorithm("MemoryFirst"), the straightening algorithm with memory optimization is enabled.

    • The available modes are as follows: "MemoryFirst" / "SpeedFirst" / "Disable" / "OverlapCpuGpu"

    • Graph also adds the "OverlapCpuGpu" algorithm, which makes CPU and GPU kernels overlap with each other as much as possible. (#9278)
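
    • A minimal sketch of enabling one of these modes inside a Graph subclass (the graph body is illustrative):

    import oneflow as flow

    class StraightenedGraph(flow.nn.Graph):
        def __init__(self):
            super().__init__()
            # Choose one of "MemoryFirst" / "SpeedFirst" / "Disable" / "OverlapCpuGpu".
            self.config.enable_straighten_algorithm("MemoryFirst")

        def build(self, x):
            return x * 2  # placeholder computation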

  • Graph provides generalized basic transmission, using nccl send/recv to realize fast communication for any NdSbp (2d, 3d, ...), thus minimizing the transmission volume. (#8437, #8783)

  • With autograd.Function, Graph can now use custom ops (#8843).
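
    • A minimal sketch of a custom op defined with autograd.Function, assuming the PyTorch-aligned Function interface (the op itself is illustrative):

    import oneflow as flow

    class MySquare(flow.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            # Save the input for use in backward.
            ctx.save_for_backward(x)
            return x * x

        @staticmethod
        def backward(ctx, grad_output):
            (x,) = ctx.saved_tensors
            return 2 * x * grad_output

    x = flow.randn(4, requires_grad=True)
    y = MySquare.apply(x)  # the custom op can then be used in a Graph's build()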

  • The Graph Optimizer supports configuring the learning rate for the parameters of each module/layer through param_group["lr_scale"]. (#9138)
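
    • A sketch of scaling the learning rate of one parameter group, assuming lr_scale multiplies the optimizer's base learning rate (the model and values are illustrative):

    import oneflow as flow

    model = flow.nn.Linear(8, 8)
    # "lr_scale" scales the base learning rate for this parameter group.
    optimizer = flow.optim.SGD([{"params": model.parameters(), "lr_scale": 0.1}], lr=1.0)

    class TrainGraph(flow.nn.Graph):
        def __init__(self):
            super().__init__()
            self.model = model
            self.add_optimizer(optimizer)

        def build(self, x):
            loss = self.model(x).sum()
            loss.backward()
            return loss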

  • Adds the enable_multi_tensor_update optimization. Enabled by self.config.enable_multi_tensor_update(True), it reduces the overhead of updating models with many small, fragmented parameters. (#9209, #9252)

  • Adds the enable_fused_model_update_cast optimization. Enabled by self.config.enable_fused_model_update_cast(True), it speeds up training by fusing the Optimizer step with the fp16 cast when AMP is on. (#9209)
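
    • Both switches are set in a Graph's __init__, for example (a minimal sketch; the training body is omitted):

    import oneflow as flow

    class FusedTrainGraph(flow.nn.Graph):
        def __init__(self):
            super().__init__()
            # Batch many small parameter updates into fewer kernels.
            self.config.enable_multi_tensor_update(True)
            # Fuse the optimizer update with the fp16 cast under AMP.
            self.config.enable_fused_model_update_cast(True)

        def build(self, x):
            pass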

  • Graph supports non-uniform segmentation under ND-SBP. (#9310)

  • Graph supports LazyTensor's indexing feature. (#9334)
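
    • For example, indexing can now be used on a LazyTensor inside build() (a minimal sketch):

    import oneflow as flow

    class IndexGraph(flow.nn.Graph):
        def __init__(self):
            super().__init__()

        def build(self, x):
            # Index the LazyTensor inside the graph.
            return x[:, 0:2]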

  • Adds the enable_compress_memory interface. Enabled by self.config.enable_compress_memory(True), it tries to optimize memory, iterating on the computation graph's device memory within half an hour to approach a minimum close to the lower bound. (#9509)

  • Adds oneflow.utils.global_view.global_mode, which supports smooth migration from single-GPU code to multi-GPU code. global_mode creates a global context that can be switched on and off; under the context it sets the default placement and sbp, and it supports common LocalTensor syntax such as Tensor.device and Tensor.to(device). Source ops created in this context automatically produce GlobalTensors populated with the default placement and sbp. This context enables the local-tensor logic inside a module to be converted to global logic in a non-invasive manner.

    • Here is an example:

      import oneflow as flow
      from oneflow.utils.global_view import global_mode

      P_C = flow.placement("cpu", ranks=[0, 1])
      P = flow.placement("cuda", ranks=[0, 1])
      B = flow.sbp.broadcast
      S0 = flow.sbp.split(0)
      x = flow.ones((6, 8), placement=P_C, sbp=S0)
      # A data-parallel linear layer, defined here so the example is self-contained.
      linear_dp = flow.nn.Linear(8, 4).to_global(placement=P, sbp=B)

      with global_mode(True, placement=P, sbp=B):
          device = linear_dp.weight.device
          x = x.to(device)  # move the global tensor to the device
          out = linear_dp(x)

          # The local tensor will be converted to global
          sample = flow.randn(out.shape, device="cpu").to(device)

Debug

  • Provides comprehensive memory analysis logs V2.0 (#8565)

    • Setting the environment variable GLOG_v=3 enables the full memory analysis log in oneflow.INFO.

    • Adds the shape, dtype, life cycle, and allocation/release order of all tensors in each memory block (Chunk, MemBlock), which helps to quickly determine whether the tensors that dominate memory usage in each block are behaving as expected.

    • The checkpointing pass provides a log recording which tensors are checkpointed.

  • Adds time_util to record each module's execution time, actual physical memory occupied, and virtual memory occupied. (https://github.com/Oneflow-Inc/oneflow/pull/9164, https://github.com/Oneflow-Inc/oneflow/pull/9245)

  • Graph displays a compilation progress bar while the computation graph on rank 0 is being compiled, when debug(0) is enabled and the environment variable ONEFLOW_NNGRAPH_ENABLE_PROGRESS_BAR=1 is set. (#9537)

  • The default log directory is removed (the directory is no longer created or written to by default). The log directory and its print logs are generated only when ONEFLOW_DEBUG_MODE=1 is set. (#9552, #9575)

Eager

  • Adds the map_location parameter to oneflow.load, supporting specifying the placement or device for the loaded model tensors. (#8666)
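
    • A minimal sketch (the checkpoint directory is hypothetical):

      import oneflow as flow

      # Load a checkpoint onto a specific device.
      state_dict = flow.load("model_dir", map_location=flow.device("cuda:0"))

      # Or onto a placement, for global tensors.
      P = flow.placement("cuda", ranks=[0, 1])
      state_dict = flow.load("model_dir", map_location=P)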

  • Adds oneflow.async.thread, allowing users to create a new thread for asynchronous programming. (#8866, #9039, #9270)

  • oneflow.save supports saving ddp Module objects directly. (#8856)
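
    • A minimal sketch, assuming the DistributedDataParallel wrapper from oneflow.nn.parallel (the model and save directory are illustrative):

      import oneflow as flow
      from oneflow.nn.parallel import DistributedDataParallel as ddp

      model = flow.nn.Linear(4, 4).to("cuda")
      ddp_model = ddp(model)
      flow.save(ddp_model, "ddp_model_dir")  # save the ddp-wrapped module directly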

  • Adds oneflow.utils.checkpoint to support checkpointing (activation recomputation) optimization under eager. (#9053)
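
    • A minimal sketch, assuming a torch.utils.checkpoint-aligned interface:

      import oneflow as flow
      from oneflow.utils.checkpoint import checkpoint

      layer = flow.nn.Linear(8, 8)
      x = flow.randn(2, 8, requires_grad=True)
      # Activations of `layer` are recomputed during backward instead of being stored.
      y = checkpoint(layer, x)
      y.sum().backward()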

  • With the newly added oneflow.mock_torch module and mock method, one-click migration to OneFlow can be achieved without changing the original import torch scripts. The benefit is that you only need to add one line instead of modifying the imports in every file one by one (#9160, #9256, #9442, #9473). You can use it with the following code:

      import torch
      from oneflow.mock_torch import mock
      mock()
      # torch code
      # ...
    • Supports mocking within a scope, for example:

      import torch
      from oneflow.mock_torch import mock
      with mock.enable():
          # torch code
          # ...
  • Supports visual debugging of autograd's backward graph: when the ONEFLOW_DEBUG_MODE=1 environment variable is enabled, each backward computation writes its AutogradEngine execution graph to a dot file in the log directory. The generated graph shows the operators of the backward execution and their topology, which provides an easy way for algorithm and R&D personnel to debug backward problems. (#9412)