OneFlow


CUDA 11.2

Both the stable and nightly versions of OneFlow now support the CUDA 11.2 platform (cu112).

The ONNX module moves to a standalone repository

The ONNX module is now maintained in the new repository https://github.com/Oneflow-Inc/oneflow_convert_tools, and the ONNX-related code in the main OneFlow repository will be removed in the next version. For details, see the article [How the deep learning framework OneFlow interacts with ONNX](https://mp.weixin.qq.com/s/WEwqr_qkC8ZlIkl49upObA). oneflow_convert_tools currently targets OneFlow's lazy mode; its latest version is v0.3.2, and versions targeting eager mode will be numbered starting from 0.4.0.

"下集预告"
在下一个版本的 OneFlow 中,将包含更全面的 PyTorch 兼容,包括更多更丰富的接口支持以及多 GPU 支持。同时,下个版本的 OneFlow 也将支持动静图转换的功能。敬请期待!

**Tensor.to_local** returns the local tensor of a **Global Tensor** on the current rank:

```python
>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0],
...         placement=flow.placement("cuda", ranks=[0, 1]),
...         sbp=flow.sbp.split(0))
>>> y = x.to_local()
>>> y.size()
oneflow.Size([1])
>>> y
tensor([1.], device='cuda:0', dtype=oneflow.float32) if rank is 0
tensor([2.], device='cuda:0', dtype=oneflow.float32) if rank is 1
```

Supporting redistribution of Global Tensor in clusters

With the **Tensor.to_global** interface, you can redistribute the data of a **Global Tensor** across a cluster. The data can be distributed to another set of nodes, and the way it is distributed over those nodes can also be changed (i.e., the SBP can be changed). Redistribution usually involves inter-process data communication, but the **Tensor.to_global** interface hides the complicated low-level communication details.

```python
>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0], placement=flow.placement("cuda", ranks=[0, 1]), sbp=flow.sbp.split(0))
>>> y = x.to_global(placement=flow.placement("cuda", ranks=[2, 3]), sbp=flow.sbp.broadcast)
```


Each OneFlow operator defines a set of SBP signatures for its input and output tensors. **Global Tensor** supports automatic redistribution to provide the SBP signature required by an operator, as shown in the code below:

```python
>>> import oneflow as flow
>>> x = flow.randn(4, 4,
...         placement=flow.placement("cuda", ranks=[0, 1]),
...         sbp=flow.sbp.split(0))
>>> y = flow.randn(4, 4,
...         placement=flow.placement("cuda", ranks=[0, 1]),
...         sbp=flow.sbp.split(1))
>>> z = x + y
```


When `x + y` is executed, since x is split along dimension `0` while y is split along dimension `1`, their local tensors on each device cannot be added directly. Therefore, x's SBP is automatically converted to `flow.sbp.split(1)` or y's SBP is converted to `flow.sbp.split(0)`, and the SBP of the result z is `flow.sbp.split(1)` or `flow.sbp.split(0)` accordingly.

Notes

- Global Tensor currently doesn't support mixing with the DDP interface.

- Global Tensor requires all devices to execute simultaneously; code with branches can lead to process deadlock because of divergent execution paths. We will continue fixing this problem.

2. Continued improvement of nn.Graph's features

In 0.9.0, compared with 0.8.0, pipeline stages in Graph can be configured with the new `set_stage` interface:

```python
class Graph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m_stage0 = flow.nn.Linear(8, 8, False).to_global(placement=P_0, sbp=B)
        self.m_stage1 = flow.nn.Linear(8, 8, False).to_global(placement=P_1, sbp=B)
        # set_stage(stage_id, placement)
        # The stage ID is numbered starting from 0 and increases by 1.
        # The placement covers all tensors of this module.
        self.m_stage0.config.set_stage(stage_id=0, placement=P_0)
        self.m_stage1.config.set_stage(stage_id=1, placement=P_1)
        self.config.set_gradient_accumulation_steps(4)

    def build(self, x):
        # tensor.to_global(placement) is applied automatically to every input
        # tensor of a module, so there is no need to call to_global() inside
        # or outside the module's forward function.
        y = self.m_stage0(x)
        z = self.m_stage1(y)
        return z
```
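
For reference, a minimal sketch of the symbols the snippet above assumes (`P_0`, `P_1`, and `B` are not defined in it):

```python
import oneflow as flow

# Hypothetical definitions for the placements and SBP used above.
P_0 = flow.placement("cuda", ranks=[0])  # stage 0 on GPU 0
P_1 = flow.placement("cuda", ranks=[1])  # stage 1 on GPU 1
B = flow.sbp.broadcast                   # replicate module parameters
```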



New Features


Graph

- Added new interfaces: `oneflow.env.init_rdma` and `oneflow.env.rdma_is_initialized` to delay turning on RDMA, thus accelerating network communication across multiple nodes (note: avoid calling fork() after RDMA has been turned on; for example, a DataLoader with `num_workers > 1` should start its workers before `init_rdma`). https://github.com/Oneflow-Inc/oneflow/pull/8415
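
  A minimal usage sketch based on the two interfaces above:

```python
import oneflow as flow

# Create DataLoader workers (which fork) before turning on RDMA, e.g.:
# loader = flow.utils.data.DataLoader(dataset, num_workers=4)

if not flow.env.rdma_is_initialized():
    flow.env.init_rdma()  # do not fork() after this point
```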

- Graph provided a new algorithm optimization interface: `graph.config.enable_straighten_algorithm`, which optimizes the execution order in the computation graph to maximize the overlap between data transfer and computation. With this interface, data transfer speed rises by 0.6% in data parallelism mode and 6% in model parallelism mode. (https://github.com/Oneflow-Inc/oneflow/pull/8347, https://github.com/Oneflow-Inc/oneflow/pull/8483, https://github.com/Oneflow-Inc/oneflow/pull/8495)

- Optimized the implementation of clip grad in Graph to support `clip_grad_max_norm > 1.0` and provided a configurable `clip_grad_norm_type`, which previously could only be `2` but can now be `+/- inf`, `+/- 1`, `+/- 2`, `+/- 3`, and larger p-norm values. See the reference [here](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html). (https://github.com/Oneflow-Inc/oneflow/pull/7548)

- Global tensor in Graph supported the `tensor.set_item` operation for invariable ops, for example, `mask[:, :len_keep] = 0` (https://github.com/Oneflow-Inc/oneflow/pull/7751)

- Graph exported the `build_graph` and `compile_and_init_runtime` interfaces, allowing user-defined passes to be run after building the graph to rewrite and optimize it. The two interfaces also allow Graph to restore an external graph (job). (https://github.com/Oneflow-Inc/oneflow/pull/8168)

- Added the `RegisterJobPass` interface to support self-defined external job passes that rewrite the job graph. (https://github.com/Oneflow-Inc/oneflow/pull/8370)

- `oneflow.boxing.nccl.enable_use_compute_stream(True)` optimized support for NCCL logical kernels:

  - Added a noncontiguous ReduceScatter kernel to support the conversion `P -> S(i), (i > 0)` (https://github.com/Oneflow-Inc/oneflow/pull/8361)

  - Supported the conversion `B -> S` (https://github.com/Oneflow-Inc/oneflow/pull/8355)

  - Enabled nccl send/recv primitives to support special SBP conversions (https://github.com/Oneflow-Inc/oneflow/pull/8318)

- Added the efficient fused kernel `oneflow.nn.FusedMLP`, which is controlled by `export ONEFLOW_FUNCTOR_DISABLE_FUSED_MLP=0` (https://github.com/Oneflow-Inc/oneflow/pull/7391, https://github.com/Oneflow-Inc/oneflow/pull/8165, https://github.com/Oneflow-Inc/oneflow/pull/8217, https://github.com/Oneflow-Inc/oneflow/pull/8413)
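
  A minimal sketch of the fused MLP module; the constructor arguments are an assumption based on a standard in/hidden/out MLP interface, and the sizes are illustrative:

```python
import oneflow as flow

# Assumed constructor: input size, hidden layer sizes, output size.
mlp = flow.nn.FusedMLP(in_features=128, hidden_features=[256, 256], out_features=10).to("cuda")
x = flow.randn(32, 128, device="cuda")
y = mlp(x)  # fused matmul + bias + activation chain on GPU
```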


Debug

- `Graph.debug` offered a new parameter `max_stack_depth` (default: 2) to set the maximal depth of the Python stack recorded for each op in Graph, making it convenient to locate the Python context of each op. (https://github.com/Oneflow-Inc/oneflow/pull/8028)

- Apart from printing the input/output/variable info of modules in Graph, it now also supports printing info of operators constructed in module forward. (https://github.com/Oneflow-Inc/oneflow/pull/8135)

- Enabled `export ONEFLOW_DEBUG_MODE=true` and `export GLOG_v=3` to print the full memory log, which contains multi-level MemBlock info on each device (Total Memory -> Chunk -> MemBlock), Blocks with exclusive memory, Eager Variables, and other information. Besides, a lifecycle label was added to Regst to analyze each tensor's memory lifecycle.

- LightPlan provided a more simplified way to display the Actor graph, cutting the cost of debugging based on Plan. When `ONEFLOW_DEBUG_MODE=true`, a series of light plan files, one for each rank of a Graph, is generated under the `log/local_rank_0/machine/` directory, containing the simplified actor sub-graph of each rank; the filename is `GraphName_rank_i_light_plan`. (https://github.com/Oneflow-Inc/oneflow/pull/8396)

- The `print graph` method allowed displaying the logic graph by Module, making debugging during graph construction more efficient. (https://github.com/Oneflow-Inc/oneflow/pull/8131)


Eager

- Supported passing extra parameters when an Optimizer ParamGroup is built, meeting special operation demands for LrScheduler, as shown below. (https://github.com/Oneflow-Inc/oneflow/pull/7753)


```python
param_groups = [{"params": [model.parameters()], "excess_param": ...}]
optim = flow.optim.Adam(param_groups, lr=0.1)
```


- Added the `oneflow.cuda.current_device` interface to return the device index of the current rank (https://github.com/Oneflow-Inc/oneflow/pull/7856)

- Added the `oneflow.utils.from_torch` interface to convert a PyTorch Tensor into a OneFlow Tensor; see the sketch after this list (https://github.com/Oneflow-Inc/oneflow/pull/7851)

- Added the `oneflow.utils.to_torch` interface to convert a OneFlow Tensor into a PyTorch Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7851)

- Added the `oneflow.cuda.empty_cache` interface to manually release memory (https://github.com/Oneflow-Inc/oneflow/pull/8482)

- Added the `oneflow.roc_auc_score` interface on CPU, which is equivalent to `sklearn.metrics.roc_auc_score` (https://github.com/Oneflow-Inc/oneflow/pull/7951)
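
A minimal interop sketch for the `from_torch`/`to_torch` interfaces above, assuming both torch and oneflow are installed:

```python
import torch
import oneflow as flow

t = torch.randn(2, 3)
x = flow.utils.from_torch(t)  # PyTorch tensor -> OneFlow tensor
t2 = flow.utils.to_torch(x)   # OneFlow tensor -> PyTorch tensor
```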


Tensor

- Provided the `Tensor.contiguous_` interface, the inplace version of the contiguous operation (https://github.com/Oneflow-Inc/oneflow/pull/8275)

- Added the `Tensor.local_to_global` and `Tensor.global_to_global` interfaces, which apply different default meta-checking behaviors (https://github.com/Oneflow-Inc/oneflow/pull/8027)

- Global Tensor's Slice/SliceUpdate supported all nd_sbp inputs, and SliceUpdate fully supported the inplace operation and backpropagation (https://github.com/Oneflow-Inc/oneflow/pull/8313, https://github.com/Oneflow-Inc/oneflow/pull/8337, https://github.com/Oneflow-Inc/oneflow/pull/8344, https://github.com/Oneflow-Inc/oneflow/pull/8416)


Global Boxing

- Eager Global Tensor supported balanced-split nd-SBP eager boxing (https://github.com/Oneflow-Inc/oneflow/pull/7768)

- Supported executing Eager Slice Boxing on arbitrary devices, including non-CPU and non-CUDA-capable devices (https://github.com/Oneflow-Inc/oneflow/pull/8180)


OneEmbedding

For better recommendations, modern recommendation systems rely on huge embedding tables, and frequent iterations of user data require model training to be fast enough.

OneEmbedding is a component designed for large-scale recommendation systems: it is efficient, extensible, and highly flexible. Its advantages are:

1. Hierarchical storage and dynamic capacity expansion: users can expand the capacity of the Embedding at much lower cost.

2. Mixed parallelism strategy: it supports easily extending the model to train it on multi-machine multi-GPU.

3. Embedding quantization for better communication: in the parallel scenario, communication data can be quantized to reduce the communication amount, thus accelerating the training.

4. Efficient data pipeline: the model parts that have no data dependency can be executed in advance, thus overlapping with other operations in time.

5. Automatic mixed precision training: data can be computed in FP16 to reduce the occupied memory, thus accelerating the training speed and ensuring high model convergence precision.

6. A collection of efficient CUDA ops for common operations in recommendation systems is available.

7. Flexible model building is supported.

See the OneEmbedding API documentation [here](https://oneflow.readthedocs.io/en/master/one_embedding.html). A brief construction sketch follows.
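
A minimal construction sketch, assuming the `MultiTableEmbedding` API from the linked documentation; all sizes and the path are illustrative:

```python
import oneflow as flow

# One embedding table with a uniform initializer, stored in device memory.
tables = [
    flow.one_embedding.make_table(
        flow.one_embedding.make_uniform_initializer(low=-0.05, high=0.05)
    )
]
store_options = flow.one_embedding.make_device_mem_store_options(
    persistent_path="/tmp/oneflow_embedding",  # illustrative path
    capacity=10000,                            # number of embedding rows
)
embedding = flow.one_embedding.MultiTableEmbedding(
    "sparse_embedding",
    embedding_dim=16,
    dtype=flow.float,
    key_type=flow.int64,
    tables=tables,
    store_options=store_options,
)
```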



PyTorch Compatibility


A collection of new functionalities and interfaces compatible with PyTorch 1.10.0 has been added.

Tensor

- Added the `Tensor.pin_memory` functionality to move a tensor into pinned (page-locked) memory. (https://github.com/Oneflow-Inc/oneflow/pull/8073)

- Supported passing the `pin_memory` parameter when the tensor is being created. (https://github.com/Oneflow-Inc/oneflow/pull/8176)

- DataLoader supported `pin_memory` (https://github.com/Oneflow-Inc/oneflow/pull/8214)

- Added the `Tensor.is_pinned` attribute (https://github.com/Oneflow-Inc/oneflow/pull/8447)

- Added the `~Tensor` (invert) method to perform a logical NOT on each element of a tensor with bool dtype. (https://github.com/Oneflow-Inc/oneflow/pull/7899)

- Added the `Tensor.log2` method to compute log<sub>2</sub> element-wise. (https://github.com/Oneflow-Inc/oneflow/pull/7906)

- Added the `Tensor.new_zeros` method to generate a new tensor filled with zeros. (https://github.com/Oneflow-Inc/oneflow/pull/7937)

- Added the `oneflow.as_tensor` interface to convert the input data into a tensor that shares data with it when possible. (https://github.com/Oneflow-Inc/oneflow/pull/7855)

- Added the `Tensor.__array__` method, so that `np.array` can take a OneFlow tensor as input and construct an `np.ndarray` object. (https://github.com/Oneflow-Inc/oneflow/pull/7970)

- Added the `Tensor.new_tensor` method to copy the input data to generate a new tensor. (https://github.com/Oneflow-Inc/oneflow/pull/7973)

- Added the `Tensor.half` method, which is equivalent to `tensor.to(oneflow.float16)`. (https://github.com/Oneflow-Inc/oneflow/pull/7971)

- Added the `Tensor.byte` method to generate a new uint8 tensor, and `tensor.byte()` is equivalent to `tensor.to(oneflow.uint8)`. (https://github.com/Oneflow-Inc/oneflow/pull/8053)

- Added the `Tensor.view_as` and `Tensor.new_empty` methods (https://github.com/Oneflow-Inc/oneflow/pull/8077)

- Added the `Tensor.type` method to implement the corresponding casts, and added the `oneflow(.cuda).{Byte, Char, Short, Int, Long, Half, Float, Double}Tensor` objects (https://github.com/Oneflow-Inc/oneflow/pull/8129)

- Added the `Tensor.dot` method to compute the dot product of two 1D tensors, and this method is equivalent to `oneflow.dot`. (https://github.com/Oneflow-Inc/oneflow/pull/8520)

- Added the `oneflow.nn.init.orthogonal_` interface to initialize tensors (https://github.com/Oneflow-Inc/oneflow/pull/8009)
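
A combined sketch exercising several of the tensor methods above (values are illustrative):

```python
import oneflow as flow

x = flow.tensor([1, 2, 3], dtype=flow.int64, pin_memory=True)  # created in pinned memory
mask = ~(x > 1)               # logical NOT on a bool tensor
h = x.half()                  # same as x.to(flow.float16)
z = x.new_zeros((2, 2))       # zeros with x's dtype and device
t = flow.as_tensor([4, 5])    # shares data with the input where possible
```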


Operators

- Added the `oneflow.nn.Softshrink` op (https://github.com/Oneflow-Inc/oneflow/pull/7826)

- Added the `oneflow.nn.Threshold` op (https://github.com/Oneflow-Inc/oneflow/pull/7875)

- Added the `oneflow.nn.Hardshrink` activation function (https://github.com/Oneflow-Inc/oneflow/pull/7887)

- Added the `oneflow.isnan` and `oneflow.isinf` interfaces to determine whether each element of a tensor is NaN or Inf (https://github.com/Oneflow-Inc/oneflow/pull/7943)

- The `oneflow.nn.functional.*` interfaces supported passing NumPy scalar parameters (https://github.com/Oneflow-Inc/oneflow/pull/7935)

- Added the `oneflow.nn.functional.cosine_similarity` op to calculate the cosine similarity of two tensors (https://github.com/Oneflow-Inc/oneflow/pull/8119)

- Added the `oneflow.nn.functional.conv_transpose1d`, `oneflow.nn.functional.conv_transpose2d`, and `oneflow.nn.functional.conv_transpose3d` ops (https://github.com/Oneflow-Inc/oneflow/pull/7991)

- Added the `oneflow.unbind` interface to return a tuple of all slices along a given dimension (https://github.com/Oneflow-Inc/oneflow/pull/7730)

- Added the `oneflow.swapdims` interface to swap two specified dimensions; it is equivalent to NumPy's `swapaxes`. (https://github.com/Oneflow-Inc/oneflow/pull/7659)

- Added the `oneflow.addcmul` op to execute the element-wise composite function `out = input + value × tensor1 × tensor2` (https://github.com/Oneflow-Inc/oneflow/pull/7282)

- Added the `oneflow.searchsorted` op (https://github.com/Oneflow-Inc/oneflow/pull/7949)

- Added the `oneflow.mm` op (https://github.com/Oneflow-Inc/oneflow/pull/8440)

- Added the `oneflow.tensordot` interface and offered a collection of cases of equivalent transformation operations (https://github.com/Oneflow-Inc/oneflow/pull/7968)

- Added the `oneflow.repeat_interleave` op to repeat the elements of the tensor, and this op is equivalent to `numpy.repeat` (https://github.com/Oneflow-Inc/oneflow/pull/8324)

- Added the `oneflow.amax` and `Tensor.amax` methods (https://github.com/Oneflow-Inc/oneflow/pull/7996)

- Added the `oneflow.median` and `Tensor.median` methods (https://github.com/Oneflow-Inc/oneflow/pull/8069)

- Added the `oneflow.normal` method and fixed the `Tensor.normal` method (https://github.com/Oneflow-Inc/oneflow/pull/7956)

- Added the `oneflow.amin` and `Tensor.amin` methods (https://github.com/Oneflow-Inc/oneflow/pull/8042)

- Added the `oneflow.mv` op and `Tensor.mv` method (https://github.com/Oneflow-Inc/oneflow/pull/8445)
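
A short sketch exercising a few of the ops above:

```python
import oneflow as flow

a, b = flow.randn(3), flow.randn(3)
sim = flow.nn.functional.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))
d = flow.dot(a, b)                                  # dot product of 1-D tensors
m = flow.mm(flow.randn(2, 3), flow.randn(3, 4))     # 2x4 matrix product
rows = flow.unbind(m, dim=0)                        # tuple of row slices
r = flow.repeat_interleave(flow.tensor([1, 2]), 2)  # tensor([1, 1, 2, 2])
```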

Random

- Added new interfaces: `oneflow.cuda.manual_seed`, `oneflow.cuda.manual_seed_all`, `oneflow.seed`, `oneflow.manual_seed`, `oneflow.initial_seed`, `oneflow.get_rng_state`, `oneflow.set_rng_state`, and improved the configuration of OneFlow random seed initialization. (https://github.com/Oneflow-Inc/oneflow/pull/7957)
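
A small reproducibility sketch using the seeding interfaces above, assuming PyTorch-aligned semantics:

```python
import oneflow as flow

flow.manual_seed(42)          # seed the global generator
state = flow.get_rng_state()  # capture the generator state
a = flow.randn(2)
flow.set_rng_state(state)     # restore the state
b = flow.randn(2)             # b reproduces a
```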

AutoGrad

- Added new interfaces: `oneflow.set_grad_enabled` and `oneflow.enable_grad` to enable or disable automatic gradient computation for certain subgraphs. (https://github.com/Oneflow-Inc/oneflow/pull/8016)

- Supported the case where the dtype of the upstream gradient of an autograd backward operator differs from that of the input. (https://github.com/Oneflow-Inc/oneflow/pull/8233, https://github.com/Oneflow-Inc/oneflow/pull/8309)

- Supported running a backward operator that does not capture any tensor multiple times. (https://github.com/Oneflow-Inc/oneflow/pull/8031)
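
A minimal sketch of the grad-mode interfaces above, assuming PyTorch-aligned context-manager semantics:

```python
import oneflow as flow

x = flow.ones(2, requires_grad=True)
with flow.set_grad_enabled(False):  # disable grad tracking in this scope
    y = x * 2
assert not y.requires_grad

with flow.enable_grad():            # grad tracking on inside this scope
    z = x * 3
assert z.requires_grad
```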

CUDA

- Added APIs for `oneflow.cuda.set_device` and `oneflow.cuda.synchronize`. (https://github.com/Oneflow-Inc/oneflow/pull/8322)

RNN

- Refactored the RNN modules and moved the Python-level layer splicing implementation into C++, which greatly optimized performance. Added RNNCell-related modules and modules aligned in functionality with `torch.nn.utils.rnn`; see the sketch after this list:

- Refactored modules: `RNN`, `LSTM`, and `GRU`
- Added modules: `RNNCell`, `LSTMCell`, `GRUCell`, and `oneflow.nn.utils.rnn`
- Supported and fixed RNN unit tests of local and global, and completed documentation.
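
A minimal LSTM sketch using the refactored modules:

```python
import oneflow as flow
import oneflow.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2)
x = flow.randn(5, 3, 10)   # (seq_len, batch, input_size)
out, (h, c) = lstm(x)      # out: (5, 3, 20)
```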


Device

Supported heterogeneous device types: To cope with the complexity of different hardware, OneFlow, following the dependency inversion principle in software engineering, has introduced a hardware abstraction layer called **Execution Provider (EP)**. The hardware abstraction layer consists of a series of interfaces abstracted from the capabilities the framework requires of hardware devices at runtime. Once the hardware abstraction layer is in place, each module calls the interfaces it provides rather than the raw hardware interfaces, so modules need not concern themselves with the specific details of the underlying hardware. When a new hardware device is introduced, the abstraction interfaces stay unchanged, so all modules can adapt to the new device without any modification. Likewise, when adapting new hardware to the framework, there is no need to understand the framework's implementation details: one only needs to implement the series of interfaces according to the contract of the hardware abstraction interface and the actual capabilities of the device, and the hardware adaptation is complete.

Execution Provider has defined a collection of runtime interfaces: device registration interface, device management interface, queue management interface, event management interface, and memory management interface.

Primitive

In addition to the runtime interfaces, the Execution Provider defines a set of computing interfaces called Primitive, which describe computations commonly used in a deep learning framework and thereby simplify operator development during hardware adaptation. Compared with the runtime interfaces, the Primitive interfaces are looser and more flexible: they are mutually independent, and each represents a specific computing capability of a hardware device. As with the runtime interfaces, the Primitive abstractions are close to the device side, so developers can adapt hardware without an in-depth understanding of OneFlow's mechanisms. Whereas all runtime interfaces must be implemented when adapting the Execution Provider, developers can adapt the Primitive interfaces selectively according to the actual situation of the project.

- Added unit test of `ep::primitive` basic function (https://github.com/Oneflow-Inc/oneflow/pull/8099)

- Added `ep::primitive::constant_pad`, optimized performance, removed the obsolete pad grad, and used pad as the inverse of pad (https://github.com/Oneflow-Inc/oneflow/pull/8152)

- Used unary primitive interface instead of original implementation in Kernel (https://github.com/Oneflow-Inc/oneflow/pull/8270)

- Added the environment variable `ONEFLOW_EP_CUDA_CUBLAS_WORKSPACE_SIZE_MB` to configure the cuBLAS workspace size (https://github.com/Oneflow-Inc/oneflow/pull/8478)

- Scalar logical kernel supported primitives (https://github.com/Oneflow-Inc/oneflow/pull/8531)

- Used primitives to implement logical not kernel (https://github.com/Oneflow-Inc/oneflow/pull/8544)

- Migrated all activation kernels to use primitive (https://github.com/Oneflow-Inc/oneflow/pull/8300)

- Bias add kernel supported primitive (https://github.com/Oneflow-Inc/oneflow/pull/8512)

- Decoupled oneDNN from the `ep::primitive` CPU device and provided the environment variable `ONEFLOW_ENABLE_ONEDNN_OPTS` to enable oneDNN acceleration of the CPU primitive interfaces (https://github.com/Oneflow-Inc/oneflow/pull/8274)

Debug tools

- Saved the log independently for each rank to `log/local_rank_{i}` when launching multiple processes by launcher. (https://github.com/Oneflow-Inc/oneflow/pull/7825)

- Optimized the display of OF_PROFILER_RANGE_GUARD in nsys. (https://github.com/Oneflow-Inc/oneflow/pull/8121)


OneFlow-Profiler

OneFlow-Profiler is designed to collect various performance-related information during the execution of the framework. It can record the execution time of operators and system components, the allocation of memory and DRAM, and the input and parameter information of each operator. Developers can use this information to analyze which parts bring the most overhead and implement targeted optimizations.

- Added OneFlow-Profiler. (https://github.com/Oneflow-Inc/oneflow/pull/8047)

- Profiled the information of the CUDA operator. (https://github.com/Oneflow-Inc/oneflow/pull/8195)

- Profiled the bandwidth information of the operator. (https://github.com/Oneflow-Inc/oneflow/pull/8254)

- Added interfaces to collect bandwidth information and optimized code implementation. (https://github.com/Oneflow-Inc/oneflow/pull/8332)

- Refined Profiler. (https://github.com/Oneflow-Inc/oneflow/pull/8332)

- Used [Kineto](https://github.com/pytorch/kineto) and [CUPTI](https://docs.nvidia.com/cuda/cupti/index.html) to profile the information of CUDA operator. (https://github.com/Oneflow-Inc/oneflow/pull/8417)


Auto-Test

- When a value check fails, the values of the input tensors and Parameters are automatically printed, and the pseudo-code segment of the offending program is highlighted for debugging (https://github.com/Oneflow-Inc/oneflow/pull/8383)

AutoProf

AutoProf is a framework designed to compare the performance of OneFlow and PyTorch operators. It automatically tests operator performance under different numbers of CPU threads and on GPU and prints a comparison table. At present, it has been applied to the development of some existing operators and all new operators. Its effect is shown below:

<img width="1440" alt="image" src="https://user-images.githubusercontent.com/11607199/179392721-ae1d1f69-38cb-4894-92e7-bafdc06fa1c5.png">

- Added AutoProf, an automatic operator speed comparison framework, to automatically run ops and test: (https://github.com/Oneflow-Inc/oneflow/pull/8207)

  - The speed of OneFlow and PyTorch.

  - The speed of CPU/GPU kernels under different numbers of threads.

  - Total end-to-end time with CPU kernels.

- Optimized the display of AutoProf to save testing time. (https://github.com/Oneflow-Inc/oneflow/pull/8303)

- Supported API tests without actual kernel execution, in which case the measured time is end-to-end. (https://github.com/Oneflow-Inc/oneflow/pull/8320)

- Supported AutoProf to measure kernel bandwidth. (https://github.com/Oneflow-Inc/oneflow/pull/8367)


IR

- Added a pass to eliminate redundant Cast ops. (https://github.com/Oneflow-Inc/oneflow/pull/7837)

- Used MLIR to implement constant folding and the fused optimization of Conv and BN. (https://github.com/Oneflow-Inc/oneflow/pull/7799)

- Optimized constant folding in OneFlow C++ API. (https://github.com/Oneflow-Inc/oneflow/pull/8124)

- Provided fault tolerance checking for parsed module. (https://github.com/Oneflow-Inc/oneflow/pull/8299)

- Fixed a bug in the constant folding unit test. (https://github.com/Oneflow-Inc/oneflow/pull/8340)

- Supported IREE. (https://github.com/Oneflow-Inc/oneflow/pull/8249)

- Added `oneflow_iree(python)` to CI. (https://github.com/Oneflow-Inc/oneflow/pull/8431)

- Removed redundant output_lbns in IR. (https://github.com/Oneflow-Inc/oneflow/pull/8409)

- Provided a conversion marker for Variable -> constant. (https://github.com/Oneflow-Inc/oneflow/pull/8412)

- Removed hardcoded properties in IR. (https://github.com/Oneflow-Inc/oneflow/pull/8420)

- Implemented the AutoNHWC pass and provided the environment variable `ONEFLOW_MLIR_PREFER_NHWC`. It automatically converts common networks' data formats to channels-last, which brings a noticeable acceleration on NVIDIA GPUs that support FP16. (https://github.com/Oneflow-Inc/oneflow/pull/7890)


Performance

Graph

- Optimized the speed and memory of GPT and BERT under 3-D parallelism:

  - Performance optimization: the `fused_scale_mask_softmax` operator supported broadcast input; optimized the kernel implementation and performance of softmax for specific column counts (1024); completed the incomplete GetSbp list of the `fused_scale_mask_softmax` backward operator. (https://github.com/Oneflow-Inc/oneflow/pull/8321)

  - Communication optimization: optimized the SBP communication cost under `B -> S`, `B -> B`, `B -> P`. (https://github.com/Oneflow-Inc/oneflow/pull/8378)

  - Interface optimization: fixed the inefficient edge connections caused by misalignment between stage ids and to_global sequence dependencies when using pipeline stages. (https://github.com/Oneflow-Inc/oneflow/pull/8442)

  - Communication optimization: `nccl_use_compute_stream` supported more comprehensive SBP conversions like `P -> S(i)`. (https://github.com/Oneflow-Inc/oneflow/pull/8361)

  - Communication optimization: parallel use of RDMA communication. (https://github.com/Oneflow-Inc/oneflow/pull/8415)

  - Memory optimization: eliminated the randomness of the memory-reuse algorithm, so that ranks with identical subgraphs get identical memory-reuse results, with no pathological cases. (https://github.com/Oneflow-Inc/oneflow/pull/8441)

  - Memory optimization: removed the extra buffer for the Stage-0 CPU copy under pipeline parallelism. (https://github.com/Oneflow-Inc/oneflow/pull/8484)

  - Memory optimization: under checkpointing and pipelining, de-duplicated the input identities of modules to reduce extra checkpointing tensors, and added each module's block name prefix to its identity. (https://github.com/Oneflow-Inc/oneflow/pull/8509)

  - Combination optimization: ZeRO-DP supported use together with pipeline parallelism and 3-D parallelism. (https://github.com/Oneflow-Inc/oneflow/pull/8464)

  - Memory optimization: removed the extra identity tensor in the ZeRO optimization. (https://github.com/Oneflow-Inc/oneflow/pull/8407)

  - Provided new environment-variable optimization switches `ONEFLOW_ENABLE_MULTI_TENSOR_MODEL_UPDATE` and `ONEFLOW_FUSE_MODEL_UPDATE_CAST`: under AMP, they support fusing the Optimizer model-update kernel with the next round's forward cast operators. (https://github.com/Oneflow-Inc/oneflow/pull/8373)



Eager

- Enabled `export ONEFLOW_EAGER_LOCAL_TO_GLOBAL_BALANCED_OVERRIDE=true` to accelerate Eager Global execution by skipping the synchronization of meta information across the ranks of a Global Tensor (for use when users are confident that their code executes symmetrically, i.e. SPMD). (https://github.com/Oneflow-Inc/oneflow/pull/7981)

> This environment variable indicates that the input data has the same shape on every rank when `local_to_global` is executed. When set to true, there is no need to synchronize shapes across ranks, and the logical shape is computed locally.

- Used the Python C API instead of pybind11 to optimize the calling speed of the tensor and functional APIs.

- Optimized functional return types to save overhead and avoid reference copies, and fixed a bug where the inplace tensor id could be inconsistent. (https://github.com/Oneflow-Inc/oneflow/pull/7985)

- Moved the tensor API from pybind11 to the CPython API, added a tensor hash function, and resolved function naming conflicts. (https://github.com/Oneflow-Inc/oneflow/pull/8258, https://github.com/Oneflow-Inc/oneflow/pull/8315, https://github.com/Oneflow-Inc/oneflow/pull/8342, https://github.com/Oneflow-Inc/oneflow/pull/8375)

- Performance optimization: let VM worker threads concentrate on computing tasks by decoupling memory tasks from computing tasks. (https://github.com/Oneflow-Inc/oneflow/pull/7976)

- Optimized the speed of operations in DataLoader, including `MakeLocalTensorFromData`, which is 20% faster under the Swin-T dataloader. (https://github.com/Oneflow-Inc/oneflow/pull/8066)


Operators & Tensor

- Optimized global `sparse_softmax_cross_entropy` kernel. (https://github.com/Oneflow-Inc/oneflow/pull/7298)

- Optimized and sped up CPU `permute` kernel with OneDNN. (https://github.com/Oneflow-Inc/oneflow/pull/7872)

- Optimized and sped up CPU `softmax` kernel with OneDNN. (https://github.com/Oneflow-Inc/oneflow/pull/8071 , https://github.com/Oneflow-Inc/oneflow/pull/8075)

- Optimized the memory and speed of the pooling kernels' backward computation. (https://github.com/Oneflow-Inc/oneflow/pull/7980)

- Optimized Slice and Tensor getitem operations based on View to improve the speed of dataloader. (https://github.com/Oneflow-Inc/oneflow/pull/8148, https://github.com/Oneflow-Inc/oneflow/pull/8211, https://github.com/Oneflow-Inc/oneflow/pull/8243)

- Optimized the backward composition logic of `flip` and `cumsum` and removed some grad operators. Used random-value tests for grad diffs to increase test robustness. (https://github.com/Oneflow-Inc/oneflow/pull/8155)

- Optimized the memory usage of the `NormalizationAddReluGrad` operator and added a version that does not require addend_diff. (https://github.com/Oneflow-Inc/oneflow/pull/8213)

- Sped up `tensor.reshape` and `tensor.reshape_as` by moving their implementations from Python to C++. (https://github.com/Oneflow-Inc/oneflow/pull/8304)

- Converted `tensor.view`, `tensor.view_as`, `tensor.permute`, `tensor.transpose`, `tensor.contiguous_` from python implementation to c++ implementation. (https://github.com/Oneflow-Inc/oneflow/pull/8317)

- Greatly optimized the performance of `index_select` and `repeat_interleave` by using gather to replace dim gather. (https://github.com/Oneflow-Inc/oneflow/pull/8360)

- Optimized and removed temporary memory in cumprod cpu grad kernel. (https://github.com/Oneflow-Inc/oneflow/pull/8369)

- The `embedding` operator supported AMP, improved performance on the normal path, and fixed an out-of-bounds memory access in the gather CPU kernel. (https://github.com/Oneflow-Inc/oneflow/pull/8374)

- Optimized the performance of `Tensor.fill_`. (https://github.com/Oneflow-Inc/oneflow/pull/8283)

- Greatly optimized the backward performance of the broadcast element-wise binary family of operators. (https://github.com/Oneflow-Inc/oneflow/pull/8339)

- Added the fusion operator `BinaryCrossEntropyWithLogitsReduceMean`. (https://github.com/Oneflow-Inc/oneflow/pull/8476)

- Added high-performance fused matrix-multiplication kernels based on cublasLt. (https://github.com/Oneflow-Inc/oneflow/pull/8462, https://github.com/Oneflow-Inc/oneflow/pull/8222, https://github.com/Oneflow-Inc/oneflow/pull/8063)


Primitive

- Lowered the elementwise.cuh template's requirement for pointer alignment.

Improvements

Graph

- Exported the OneFlow env to Python and used Python objects to manage its lifecycle. (https://github.com/Oneflow-Inc/oneflow/pull/7792)

- Used Python's reference counting to control the life cycle of Graph and constructed strict and rich destruction test cases. (https://github.com/Oneflow-Inc/oneflow/pull/7857)

- Supported recycling independent threads that can no longer be reused when Graph is destructed. (https://github.com/Oneflow-Inc/oneflow/pull/7862)

- Changed the basic resource configuration from taking effect statically once to taking effect in real time. (https://github.com/Oneflow-Inc/oneflow/pull/8444)

- Consolidated the nccl_comms dynamically created by Graph's NCCL logical kernels into the runtime for initial creation, to avoid deadlocks caused by inconsistency between each rank's creation order and the eager nccl comm creation order. (https://github.com/Oneflow-Inc/oneflow/pull/8263)

- Refactor optimization: Merged `nn.graph.util.IONode` , `nn.graph.util.IONodeType` into IOArgs. (https://github.com/Oneflow-Inc/oneflow/pull/8272)

- Refactor optimization: Renamed the global singleton Global object to the Singleton object. (https://github.com/Oneflow-Inc/oneflow/pull/8490)

- Refactor optimization: Removed gpu_device_num (https://github.com/Oneflow-Inc/oneflow/pull/8516)

- Refactor optimization: Removed outdated AvailableMemDesc concepts. (https://github.com/Oneflow-Inc/oneflow/pull/8145)

- Refactor optimization: Removed outdated Model IO Kernel logic. (https://github.com/Oneflow-Inc/oneflow/pull/8151)

- Refactor optimization: Replaced GpuDeviceNum with the actual number of devices to avoid coupling with specific device types. (https://github.com/Oneflow-Inc/oneflow/pull/8166)

Eager

- Allocator GC can now be triggered manually on each stream from C++ (applicable to ZeRO). (https://github.com/Oneflow-Inc/oneflow/pull/8452)

- Based the execution of Eager VirtualMachine instructions on EP. (https://github.com/Oneflow-Inc/oneflow/pull/7923)

- Optimized and removed all redundant interfaces of `Get(Ptr)OrThrow`. (https://github.com/Oneflow-Inc/oneflow/pull/7812)

- Added the validity check of `flow.save(global_dst_rank)`. (https://github.com/Oneflow-Inc/oneflow/pull/7964)

- Supported the backward function node to run multiple times if it does not capture any tensor. (https://github.com/Oneflow-Inc/oneflow/pull/8031)

- Added the `ThreadLocalCached` decorator to clear caches in time and alleviate memory growth. (https://github.com/Oneflow-Inc/oneflow/pull/7858)

- Added C++14 implementations of `std::inclusive_scan`/`std::exclusive_scan`. (https://github.com/Oneflow-Inc/oneflow/pull/8128)

- Packaged the parameters required by the eager opkernel and passed them per thread to solve some thread-safety problems. (https://github.com/Oneflow-Inc/oneflow/pull/7617)

- Eager Stream supported kernel computation on pinned memory. (https://github.com/Oneflow-Inc/oneflow/pull/8486)

- Introduced a tool class for dim range checks, replacing and simplifying the various dimension-checking logic in Functors. (https://github.com/Oneflow-Inc/oneflow/pull/8382)

- Refactoring and optimization: removed the Blob object in EagerBlobObject, which led to redundant TensorView instructions. At the same time, to support ShapeView efficiently, the elem_cnt attribute was also removed. (https://github.com/Oneflow-Inc/oneflow/pull/7895)

- Refactoring and optimization: extracted the algorithm used by BinAllocator to share dynamic memory pools

- Refactoring and optimization: the `VectorAt` and `MapAt` functions uniformly pass parameters by reference, resolving the mixed use of reference and pointer interfaces. (https://github.com/Oneflow-Inc/oneflow/pull/8191)

- Refactoring and optimization: removed the cfg application on C++. (https://github.com/Oneflow-Inc/oneflow/pull/8158)

- Refactoring and optimization: removed the outdated code related to RemoteBlob in Single-Client. (https://github.com/Oneflow-Inc/oneflow/pull/8228)

- Refactoring and optimization: merged duplicate logic in eager boxing ccl and nccl boxing expr. (https://github.com/Oneflow-Inc/oneflow/pull/7930)

- Refactoring and optimization: removed cfg on Python and reduced the number of symbols to speed up linking during compilation.

- Refactoring and optimization: merged `symbol::IdCache` and `symbol::Storage`. (https://github.com/Oneflow-Inc/oneflow/pull/8331)

- Refactoring and optimization: introduced `llvm::SmallVector` and used `oneflow::small_vector` instead of `fixed_vector`. Besides, we optimized the implementation and usage of Shape and Stride. (https://github.com/Oneflow-Inc/oneflow/pull/8365, https://github.com/Oneflow-Inc/oneflow/pull/8402)

- Refactoring and optimization: refactored ShapeView and Shape to eliminate duplication and inconsistencies. (https://github.com/Oneflow-Inc/oneflow/pull/8422)

- Refactoring and optimization: Eager VirtualMachine decoupled InstructionType's dependency on StreamType. (https://github.com/Oneflow-Inc/oneflow/pull/7607)

- Refactoring and optimization: removed the InstructionMsg class and merged all its functions and fields into the Instruction class. (https://github.com/Oneflow-Inc/oneflow/pull/7623)


Operators & Tensor

- Stride support:

- Tensor, UserOp and UserKernel in `user_op::` all supported stride attribute. (https://github.com/Oneflow-Inc/oneflow/pull/7829)

- `cast` supported stride. (https://github.com/Oneflow-Inc/oneflow/pull/8292)

- View support and optimization:

- Added an input flag in op definitions to indicate whether non-contiguous input tensors are supported. Besides, the following non-contiguous view ops are now supported: `transpose`, `permute`, `narrow`, `expand`, `expand_as`, `split`, `chunk`, `unfold_tensor`, `movedim`, `as_strided`, `select`, `swapaxes`, `T`, `t`, `hsplit`, `vsplit`, `tensor_split`. (https://github.com/Oneflow-Inc/oneflow/pull/7813)

- Tensor slice used view operations by default.(https://github.com/Oneflow-Inc/oneflow/pull/8302)

- Automatically generated version status (Feature Stage) for OneFlow's API. (https://github.com/Oneflow-Inc/oneflow/pull/7945)

- Optimized CUDA memset to use the asynchronous `cudaMemsetAsync` (https://github.com/Oneflow-Inc/oneflow/pull/7763)

- `LeakyReLU` supported inplace optimization. (https://github.com/Oneflow-Inc/oneflow/pull/8060)

- Added the following parameters to `nn.Embedding` interface: `padding_idx`, `max_norm`, `norm_type`, `scale_grad_by_freq`. (https://github.com/Oneflow-Inc/oneflow/pull/8110)

- Aligned with PyTorch's `max_pool_1d`, `max_pool_2d`, `max_pool_3d`, `avg_pool_1d`, `avg_pool_2d`, `avg_pool_3d`, and distinguished them from the old pooling kernels, which align with TensorFlow. (https://github.com/Oneflow-Inc/oneflow/pull/8111)

- VectorAt supported passing in non-const references: `JUST(VectorAt(vec, 1)) = 5;`. (https://github.com/Oneflow-Inc/oneflow/pull/8013)

- Reduced the uncommon kernel template specializations of layer norm. (https://github.com/Oneflow-Inc/oneflow/pull/8209)

- Modified the logic of `Tensor.numpy` to avoid extra memory growth when saving the model. (https://github.com/Oneflow-Inc/oneflow/pull/8449)

- Tensor str supported printing nd sbp. (https://github.com/Oneflow-Inc/oneflow/pull/8458)

- Slice supported SBP inference (S -> P), and the semi-automatically deduced SBP can select the expected SBP among the reducible nd_sbp candidates. (https://github.com/Oneflow-Inc/oneflow/pull/8536)

- Non-CPU, non-CUDA tensors are now copied to the CPU before printing. (https://github.com/Oneflow-Inc/oneflow/pull/8548)

- Refactoring and optimization: decoupling user kernel and device tag. (https://github.com/Oneflow-Inc/oneflow/pull/8529)

- Refactoring and optimization: a series of kernels (`squeeze`, `reshape_like`, `flatten`, `expand_dims`, `reshape`, `amp_white_identity`, `identity`, `identity_buffer`, `parallel_cast`, `hierarchical_parallel_cast`, `hierarchical_parallel_cast_like`) were refactored into CopyDataContentKernel. (https://github.com/Oneflow-Inc/oneflow/pull/8537)

- Refactoring and optimization: removed obsolete `constant_pad1d` , `constant_pad2d` , `constant_pad3d` kernel. (https://github.com/Oneflow-Inc/oneflow/pull/8113)

- Refactoring and optimization: removed obsolete old lazy `upsample` kernel implementation.(https://github.com/Oneflow-Inc/oneflow/pull/8188)

- Refactoring and optimization: removed obsolete message in shape proto and used sequential to represent stride. (https://github.com/Oneflow-Inc/oneflow/pull/8220)

- Refactoring and optimization: removed the obsolete multiply kernel, which was included in `broadcast_mul`. (https://github.com/Oneflow-Inc/oneflow/pull/8359)

- Refactoring and optimization: renamed the shape interface in UserOp/Kernel to shape_view. (https://github.com/Oneflow-Inc/oneflow/pull/8433)

- Refactoring and optimization: removed oneflow gemm. (https://github.com/Oneflow-Inc/oneflow/pull/8499)

- Optimized the Maybe return type of such interfaces as Scalar.As(). (https://github.com/Oneflow-Inc/oneflow/pull/8348)

Device

- Code refactoring: `ep::CpuDevice`. (https://github.com/Oneflow-Inc/oneflow/pull/7911)

- Code refactoring: removed hard-coded special decision for device type like "cpu", "cuda" from system code. (https://github.com/Oneflow-Inc/oneflow/pull/8201)

- Removed all dnn-related interfaces from the old version of KernelUtil (Primitive will be used to replace those interfaces). (https://github.com/Oneflow-Inc/oneflow/pull/8141)

- Removed all interfaces related to mathematical calculation in the old version of KernelUtil (Primitive will be used to replace those interfaces). (https://github.com/Oneflow-Inc/oneflow/pull/8157)

- Removed the incomplete special-casing of the 'cuda' device type in scope util. (https://github.com/Oneflow-Inc/oneflow/pull/8173)

- Achieved delayed capture of CUDA Graphs. (https://github.com/Oneflow-Inc/oneflow/pull/8474)

- Code refactoring: removed cuda_event. (https://github.com/Oneflow-Inc/oneflow/pull/8493)

- Code refactoring: removed useless WITH_CUDA macro. (https://github.com/Oneflow-Inc/oneflow/pull/8562)


Tests

Eager Global Module Tests:

In 0.8.0, all kernels can now handle global tensors in distributed settings, and many known SBP-related bugs have been fixed. Global tensors work efficiently and correctly at the kernel level: no matter how the distributed topology changes, the same algorithm logic efficiently produces mathematically consistent results, which greatly reduces the trouble of verifying correctness in complex, diverse, and asymmetric distributed parallel training.


| module/functional op | PR |
| -------------------------------- | ------------------------------------------------------------ |
| abs | [Oneflow-Inc/oneflow7540](https://github.com/Oneflow-Inc/oneflow/pull/7540) |
| 0_dim_tensor | [Oneflow-Inc/oneflow7540](https://github.com/Oneflow-Inc/oneflow/pull/7540) |
| activation | [Oneflow-Inc/oneflow7540](https://github.com/Oneflow-Inc/oneflow/pull/7540) |
| adaptive_pool | [Oneflow-Inc/oneflow7563](https://github.com/Oneflow-Inc/oneflow/pull/7563) |
| addmm | [Oneflow-Inc/oneflow7565](https://github.com/Oneflow-Inc/oneflow/pull/7565) |
| add | [Oneflow-Inc/oneflow7204](https://github.com/Oneflow-Inc/oneflow/pull/7204) |
| affine_grid | [Oneflow-Inc/oneflow7578](https://github.com/Oneflow-Inc/oneflow/pull/7578) |
| arange | [Oneflow-Inc/oneflow7576](https://github.com/Oneflow-Inc/oneflow/pull/7576) |
| argmax | [Oneflow-Inc/oneflow7579](https://github.com/Oneflow-Inc/oneflow/pull/7579) |
| argmin | [Oneflow-Inc/oneflow7581](https://github.com/Oneflow-Inc/oneflow/pull/7581) |
| argsort | [Oneflow-Inc/oneflow7582](https://github.com/Oneflow-Inc/oneflow/pull/7582) |
| argwhere | [Oneflow-Inc/oneflow7584](https://github.com/Oneflow-Inc/oneflow/pull/7584) |
| avgpool | [Oneflow-Inc/oneflow7585](https://github.com/Oneflow-Inc/oneflow/pull/7585) |
| batch_gather | [Oneflow-Inc/oneflow7590](https://github.com/Oneflow-Inc/oneflow/pull/7590) |
| bernoulli | [Oneflow-Inc/oneflow7732](https://github.com/Oneflow-Inc/oneflow/pull/7732) |
| bmm | [Oneflow-Inc/oneflow7741](https://github.com/Oneflow-Inc/oneflow/pull/7741) |
| broadcast_like | [Oneflow-Inc/oneflow7742](https://github.com/Oneflow-Inc/oneflow/pull/7742) |
| cast | [Oneflow-Inc/oneflow7773](https://github.com/Oneflow-Inc/oneflow/pull/7773) |
| ceil | [Oneflow-Inc/oneflow7744](https://github.com/Oneflow-Inc/oneflow/pull/7744) |
| chunk | [Oneflow-Inc/oneflow7750](https://github.com/Oneflow-Inc/oneflow/pull/7750) |
| clamp | [Oneflow-Inc/oneflow7752](https://github.com/Oneflow-Inc/oneflow/pull/7752) |
| clip_grad | [Oneflow-Inc/oneflow7757](https://github.com/Oneflow-Inc/oneflow/pull/7757) |
| concat | [Oneflow-Inc/oneflow7204](https://github.com/Oneflow-Inc/oneflow/pull/7204) |
| conv1d | [Oneflow-Inc/oneflow7769](https://github.com/Oneflow-Inc/oneflow/pull/7769) |
| conv2d | [Oneflow-Inc/oneflow7771](https://github.com/Oneflow-Inc/oneflow/pull/7771) |
| conv3d | [Oneflow-Inc/oneflow7771](https://github.com/Oneflow-Inc/oneflow/pull/7771) |
| cumsum | [Oneflow-Inc/oneflow7772](https://github.com/Oneflow-Inc/oneflow/pull/7772) |
| deconv2d | [Oneflow-Inc/oneflow7772](https://github.com/Oneflow-Inc/oneflow/pull/7772) |
| diagonal | [Oneflow-Inc/oneflow7772](https://github.com/Oneflow-Inc/oneflow/pull/7772) |
| diag | [Oneflow-Inc/oneflow7421](https://github.com/Oneflow-Inc/oneflow/pull/7421) |
| div | [Oneflow-Inc/oneflow7421](https://github.com/Oneflow-Inc/oneflow/pull/7421) |
| dot | [Oneflow-Inc/oneflow7421](https://github.com/Oneflow-Inc/oneflow/pull/7421) |
| dropout | [Oneflow-Inc/oneflow7772](https://github.com/Oneflow-Inc/oneflow/pull/7772) |
| empty | [Oneflow-Inc/oneflow7508](https://github.com/Oneflow-Inc/oneflow/pull/7508) |
| eq | [Oneflow-Inc/oneflow7421](https://github.com/Oneflow-Inc/oneflow/pull/7421) |
| erfc | [Oneflow-Inc/oneflow7421](https://github.com/Oneflow-Inc/oneflow/pull/7421) |
| erf | [Oneflow-Inc/oneflow7421](https://github.com/Oneflow-Inc/oneflow/pull/7421) |
| expand | [Oneflow-Inc/oneflow7772](https://github.com/Oneflow-Inc/oneflow/pull/7772) |
| expm1 | [Oneflow-Inc/oneflow7421](https://github.com/Oneflow-Inc/oneflow/pull/7421) |
| eye | [Oneflow-Inc/oneflow7421](https://github.com/Oneflow-Inc/oneflow/pull/7421) |
| flatten | [Oneflow-Inc/oneflow7421](https://github.com/Oneflow-Inc/oneflow/pull/7421) |
| flip | [Oneflow-Inc/oneflow7496](https://github.com/Oneflow-Inc/oneflow/pull/7496) |
| floor | [Oneflow-Inc/oneflow7421](https://github.com/Oneflow-Inc/oneflow/pull/7421) |
| fmod | [Oneflow-Inc/oneflow7421](https://github.com/Oneflow-Inc/oneflow/pull/7421) |
| fold | [Oneflow-Inc/oneflow7772](https://github.com/Oneflow-Inc/oneflow/pull/7772) |
| greater_equal | [Oneflow-Inc/oneflow7421](https://github.com/Oneflow-Inc/oneflow/pull/7421) |
| greater | [Oneflow-Inc/oneflow7366](https://github.com/Oneflow-Inc/oneflow/pull/7366) |
| fused_bias_add_dropout | [Oneflow-Inc/oneflow7867](https://github.com/Oneflow-Inc/oneflow/pull/7867) |
| fused_bias_add_gelu | [Oneflow-Inc/oneflow7867](https://github.com/Oneflow-Inc/oneflow/pull/7867) |
| fused_scale_mask_softmax_dropout | [Oneflow-Inc/oneflow7867](https://github.com/Oneflow-Inc/oneflow/pull/7867) |
| fused_scale_mask_softmax | [Oneflow-Inc/oneflow7867](https://github.com/Oneflow-Inc/oneflow/pull/7867) |
| fused_scale_tril | [Oneflow-Inc/oneflow7867](https://github.com/Oneflow-Inc/oneflow/pull/7867) |
| fused_self_attention | [Oneflow-Inc/oneflow7867](https://github.com/Oneflow-Inc/oneflow/pull/7867) |
| fused_tril_softmax_mask_scale | [Oneflow-Inc/oneflow7867](https://github.com/Oneflow-Inc/oneflow/pull/7867) |
| gather_nd | [Oneflow-Inc/oneflow7880](https://github.com/Oneflow-Inc/oneflow/pull/7880) |
| gather | [Oneflow-Inc/oneflow7880](https://github.com/Oneflow-Inc/oneflow/pull/7880) |
| glu | [Oneflow-Inc/oneflow7880](https://github.com/Oneflow-Inc/oneflow/pull/7880) |
| grid_sample | [Oneflow-Inc/oneflow7881](https://github.com/Oneflow-Inc/oneflow/pull/7881) |
| groupnorm | [Oneflow-Inc/oneflow7885](https://github.com/Oneflow-Inc/oneflow/pull/7885) |
| masked_fill | [Oneflow-Inc/oneflow7457](https://github.com/Oneflow-Inc/oneflow/pull/7457) |
| masked_select | [Oneflow-Inc/oneflow7492](https://github.com/Oneflow-Inc/oneflow/pull/7492) |
| math_ops | [Oneflow-Inc/oneflow7461](https://github.com/Oneflow-Inc/oneflow/pull/7461) |
| matmul | [Oneflow-Inc/oneflow7465](https://github.com/Oneflow-Inc/oneflow/pull/7465) |
| maxpool | [Oneflow-Inc/oneflow7683](https://github.com/Oneflow-Inc/oneflow/pull/7683) |
| max | [Oneflow-Inc/oneflow7450](https://github.com/Oneflow-Inc/oneflow/pull/7450) |
| mean | [Oneflow-Inc/oneflow7650](https://github.com/Oneflow-Inc/oneflow/pull/7650) |
| meshgrid | [Oneflow-Inc/oneflow7533](https://github.com/Oneflow-Inc/oneflow/pull/7533) |
| min_max_observer | [Oneflow-Inc/oneflow7725](https://github.com/Oneflow-Inc/oneflow/pull/7725) |
| min | [Oneflow-Inc/oneflow7450](https://github.com/Oneflow-Inc/oneflow/pull/7450) |
| movedim | [Oneflow-Inc/oneflow7679](https://github.com/Oneflow-Inc/oneflow/pull/7679) |
| moving_average_min_max_observer | [Oneflow-Inc/oneflow7726](https://github.com/Oneflow-Inc/oneflow/pull/7726) |
| mul | [Oneflow-Inc/oneflow7717](https://github.com/Oneflow-Inc/oneflow/pull/7717) |
| narrow | [Oneflow-Inc/oneflow7647](https://github.com/Oneflow-Inc/oneflow/pull/7647) |
| negative | [Oneflow-Inc/oneflow7644](https://github.com/Oneflow-Inc/oneflow/pull/7644) |
| ne | [Oneflow-Inc/oneflow7642](https://github.com/Oneflow-Inc/oneflow/pull/7642) |
| nms | [Oneflow-Inc/oneflow7536](https://github.com/Oneflow-Inc/oneflow/pull/7536) |
| nonzero | [Oneflow-Inc/oneflow7645](https://github.com/Oneflow-Inc/oneflow/pull/7645) |
| normalize | [Oneflow-Inc/oneflow7635](https://github.com/Oneflow-Inc/oneflow/pull/7635) |
| ones_like | [Oneflow-Inc/oneflow7635](https://github.com/Oneflow-Inc/oneflow/pull/7635) |
| parital_fc | [Oneflow-Inc/oneflow7534](https://github.com/Oneflow-Inc/oneflow/pull/7534) |
| permute | [Oneflow-Inc/oneflow7635](https://github.com/Oneflow-Inc/oneflow/pull/7635) |
| prod | [Oneflow-Inc/oneflow7635](https://github.com/Oneflow-Inc/oneflow/pull/7635) |
| randint | [Oneflow-Inc/oneflow7508](https://github.com/Oneflow-Inc/oneflow/pull/7508) |
| rand | [Oneflow-Inc/oneflow7508](https://github.com/Oneflow-Inc/oneflow/pull/7508) |
| reshape | [Oneflow-Inc/oneflow7472](https://github.com/Oneflow-Inc/oneflow/pull/7472) |
| roi_align | [Oneflow-Inc/oneflow7794](https://github.com/Oneflow-Inc/oneflow/pull/7794) |
| scatter_nd | [Oneflow-Inc/oneflow7807](https://github.com/Oneflow-Inc/oneflow/pull/7807) |
| scatter_ops | [Oneflow-Inc/oneflow7807](https://github.com/Oneflow-Inc/oneflow/pull/7807) |
| sign | [Oneflow-Inc/oneflow7818](https://github.com/Oneflow-Inc/oneflow/pull/7818) |
| slice | [Oneflow-Inc/oneflow7818](https://github.com/Oneflow-Inc/oneflow/pull/7818) |
| softplus | [Oneflow-Inc/oneflow7818](https://github.com/Oneflow-Inc/oneflow/pull/7818) |
| sparse_softmax_cross_entropy | [Oneflow-Inc/oneflow7298](https://github.com/Oneflow-Inc/oneflow/pull/7298) |
| split | [Oneflow-Inc/oneflow7277](https://github.com/Oneflow-Inc/oneflow/pull/7277) |
| sqrt_square_sum | [Oneflow-Inc/oneflow7277](https://github.com/Oneflow-Inc/oneflow/pull/7277) |
| squeeze | [Oneflow-Inc/oneflow7289](https://github.com/Oneflow-Inc/oneflow/pull/7289) |
| stack | [Oneflow-Inc/oneflow7289](https://github.com/Oneflow-Inc/oneflow/pull/7289) |
| stateful_kernel_with_cache | [Oneflow-Inc/oneflow7289](https://github.com/Oneflow-Inc/oneflow/pull/7289) |
| std | [Oneflow-Inc/oneflow7303](https://github.com/Oneflow-Inc/oneflow/pull/7303) |
| sub | [Oneflow-Inc/oneflow7303](https://github.com/Oneflow-Inc/oneflow/pull/7303) |
| sum | [Oneflow-Inc/oneflow7303](https://github.com/Oneflow-Inc/oneflow/pull/7303) |
| tensor_ops | [Oneflow-Inc/oneflow7307](https://github.com/Oneflow-Inc/oneflow/pull/7307) |
| tensor_scatter_nd_update | [Oneflow-Inc/oneflow7308](https://github.com/Oneflow-Inc/oneflow/pull/7308) |
| tile | [Oneflow-Inc/oneflow7322](https://github.com/Oneflow-Inc/oneflow/pull/7322) |
| transpose | [Oneflow-Inc/oneflow7332](https://github.com/Oneflow-Inc/oneflow/pull/7332) |
| tril | [Oneflow-Inc/oneflow7322](https://github.com/Oneflow-Inc/oneflow/pull/7322) |
| TripletMarginLoss | [Oneflow-Inc/oneflow7332](https://github.com/Oneflow-Inc/oneflow/pull/7332) |
| triu | [Oneflow-Inc/oneflow7882](https://github.com/Oneflow-Inc/oneflow/pull/7882) |
| unfold | [Oneflow-Inc/oneflow7883](https://github.com/Oneflow-Inc/oneflow/pull/7883) |
| unfold_tensor | [Oneflow-Inc/oneflow7883](https://github.com/Oneflow-Inc/oneflow/pull/7883) |
| unsqueeze | [Oneflow-Inc/oneflow7882](https://github.com/Oneflow-Inc/oneflow/pull/7882) |
| upsample | [Oneflow-Inc/oneflow7884](https://github.com/Oneflow-Inc/oneflow/pull/7884) |
| var | [Oneflow-Inc/oneflow7891](https://github.com/Oneflow-Inc/oneflow/pull/7891) |
| view | [Oneflow-Inc/oneflow7886](https://github.com/Oneflow-Inc/oneflow/pull/7886) |
| weight_norm | [Oneflow-Inc/oneflow7886](https://github.com/Oneflow-Inc/oneflow/pull/7886) |
| where | [Oneflow-Inc/oneflow7886](https://github.com/Oneflow-Inc/oneflow/pull/7886) |
| zeropad2d | [Oneflow-Inc/oneflow7886](https://github.com/Oneflow-Inc/oneflow/pull/7886) |



EP::Primitive

Completed some unit tests of Primitives: `log_softmax`, `softmax`, `copynd`, `Memset`, `Memcpy`, `matmul`, `batch_matmul`, `add`, binary, unary, fill, etc. (https://github.com/Oneflow-Inc/oneflow/pull/8132, https://github.com/Oneflow-Inc/oneflow/pull/8139, https://github.com/Oneflow-Inc/oneflow/pull/8137, https://github.com/Oneflow-Inc/oneflow/pull/8109, https://github.com/Oneflow-Inc/oneflow/pull/8143, https://github.com/Oneflow-Inc/oneflow/pull/8108, https://github.com/Oneflow-Inc/oneflow/pull/8154, https://github.com/Oneflow-Inc/oneflow/pull/8118, https://github.com/Oneflow-Inc/oneflow/pull/8291)

Exception

Improved exception error handling:

- Added `reshape` exception handling. (https://github.com/Oneflow-Inc/oneflow/pull/7847)

- Improved the error message of module when the input information does not match. (https://github.com/Oneflow-Inc/oneflow/pull/7918)

- Added the `MAYBE_NEED_ERROR_MSG_CHECK` environment variable to check whether the CHECK functions of Maybe contain an `oneflow::Error` message, prompting developers to add error messages. (https://github.com/Oneflow-Inc/oneflow/pull/7955)

- Improved the exception error message of `gather` op.(https://github.com/Oneflow-Inc/oneflow/pull/7979)

- Improved `LayerNorm` error message. (https://github.com/Oneflow-Inc/oneflow/pull/8090)

- Optimized the error message when Eager and Graph encounter multiple inconsistent input placement in op. (https://github.com/Oneflow-Inc/oneflow/pull/8054)

- Improved the error message checking in activation-related kernel processing logic.(https://github.com/Oneflow-Inc/oneflow/pull/8080)

- Improved the error message in `tensor.to_global` and `tensor.to_local`. (https://github.com/Oneflow-Inc/oneflow/pull/8067)

- Improved the exception error message in the `dot` kernel. (https://github.com/Oneflow-Inc/oneflow/pull/8051)

- Rewrote the exception check in the `batch_matmul` kernel. (https://github.com/Oneflow-Inc/oneflow/pull/8186)

- Fixed exception error checking when Python parses args. (https://github.com/Oneflow-Inc/oneflow/pull/8205)

- Improved the exception error checking logic of all array functor. (https://github.com/Oneflow-Inc/oneflow/pull/8116)

- Improved the exception error checking logic of all binary functor. (https://github.com/Oneflow-Inc/oneflow/pull/8161)

- Improved the exception error reporting logic in nn grad functor. (https://github.com/Oneflow-Inc/oneflow/pull/8210)

- Added error message when Graph.build is not reloaded. (https://github.com/Oneflow-Inc/oneflow/pull/8250)

- Added TypeError type and device-related error message. (https://github.com/Oneflow-Inc/oneflow/pull/8057)

- Improved the error message of Eager SliceBoxing. (https://github.com/Oneflow-Inc/oneflow/pull/8232)

- Improved the error message of the broadcast op.

- Improved the error message of Eager Boxing when it is at runtime. (https://github.com/Oneflow-Inc/oneflow/pull/7926)

- Improved the error message of Tensor index. (https://github.com/Oneflow-Inc/oneflow/pull/8234)

- Improved the error message in nn.functor. (https://github.com/Oneflow-Inc/oneflow/pull/7910)

- Added check for Physical Shape when Graph compiles exec_graph. (https://github.com/Oneflow-Inc/oneflow/pull/8002)

- Added default error message for CUDA check. (https://github.com/Oneflow-Inc/oneflow/pull/8427)

- Added similar error-checking information to the add_n calculation. (https://github.com/Oneflow-Inc/oneflow/pull/8495)

- Improved the error message of argsort. (https://github.com/Oneflow-Inc/oneflow/pull/8513)

- Improved the error message of bias add. (https://github.com/Oneflow-Inc/oneflow/pull/8524)

- Improved the error message in autograd function. (https://github.com/Oneflow-Inc/oneflow/pull/8496)

- Improved the error message of batch gather. (https://github.com/Oneflow-Inc/oneflow/pull/8533)

- Improved the error message prompt of defense code in autograd. (https://github.com/Oneflow-Inc/oneflow/pull/8525 , https://github.com/Oneflow-Inc/oneflow/pull/8541)

Build

- Supported CUDA 11.5 and 11.6. (https://github.com/Oneflow-Inc/oneflow/pull/7852 , https://github.com/Oneflow-Inc/oneflow/pull/8423)

- Pinned the click version to 8.0.0. (https://github.com/Oneflow-Inc/oneflow/pull/7967)

- Updated the NCCL version to 2.12.10. (https://github.com/Oneflow-Inc/oneflow/pull/7822)

- Aligned the default PyTorch version to 1.10.0. (https://github.com/Oneflow-Inc/oneflow/pull/7019)

- Updated tvm oneflow frontend dependencies. (https://github.com/Oneflow-Inc/oneflow/pull/8048)

- Updated the version of LLVM/MLIR to support IREE. (https://github.com/Oneflow-Inc/oneflow/pull/8068 , https://github.com/Oneflow-Inc/oneflow/pull/8461)

- Pinned the protobuf version to between 3.9.2 and 4.0. (https://github.com/Oneflow-Inc/oneflow/pull/8198)

- Removed the cfg tool in cmake. (https://github.com/Oneflow-Inc/oneflow/pull/8218)

- Enabled `CMAKE_INTERPROCEDURAL_OPTIMIZATION` by default. (https://github.com/Oneflow-Inc/oneflow/pull/8237)

- Removed the XRT part from the OneFlow source code; OneFlow-XRT will be used as a third-party plugin for OneFlow. (https://github.com/Oneflow-Inc/oneflow/pull/8273 , https://github.com/Oneflow-Inc/oneflow/pull/8288)

- read more: https://github.com/Oneflow-Inc/oneflow-xrt

- Changed liboneflow to a dynamic library. (https://github.com/Oneflow-Inc/oneflow/pull/8312)

- Updated clang-tidy to 14.0.4, which now supports the following syntax: NOLINT, NOLINTNEXTLINE, NOLINTBEGIN & NOLINTEND. (https://github.com/Oneflow-Inc/oneflow/pull/8306)

- Removed `EXTERNAL_INCLUDE_DIRS`; builds now rely on targets only. (https://github.com/Oneflow-Inc/oneflow/pull/8421)

- Removed obsolete linkages in cmake. (https://github.com/Oneflow-Inc/oneflow/pull/8426)


CI

Improved the running speed and stability of CI

- Supported CI to automatically upload built docs. (https://github.com/Oneflow-Inc/oneflow/pull/7894 , https://github.com/Oneflow-Inc/oneflow/pull/7917)

- Added CI test for IREE. (https://github.com/Oneflow-Inc/oneflow/pull/8419)

- Printed the pip packages in the test container to make version information easy to query. (https://github.com/Oneflow-Inc/oneflow/pull/7952)

- Optimized the old version of SpeedTest. (https://github.com/Oneflow-Inc/oneflow/pull/7871 https://github.com/Oneflow-Inc/oneflow/pull/7990 https://github.com/Oneflow-Inc/oneflow/pull/8035)

- Optimized the memory used by AutoTest. (https://github.com/Oneflow-Inc/oneflow/pull/7988)

- Adjusted the threshold of benchmark. (https://github.com/Oneflow-Inc/oneflow/pull/8043)

- Adjusted the timeout threshold. (https://github.com/Oneflow-Inc/oneflow/pull/8103)

- Optimized the warning output related to `__del__` in CI. (https://github.com/Oneflow-Inc/oneflow/pull/8049)

- Optimized the interval of gc to improve the test speed. (https://github.com/Oneflow-Inc/oneflow/pull/8138)

- Optimized the use of oversized Tensors in CI unit tests to prevent slow garbage collection from dragging down CI speed. (https://github.com/Oneflow-Inc/oneflow/pull/8177)

- Optimized the number of CI builds to improve build speed. (https://github.com/Oneflow-Inc/oneflow/pull/8229)

- Optimized the CI workflow, stopping all workflows when a job fails. (https://github.com/Oneflow-Inc/oneflow/pull/8255)

- Increased maximum parallelism from 5 to 10. (https://github.com/Oneflow-Inc/oneflow/pull/8259)

- Enforced strict CI timeout-minutes. (https://github.com/Oneflow-Inc/oneflow/pull/8266)

- Supported optional multi-machine testing via the `need-test-distributed` tag. (https://github.com/Oneflow-Inc/oneflow/pull/8372)

- Tried to use a distributed test cache when testing on multiple machines. (https://github.com/Oneflow-Inc/oneflow/pull/8387/files)

- Optimized the test time of global test. (https://github.com/Oneflow-Inc/oneflow/pull/8468)

- Optimized the execution time of test_math_ops, test_loss, test_activation, test_tensor_part1, test_tensor_part2, and other eager tests. (https://github.com/Oneflow-Inc/oneflow/pull/8494)

- Optimized test_convtranspose, test_einsum, and test_sqrt_square_sum in the expensive eager tests. (https://github.com/Oneflow-Inc/oneflow/pull/8504)


Models

- Added the test of LiBai in CI. (https://github.com/Oneflow-Inc/oneflow/pull/7537, https://github.com/Oneflow-Inc/oneflow/pull/7929)

- Fixed the speed test for Swin-Transformer. (https://github.com/Oneflow-Inc/oneflow/pull/7840)

- Added the benchmark test for flow-vision. (https://github.com/Oneflow-Inc/oneflow/pull/7806, https://github.com/Oneflow-Inc/oneflow/pull/8024)

- Added compatibility tests for `conv_mixer`, `densenet`, `ghostnet`, `googlenet`, `inception_v3`, `mnasnet`, `rexnet`, `rexnet_lite`, `res2net`, `shufflenet_v2`, `squeezenet`, `convnext`, `crossformer`, `efficientnet`, `levit`, `mlp_mixer`, `poolformer`, `pvt`, `res_mlp`, `uniformer`, `swin_transformer`, `senet`, and other models. Fixed the compatibility issues found along the way: the conv2d module's padding parameter did not support strings; the parameter list of functional.layer_norm was not aligned; meshgrid did not support list[tensor] inputs; and added a `tensor.reshape_as` interface. (https://github.com/Oneflow-Inc/oneflow/pull/7942)

- Fixed the bug of Swin-Transformer dataloader. (https://github.com/Oneflow-Inc/oneflow/pull/8037)

- Added single-node 4-GPU tests for models such as InsightFace in the oneflow_face repository. (https://github.com/Oneflow-Inc/oneflow/pull/8130)


Bug fixes

Graph

- Fixed the bug of nccl deadlock caused by CUDA kernel asynchronous launch limit for nccl logical kernel in 3-D parallelism. (https://github.com/Oneflow-Inc/oneflow/pull/7924)

- Fixed the circular import between scope and session. (https://github.com/Oneflow-Inc/oneflow/pull/7993)

- Used log_softmax + nll to make the computing subgraph of sparse_softmax_cross_entropy more numerically stable (see the sketch after this list). (https://github.com/Oneflow-Inc/oneflow/pull/7987)

- Fixed the bug that B2P boxing misses TaskEdge lbi. (https://github.com/Oneflow-Inc/oneflow/pull/8052)

- Fixed the problem that compilation fails because an eager free tensor is not in nn.Graph's job. (https://github.com/Oneflow-Inc/oneflow/pull/8114)

- Fixed a possible segmentation fault caused by BlobDesc. (https://github.com/Oneflow-Inc/oneflow/pull/8252)

- Fixed a circular import bug in Python 3.6. (https://github.com/Oneflow-Inc/oneflow/pull/8268)

- Solved the problem that Graph's input and parameter/buffer tensors failed to handle non-contiguous tensors. (https://github.com/Oneflow-Inc/oneflow/pull/8281)

- Solved the potential deadlock caused by inconsistent partial order execution of multiple ranks in 3-D parallelism. (https://github.com/Oneflow-Inc/oneflow/pull/8226)

- Fixed the bug that ibverbs failed to initialize the environment due to an incorrect MTU value in special network environments. (https://github.com/Oneflow-Inc/oneflow/pull/8451)

- Solved the potential deadlock caused by partial-order execution across ranks when the subsequent subgraph of GradAcc is inserted into the NCCL logical op; also traversed GradAcc's subsequent subgraph more comprehensively to fix missing NCCL ops. (https://github.com/Oneflow-Inc/oneflow/pull/8459)

- Fixed the bug that NCCL logical kernels do not support the bool type. (https://github.com/Oneflow-Inc/oneflow/pull/8455)

- Fixed the bug of tensor detach and clone in Graph. (https://github.com/Oneflow-Inc/oneflow/pull/8498)
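
The numerical-stability trick mentioned above can be illustrated in isolation. A minimal sketch, assuming oneflow's PyTorch-aligned `log_softmax` and `nll_loss` functional API:

```python
import oneflow as flow
import oneflow.nn.functional as F

# Computing cross entropy as log_softmax + nll avoids exponentiating
# large logits, where a naive softmax would overflow.
logits = flow.tensor([[1000.0, 0.0]])
target = flow.tensor([0])
loss = F.nll_loss(F.log_softmax(logits, dim=1), target)
print(loss)  # finite, close to 0.0
```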


Eager

- Aligned `DataLoader.__next__` interface (https://github.com/Oneflow-Inc/oneflow/pull/7835)

- Fixed backtracking failure when calculating higher-order derivatives, which is caused by the capturing of forward detached tensors via `AutoGrad`

- Fixed the Barrier instruction's inadequate execution of sync semantics (https://github.com/Oneflow-Inc/oneflow/pull/7702)

- Fixed memory leak caused by imperfect management of VM instruction count

- Fixed `getitem` when tensor device id is not in the current rank

- Fixed `global norm` error on gradient calculation for various placements when calling clip grad in pipeline parallelism in eager global mode (https://github.com/Oneflow-Inc/oneflow/pull/7879)

- Fixed possible int32 arithmetic overflow caused by `Shape.elem_cnt` (https://github.com/Oneflow-Inc/oneflow/pull/8178)

- Fixed incorrect results produced by `Module.to_global` when introducing parameters (https://github.com/Oneflow-Inc/oneflow/pull/8187)

- Fixed extra GPU memory usage in `flow.load` and `module.load_state_dict` (https://github.com/Oneflow-Inc/oneflow/pull/8301)

- Fixed extra GPU memory usage when Optimizer loads models (https://github.com/Oneflow-Inc/oneflow/pull/8310)

- Fixed the error occurs when loading models via `flow.load` in multi nodes (https://github.com/Oneflow-Inc/oneflow/pull/8314)

- Fixed instability of eager caused by the introduction of callback thread (https://github.com/Oneflow-Inc/oneflow/pull/8193)

- Fixed the `tensor.from_numpy` interface to avoid memory leaks when the numpy input is non-contiguous (https://github.com/Oneflow-Inc/oneflow/pull/8391)

- Fixed stack overflow when destructing the deep backward computational graph after recursion (https://github.com/Oneflow-Inc/oneflow/pull/8056)


Operators & Tensor

Global Tensor

- Fixed global SBP inference of `unfold` (https://github.com/Oneflow-Inc/oneflow/pull/7883)

- Fixed global SBP inference of `grid_sample` (https://github.com/Oneflow-Inc/oneflow/pull/7881)

- Fixed incorrect pass of values in slice boxing kernel in certain cases (https://github.com/Oneflow-Inc/oneflow/pull/7893)

- Fixed eager global inplace (https://github.com/Oneflow-Inc/oneflow/pull/7903)

- Fixed SBP inference of `upsample` op (https://github.com/Oneflow-Inc/oneflow/pull/7884)

- Fixed SBP inference of `ScatterAdd`, `ScatterUpdate`, and `ScatterScalarUpdate` (https://github.com/Oneflow-Inc/oneflow/pull/7807)

- Fixed backward memory error of `partial_fc` with Global Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8041)

- Added support for S0 in `randperm` and fixed equal local tensors across all ranks in random op in Split (https://github.com/Oneflow-Inc/oneflow/pull/7571)

- Fixed tensor getitem index error in global (https://github.com/Oneflow-Inc/oneflow/pull/8153)

- Fixed SBP inference of `RoiAlign` and added global unit test (https://github.com/Oneflow-Inc/oneflow/pull/7794)

- Fixed SBP inference of `stack` op (https://github.com/Oneflow-Inc/oneflow/pull/8181)

- Fixed random initialization in median under CPU global (https://github.com/Oneflow-Inc/oneflow/pull/8245)

- Fixed SBP inference of `narrow` op and added global unit test for `narrow` and `chunk` (https://github.com/Oneflow-Inc/oneflow/pull/7750)

- Improved legal SBP list of `batch_matmul` (https://github.com/Oneflow-Inc/oneflow/pull/8385)

- Fixed NLLLoss’ failure to support model parallelism (https://github.com/Oneflow-Inc/oneflow/pull/8380)

- Fixed S->S and S->P inference in Slice Op SBP infer (https://github.com/Oneflow-Inc/oneflow/pull/8521)


Tensor

- Fixed the bug that occurs when Tensor dim is set to -1

- Fixed failure to directly convert a Tensor to int and float in Python (https://github.com/Oneflow-Inc/oneflow/pull/7927)

- Fixed the bug in `Tensor.is_contiguous` that skips initialization when caching and executes random initialization when getting values (https://github.com/Oneflow-Inc/oneflow/pull/7785)

- Fixed the bug in Tensor slice view under 1d contiguous (https://github.com/Oneflow-Inc/oneflow/pull/7898)

- Fixed incorrect processing of None value by `Tensor.__eq__` (https://github.com/Oneflow-Inc/oneflow/pull/7938)

- Fixed unaligned memory size in `from_numpy` interface (https://github.com/Oneflow-Inc/oneflow/pull/7963)

- Fixed incorrect initialization of random seed in Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7904)

- Fixed failure of `oneflow.Size` to create a Tensor with a specified shape (https://github.com/Oneflow-Inc/oneflow/pull/8429)

- Aligned `alpha` parameter in `Tensor.add` (https://github.com/Oneflow-Inc/oneflow/pull/8140)
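
As a minimal illustration of the aligned `alpha` semantics (out = self + alpha * other), assuming the PyTorch-style signature:

```python
import oneflow as flow

x = flow.ones(3)
y = flow.ones(3)
# out = x + alpha * y
print(x.add(y, alpha=2))  # tensor([3., 3., 3.], dtype=oneflow.float32)
```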

Scalar Tensor

- Fixed failure of `add` to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7827)

- Fixed failure of `reduce_sum` to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7866)

- Fixed failure of `one_hot` to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/7975)

- Fixed failure of `gather` to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8376)

- Fixed “memory access out of bounds” error in `dim_scatter` kernel under Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8418)

- Fixed failure of the start and end parameters in the `arange` op to support Scalar Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8522)

- Fixed failure of `all` to support Scalar Tensor and 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8547)


0-Size Tensor

- Fixed failure of `conv` and `deconv` to support 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8001)

- Fixed failure of `cuda_check_numerics` to support 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8050)

- Fixed failure of `expand` and `advanced_index` to support 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8094)

- Fixed the bug that occurs when processing 0-Size Tensor in the `repeat_interleave` kernel and removed the relevant special judgment in `gather` (https://github.com/Oneflow-Inc/oneflow/pull/8414)

- Fixed failure of `diag` to support 0-Size Tensor (https://github.com/Oneflow-Inc/oneflow/pull/8557)

Operators

- Fixed sorting in `nms` unit test (https://github.com/Oneflow-Inc/oneflow/pull/7831)

- Aligned the beta and threshold parameters of the `softplus` op with torch (https://github.com/Oneflow-Inc/oneflow/pull/7888)

- Fixed failure of `expand` to support passing tuples as parameters (https://github.com/Oneflow-Inc/oneflow/pull/7913)

- Fixed computation failure in `randperm` when n is too large (https://github.com/Oneflow-Inc/oneflow/pull/7908)

- Fixed failure to pass a list or tuple as parameters to `meshgrid` (https://github.com/Oneflow-Inc/oneflow/pull/7933)

- Fixed the `nn.functional.conv2d` bug that required all parameters to be specified (https://github.com/Oneflow-Inc/oneflow/pull/7892)

- Fixed failure of `rand` and `randn` to support tuple as an input (https://github.com/Oneflow-Inc/oneflow/pull/7914)

- Fixed the bug that occurs in `concat` when inputs are of inconsistent data types (https://github.com/Oneflow-Inc/oneflow/pull/7921)

- Fixed the wrong device id obtained by the generator in certain cases in `randn`, `dropout`, `randint`, `rand`, `random_mask_like`, and `randperm` (https://github.com/Oneflow-Inc/oneflow/pull/7896)

- Fixed inconsistent behaviors of `__shfl_sync` under `sm_61` in `layernorm` (https://github.com/Oneflow-Inc/oneflow/pull/7978)

- Fixed failure of `scatter` op to support negative dim (https://github.com/Oneflow-Inc/oneflow/pull/7934)

- Fixed the bug in the `scatter` op's nd update value (https://github.com/Oneflow-Inc/oneflow/pull/7953)

- Fixed failure of `masked_select` to support certain Broadcast operations in eager mode (https://github.com/Oneflow-Inc/oneflow/pull/7984)

- Fixed the bug in `PReLU` op when dispatching num_blocks (https://github.com/Oneflow-Inc/oneflow/pull/8004)

- Fixed misused numpy forced-synchronization logic in the Python layer of `index_select` and moved the logic into a functor (https://github.com/Oneflow-Inc/oneflow/pull/7965)

- Aligned dtype parameter in `prod` (https://github.com/Oneflow-Inc/oneflow/pull/7932)

- Fixed the bug occurs when `ord = 0` in `linalg.vector_norm` op; Fixed check on nan/inf by clip_grad (https://github.com/Oneflow-Inc/oneflow/pull/8007)

- Fixed failure of `min` and `max` to operate on inconsistent dtypes (https://github.com/Oneflow-Inc/oneflow/pull/8021)

- Added a `num_batches_tracked` buffer to `batch_norm` to facilitate transferring ResNet-18, a torch pretrained model, to OneFlow (https://github.com/Oneflow-Inc/oneflow/pull/7920)

- Fixed the misuse of `logf`, `expf`, and `powf` in math kernel (https://github.com/Oneflow-Inc/oneflow/pull/8038)

- Fixed the missing dtype parameter in `cumsum` and `cumprod` and provided `Tensor.cumsum` and `Tensor.cumprod` methods (https://github.com/Oneflow-Inc/oneflow/pull/8065)

- Fixed possible overflow when dtype is not int64 in `non_zero` op (https://github.com/Oneflow-Inc/oneflow/pull/7907)

- Aligned `sum`, `mean`, `all`, `any`, and `prod` operations in `reduce` (https://github.com/Oneflow-Inc/oneflow/pull/8085)

- Fixed incorrect backward computation in `cumprod` (https://github.com/Oneflow-Inc/oneflow/pull/8136)

- Aligned `alpha` parameter in `sub` operation (https://github.com/Oneflow-Inc/oneflow/pull/8026)

- Fixed shape inference in `upsample` op (https://github.com/Oneflow-Inc/oneflow/pull/8105)

- Fixed failure of `addn` inplace operation on CPU tensor (https://github.com/Oneflow-Inc/oneflow/pull/8280)

- Fixed limit on tensor size in `cum` backward op based on the size of shared memory (https://github.com/Oneflow-Inc/oneflow/pull/8289)

- Improved the logic of dtype inference for `arange` op (https://github.com/Oneflow-Inc/oneflow/pull/8338)

- Fixed NaN propagation of UnaryFunctor (https://github.com/Oneflow-Inc/oneflow/pull/8346)

- Fixed ndim check of `pad` (https://github.com/Oneflow-Inc/oneflow/pull/8354)

- Fixed vector check in `broadcast_min` and `broadcast_max` backward computations (https://github.com/Oneflow-Inc/oneflow/pull/8379)

- Fixed the bug in the index computation logic of the `cumprod` op (https://github.com/Oneflow-Inc/oneflow/pull/8388)

- Fixed possible int32 overflow in `softmax` and math unary / binary cuda kernel; for kernels that operate integer division on `i` in `CUDA_1D_KERNEL_LOOP`, provided `if` statement to branch computations to prevent performance loss in most cases when int32 works (https://github.com/Oneflow-Inc/oneflow/pull/8472)

- Fixed failure to pass size via `size=(...)` in random ops (`normal`, `rand`, `randn`, `randint`, and `randperm`) (https://github.com/Oneflow-Inc/oneflow/pull/8506)
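
A short sketch of the fixed `size` keyword, assuming PyTorch-aligned signatures for these random ops:

```python
import oneflow as flow

a = flow.randn(size=(2, 3))          # shape passed via the size keyword
b = flow.randint(0, 10, size=(4,))   # low, high, then size
c = flow.randperm(5)                 # a random permutation of 0..4
print(a.shape, b.shape, c.shape)
```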


Device

- Fixed error in `cudaGetDeviceCount` when CUDA device count=0 (https://github.com/Oneflow-Inc/oneflow/pull/8184)

- Fixed possible unregistration of devices caused by `hob.ToString` method; Used static local variables to establish dependency between static variables of device registration and the static code for device registration (https://github.com/Oneflow-Inc/oneflow/pull/8235)

- Fixed `cudaErrorNoDevice` caused by driver errors (https://github.com/Oneflow-Inc/oneflow/pull/8262)

- Fixed memory leak caused by realpath (https://github.com/Oneflow-Inc/oneflow/pull/8540)


Higher order derivative

- Introduced AutogradCapturedTensor in backward computation to avoid circular reference and allow correct backtracking to the input gradient node in higher order derivative graph (https://github.com/Oneflow-Inc/oneflow/pull/7808)

- Added higher order derivatives for the `sin`/`cos` ops; fixed `autograd` bugs related to higher order derivatives (see the sketch after this list) (https://github.com/Oneflow-Inc/oneflow/pull/8163)

- Fixed bugs in backward computation in `concat` and `split_like` to support higher order derivative (https://github.com/Oneflow-Inc/oneflow/pull/8208)
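
A small sketch of a second derivative through `sin`, assuming oneflow's torch-style `autograd.grad` with `create_graph`:

```python
import oneflow as flow

x = flow.tensor(1.0, requires_grad=True)
y = flow.sin(x)
(dy,) = flow.autograd.grad(y, x, create_graph=True)  # cos(x)
(d2y,) = flow.autograd.grad(dy, x)                   # -sin(x)
print(dy, d2y)
```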


Build

- Fixed RTD [sphinx] failure to build docstr (https://github.com/Oneflow-Inc/oneflow/pull/7901)

- Fixed compilation failure caused by opencv copy header failure (https://github.com/Oneflow-Inc/oneflow/pull/7944)

- Fixed failure to generate a new `.so` in compilation when `CMAKE_LINK_DEPENDS_NO_SHARED=YES` (https://github.com/Oneflow-Inc/oneflow/pull/7868)

- Fixed Eigen url in cmake third party (https://github.com/Oneflow-Inc/oneflow/pull/8223)

- Fixed the bug caused by multi-time linking to libof_protoobj in XRT (https://github.com/Oneflow-Inc/oneflow/pull/8326)

- Made libproto a dynamic library to avoid collision between static global variables (https://github.com/Oneflow-Inc/oneflow/pull/8345)

- Made `of_pyext_obj` statically linked so that only one Python extension dynamic library has Python symbols (https://github.com/Oneflow-Inc/oneflow/pull/8393)

- Fixed the bug in `undefined symbol: del_curterm` in source code compilation (https://github.com/Oneflow-Inc/oneflow/issues/8398)

- Fixed false positive warning in gcc11 compilation (https://github.com/Oneflow-Inc/oneflow/pull/8401)

- Fixed SegFault that occurs when unzipping dataset in the container by making zlib a dynamic library (https://github.com/Oneflow-Inc/oneflow/pull/8481)

- Fixed undefined reference of culibosTlsSetValue (https://github.com/Oneflow-Inc/oneflow/pull/8479)

- Fixed stringop-truncation compilation error for gcc9 (https://github.com/Oneflow-Inc/oneflow/pull/8532)


CI

- Disabled static link of Simple CI and enabled debug build to avoid too many symbols (https://github.com/Oneflow-Inc/oneflow/pull/7940)

- Fixed the bug in AutoTest fake program; Fixed print error in AutoTest (https://github.com/Oneflow-Inc/oneflow/pull/8279; https://github.com/Oneflow-Inc/oneflow/pull/8290)

Module

- Temporarily disabled the conv3d test due to its relatively large error with random values (https://github.com/Oneflow-Inc/oneflow/pull/7969)

- Reduced test error in nn.LayerNorm (https://github.com/Oneflow-Inc/oneflow/pull/7941)

- Optimized input data range of certain math op tests (https://github.com/Oneflow-Inc/oneflow/pull/8010)

- Fixed incorrect unit test case in `permute` (https://github.com/Oneflow-Inc/oneflow/pull/8083)

- Aligned error message of chunk to torch (https://github.com/Oneflow-Inc/oneflow/pull/8096)

- Fixed incorrect use of `permute` in tensor tests (https://github.com/Oneflow-Inc/oneflow/pull/8144)

- Fixed omission of test cases in `instancenorm` (https://github.com/Oneflow-Inc/oneflow/pull/8215)

- Adjusted unit test threshold for `leaky_relu` (https://github.com/Oneflow-Inc/oneflow/pull/8242)

- Commented out the CPU batch-norm grad method that tests with random values (https://github.com/Oneflow-Inc/oneflow/pull/8257)

- Skipped test cases of `global argmax` and `median` in multi-GPU scenarios (https://github.com/Oneflow-Inc/oneflow/pull/8264)

- Adjusted unit test threshold for `fused_dot_feature_interaction` (https://github.com/Oneflow-Inc/oneflow/pull/8293)

- Disabled unit tests for `conv_transpose1d`, `conv_transpose2d`, and `conv_transpose3d` (https://github.com/Oneflow-Inc/oneflow/pull/8319)

- Adjusted tolerance setting in embedding_renorm unit test (https://github.com/Oneflow-Inc/oneflow/pull/8394)

- Removed test cases with excessive accumulated elements in `test_fused_dot_feature_interaction_pooling_sum` to avoid overly large sum error (https://github.com/Oneflow-Inc/oneflow/pull/8425)


Documentation

- Ensured that all PyTorch references in OneFlow API documentation belong to the same PyTorch version (1.10.0) (https://github.com/Oneflow-Inc/oneflow/pull/8058)

- Added "copy" button for code in API docs to facilitate trial runs of sample code (https://github.com/Oneflow-Inc/oneflow/pull/7997)

- Refined script that automatically generates version status for OneFlow APIs and fixed bugs in docs (https://github.com/Oneflow-Inc/oneflow/pull/8546)

- Refined interface documentation of Tensor and Module (https://github.com/Oneflow-Inc/oneflow/pull/7823)

- Refined `Tensor.to_global` interface documentation and added descriptions of `grad_sbp`

- Refined `Tensor.to_local` interface documentation

- Added Tensor Attributes docs for `oneflow.placement`, `oneflow.env.all_device_placement`, and `oneflow.sbp.sbp`

- Added interface documentation for `Module.to_consistent` (outdated) and `Module.to_global`

- Fixed invalid links in Tensor docs and updated `consistent` to `global` (https://github.com/Oneflow-Inc/oneflow/pull/7821)

- Added docstr for `Tensor.sqrt`, `Tensor.square`, `Tensor.addmm`, `Tensor.cosh`, `Tensor.diagonal`, `Tensor.log`, `Tensor.ndim`, and `Tensor.rsqrt` (https://github.com/Oneflow-Inc/oneflow/pull/7841)

- Enabled derived classes of pybind11 to add documentation for non-overriding methods and added interface documentation related to Tensor and autograd (https://github.com/Oneflow-Inc/oneflow/pull/7849)

- Refined documentation of `oneflow.argsort` (https://github.com/Oneflow-Inc/oneflow/pull/7844)

- Refined documentation of `Tensor.zero_`, `Tensor.is_contiguous`, `Tensor.is_cuda`, and `oneflow.nn.functional.layer_norm` op (https://github.com/Oneflow-Inc/oneflow/pull/7839)

- Refined interface documentation of `support_sparse` and `step` in `oneflow.optim.Adamw`, `oneflow.optim.SGD` (https://github.com/Oneflow-Inc/oneflow/pull/7848)

- Refined interface documentation of `LambdaLR.step`, `ReduceLROnPlateau.in_cooldown`, and `ReduceLROnPlateau.is_better` (https://github.com/Oneflow-Inc/oneflow/pull/7848)

- Refined interface documentation of `nn.Module` (https://github.com/Oneflow-Inc/oneflow/pull/8190)

- Refined interface documentation of `oneflow.optim.lr_scheduler.PolynomialLR` (https://github.com/Oneflow-Inc/oneflow/pull/8430)

- Refined docs and formula illustrations for `oneflow.nn.CombinedMarginLoss` (https://github.com/Oneflow-Inc/oneflow/pull/8206)

- Refined documentation of `oneflow.logical_and`, `oneflow.logical_or`, `oneflow.logical_xor`, and `oneflow.logical_not` (https://github.com/Oneflow-Inc/oneflow/pull/8297)

- Fixed the bug in the documentation of quantization ops (https://github.com/Oneflow-Inc/oneflow/pull/8333)

- Updated solution in Troubleshooting for the case when `libunwind.h` is not found (https://github.com/Oneflow-Inc/oneflow/pull/8336)

- Restructured API documentation based on features; added and refined docs of features that are unique to OneFlow (https://github.com/Oneflow-Inc/oneflow/pull/8392)

0.7.0

- Fundamental features enter into Beta Stage, meeting most requirements of users;

- Advanced features enter into Alpha Stage, meeting standard requirements of users;

- ResNet50, Wide and Deep, GPT, Bert, Swin-Transformer, InsightFace, and other models are supported;

Feature of nn.Graph

- Static and dynamic casting of operators under Static Graph enter into Beta Stage from Alpha Stage

- Adds unit tests of static execution for all legal operators under nn.Graph; automated unit testing is ready;

- Supports more flexible inputs and outputs, including List/Tuple/Dict and their nesting, and fixes the problem of a Tuple producing a return size of "1";

- Adds backward automatic test;

- Optimizer and LR Scheduler under Static Graph enter into Beta Stage from Alpha Stage.

- Adds more built-in LR schedulers, including WarmupLR, CosineAnnealingWarmRestarts, and other common schedulers, and provides SequentialLR and ChainedScheduler to enable schedulers to be combined in different ways;

- Refactors the scheduler's get_lr function into a pure function. This permits using schedulers in combination, since lr is now computed analytically instead of iteratively;

- Adds "is_sparse" parameter for `add_optimizer` interface, supporting sparse updates under graph mode. Optimizers that support sparse updates include Adam and SGD, while optimizers under Eager mode don't support sparse updates yet. Subsequent version will support both sparse updates and sparse tensor. The feature is at Pre-alpha Stage;

- Adds a Debug print feature for LR and Step; you only need to turn on the LR Scheduler's `verbose` option.

- `state_dict` and `load_state_dict` under Static Graph are newly added, which allow resuming training from the last checkpoint. The feature is at Beta Stage;

- Debug under Static Graph enters into Beta Stage from Alpha Stage;

- Adds `debug(2)` and `debug(3)`, which help find problems in nn.Module by locating the Python code of operators at the C++ layer and by locating forward graph creation and inference for operators;

- Adds the display of memory overhead

- ZeRO-DP under Static Graph is newly added, which reduces Optimizer-related memory overhead under data parallelism; the feature is at Alpha Stage;

- Global Tensor under Static Graph supports multiple parallel methods, and the feature is between Alpha Stage and Beta Stage;

- It is utilized in LiBai and other model libraries;

- It is widely utilized in OneFlow's model libraries, and the coverage of unit test is still ongoing;

- 1D Global Tensor lets you define only the input tensor's SBP, while the output tensor's SBP is derived automatically with good results; the feature is at Beta Stage;

- 2D Global Tensor lets you define only the input tensor's SBP, while the output tensor's SBP is derived automatically with good results; the feature is at Alpha Stage;

- Conversion from 1D to ND or ND to 1D is newly supported, and the feature is at Alpha Stage;

- Random conversion of 2D SBP is newly supported, and the feature is at Alpha Stage;

- Testing of 1D&2D single operator is still ongoing, and the feature is at Pre-alpha Stage;

- Selecting SBP with semi-automatic derivation is supported, and the feature is at Pre-alpha Stage;

- For Gradient Accumulation under Static Graph, we refactored and repaired support for Reshape and added API documentation. In place of the `mini-batch` input interface, a future version will offer `micro-batch` input with a better experience. The feature moves from Pre-Alpha to Alpha Stage;

- For pipeline parallelism under Static Graph, the tutorial is perfected, and pipeline parallelism is available in Libai and other model libraries. The feature is at Beta Stage;

- For automatic mixed precision (AMP) under Static Graph, the API documentation is newly added. The feature moves from Pre-Alpha to Alpha Stage;

- For Activation Checkpointing under Static Graph, the API documentation is newly added. The feature moves from Pre-Alpha to Alpha Stage;

- For Op Fuse optimization under Static Graph, the API documentation is newly added. The feature moves from Pre-Alpha to Alpha Stage;

- For XLA/TensorRT/OpenVINO execution under Static Graph, the API documentation is newly added. The feature moves from Pre-Alpha to Alpha Stage;

Tutorials

- en https://docs.oneflow.org/en/master/basics/08_nn_graph.html
- zh https://docs.oneflow.org/master/basics/08_nn_graph.html

API Documentation

- en https://oneflow.readthedocs.io/en/master/graph.html
- zh https://start.oneflow.org/oneflow-api-cn/graph.html

Tutorials of pipeline parallelism:

- en https://docs.oneflow.org/en/master/parallelism/06_pipeline.html
- zh https://docs.oneflow.org/master/parallelism/06_pipeline.html

Model support under nn.Graph

- Training ResNet50 with single-node single-GPU or single-node multi-GPU is supported, https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/resnet50
- Wide and Deep model is supported, https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems/wide_and_deep
- GPT, Bert, and Swin Transformer in Libai are supported, https://github.com/Oneflow-Inc/libai
- Functional problems in support for the above models are resolved;

3. Performance optimization of Eager

- The performance of Eager is deeply optimized. When OneFlow runs Swin-Transformer on a V100 GPU, a single GPU delivers a 25% speedup over PyTorch, and 8 GPUs a 10% speedup;

- The communication scheduling policy for NCCL in DDP is optimized;

- DDP supports the optimization of AllReduce fuse, reducing additional overhead generated by fragmented AllReduce, with a 5% performance speedup when it is tested on ResNet50;

- VM supports the optimization of **instruction fusion**, significantly saving scheduling overhead of Kernel;

- Additional memory overhead is reduced when CPU load is too high;

- Eager DataLoader supports the optimization of inter-process memory sharing;

- The performance of Clip Grad is optimized;

4. Improvements of operators

- OneFlow is successfully adapted to oneDNN for CPU operator acceleration.

The performance of CPU operators such as unary and binary element-wise ops is improved by 4 times, and the speed of Swin-Transformer's dataloader is improved by 2.5 times. https://github.com/Oneflow-Inc/oneflow/pull/7319

- Adds the functionality of inter-process shared memory to Dataloader, which greatly improves the performance of DataLoader in DDP.

- Adds Bool type Tensor. https://github.com/Oneflow-Inc/oneflow/pull/7523

- Implements the to_contiguous operation that view relies on. https://github.com/Oneflow-Inc/oneflow/pull/7670

- Adds Scalar div operators. https://github.com/Oneflow-Inc/oneflow/pull/7483

- Adds Lamb optimizer. https://github.com/Oneflow-Inc/oneflow/pull/7389

- Adds Polynomial Learning Rate Scheduler. https://github.com/Oneflow-Inc/oneflow/pull/7260

- Adds tensor_split and as_strided operators. https://github.com/Oneflow-Inc/oneflow/pull/7258 & https://github.com/Oneflow-Inc/oneflow/pull/7275

- Adds cumprod operators. https://github.com/Oneflow-Inc/oneflow/pull/7278

- Adds Tensor.T() and oneflow.t() operators. https://github.com/Oneflow-Inc/oneflow/pull/7269

- Adds normalize operators. https://github.com/Oneflow-Inc/oneflow/pull/7113

- Adds the inplace version of div and sub operators. https://github.com/Oneflow-Inc/oneflow/pull/7293

- Adds the feature of Module.zero_grad. https://github.com/Oneflow-Inc/oneflow/pull/7587/

- Adds support for using a Scalar Tensor as the index in list indexing. https://github.com/Oneflow-Inc/oneflow/pull/7597

- Adds half type support for the Leaky ReLU operator. https://github.com/Oneflow-Inc/oneflow/pull/7569

- Adds support for mask select operators. https://github.com/Oneflow-Inc/oneflow/pull/7492

- Adds non-reduce communication operations such as Bool type Broadcast and Allgather. https://github.com/Oneflow-Inc/oneflow/pull/7366

- Develops autotest support for eager global based on the autotest framework. https://github.com/Oneflow-Inc/oneflow/pull/7204

- Optimizes performance for ReduceSum CUDA Kernel. https://github.com/Oneflow-Inc/oneflow/pull/7684

- Optimizes CUDA Kernel of gather operators. https://github.com/Oneflow-Inc/oneflow/pull/7351

- Optimizes the performance for CUDA Kernel of MaxPool and AvgPool operators in NCHW. https://github.com/Oneflow-Inc/oneflow/pull/7426 & https://github.com/Oneflow-Inc/oneflow/pull/7451

- Optimizes the backward computing of PReLU operators, which can save more memory in general. https://github.com/Oneflow-Inc/oneflow/pull/7600

- Optimizes backward Kernel of LayerNorm to further save memory. https://github.com/Oneflow-Inc/oneflow/pull/6996

- Supports passing a single int for stride and dilation in the Conv1D/2D/3D and DeConv1D/2D/3D kernels; adds a Tensor.zero_() interface; aligns tensor.norm, torch.max, and torch.min with PyTorch; supports inplace in flow.nn.functional.dropout. https://github.com/Oneflow-Inc/oneflow/pull/7593

- Fixes bug where the BatchNorm module raises an error when affine=False. https://github.com/Oneflow-Inc/oneflow/pull/7755

- Fixes Maximum and Minimum backward bug. https://github.com/Oneflow-Inc/oneflow/pull/7519

- Fixes bug where the result of var operators is unexpected in some cases. https://github.com/Oneflow-Inc/oneflow/pull/7517

- Fixes incorrect behavior of Tensor deepcopy bug. https://github.com/Oneflow-Inc/oneflow/pull/7490

- Fixes bug where the input index is a scalar tensor in slice operators. https://github.com/Oneflow-Inc/oneflow/pull/7479

- Fixes bug where BinaryCrossEntropy can produce nan in half. https://github.com/Oneflow-Inc/oneflow/pull/7476

- Fixes bug where an error is raised when the base and exponent of pow operators are respectively real number type and Tensor type. https://github.com/Oneflow-Inc/oneflow/pull/7729

- Fixes stack operators backward bug. https://github.com/Oneflow-Inc/oneflow/pull/7363

- Fixes inefficiency problem caused by CPU synchronization when clip grad is executed on CUDA with the default configuration. https://github.com/Oneflow-Inc/oneflow/pull/7304

- Fixes the SBP inference of Batch Gather and Unsorted Batch Segment Sum operators, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7590

- Fixes Physical Shape inference of Affine Grid operators, fixes the unexpected result bug in some SBP cases, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7578

- Fixes the problem that arange operators don't support generating 0 size tensor, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7576

- Fixes the incorrect SBP inference of flip operators, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7496

- Fixes SBP bugs in advanced indexing and zeroslike operators. https://github.com/Oneflow-Inc/oneflow/pull/7238

- Fixes bug where Eager global inplace might not be successful. https://github.com/Oneflow-Inc/oneflow/pull/7348

5. Supporting einsum & view mechanism

Adds `einsum` operators. `einsum` provides a set of concise but elegant rules, which can implement tensor operations including but not limited to: inner product, outer product, tensor multiplication, tensor transposition and tensor contraction, etc. Proficient use of `einsum` allows you to easily implement various complex tensor operations and be less error-prone. https://github.com/Oneflow-Inc/oneflow/pull/7526
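
A small sketch of `einsum` in use, assuming the PyTorch-aligned subscript semantics: a batched matrix multiplication and a trace expressed with the same one-line rule syntax.

```python
import oneflow as flow

a = flow.randn(2, 3, 4)
b = flow.randn(2, 4, 5)
bmm = flow.einsum("bij,bjk->bik", a, b)  # batched matmul, shape (2, 3, 5)

m = flow.randn(3, 3)
tr = flow.einsum("ii->", m)              # trace: sum of the diagonal
print(bmm.shape, tr)
```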

Adds `view` mechanism. The view mechanism allows the common operators to reuse/share Tensor's memory, and the memory can be saved by reducing the Kernel Launch/Compute process. At present, new view operators that do not change the tensor.is_contiguous() property have been added, such as reshape, view, squeeze, unsqueeze, etc.: https://github.com/Oneflow-Inc/oneflow/pull/7503 More view operators will be added later (such as transpose, permute, narrow, expand, and unfold).
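
A minimal sketch of the memory-sharing behavior: `view` returns a tensor backed by the same buffer, so an in-place write shows through to the source.

```python
import oneflow as flow

x = flow.zeros(2, 3)
y = x.view(6)   # no copy: y shares x's memory
y[0] = 1.0
print(x[0, 0])  # tensor(1., ...) -- the write is visible in x
```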

6. Improvements of the compiler

- OneFlow is officially connected to the MLIR ecosystem, and the OneFlow Dialect component is complete. OneFlow successfully completes the RoundTrip between an OneFlow Job (the computation graph of OneFlow nn.Graph) and MLIR, and runs RoundTrip tests on all OneFlow operators in the CI process.

- Implements static graph optimization with a series of automatic fused operators based on MLIR DRR to accelerate OneFlow model training and inference.

7. OneFlow Serving

OneFlow Serving v0.1.0 comes out with the following features:

- Provides OneFlow C++ API used for inference, supporting model loading and static graph inference.

- The model weights and the computation graph in MLIR format can be saved simultaneously by running `flow.save(graph)` in Python. They can be loaded in C++ API (while loading computation graph is not supported in Python API at present).

- Supports inference of OneFlow model using TensorRT and OpenVINO automatically without model conversion (based on OneFlow XRT module), achieving better acceleration on NVIDIA GPU and Intel CPU.

- Implements Triton OneFlow backend

- Provides an out-of-the-box Docker image.
- Supports auto configuration: only the model path needs to be given; no Triton configuration file needs to be written.

- Welcome to use the [project deployed with Triton OneFlow backend](https://oneflow.cloud/drill/#/project/public/code?id=7fc904d8dbe0069820da5d6d32a764fe) launched on OneFlow Cloud Platform.

8. LiBai

LiBai is a toolbox for massively distributed parallel training of Transformer models. Compared with custom code bases such as Megatron-LM, LiBai provides a series of models and training components for distributed training based on a modular design, aiming to make distributed training as convenient as single-GPU training. The 0.1.0 version mainly supports the following features and models:

Features:

- Data Parallelism
- 1D Tensor Parallelism
- Pipeline Parallelism
- Unified Distributed Layers
- Extensible for new parallelism
- Mixed Precision Training
- Activation Checkpointing
- Gradient Accumulation
- Gradient Clip
- ZeRO
- More flexible "LazyConfig" configuration system
- Easy-to-use `Trainer` and `Evaluator`
- Data preprocessing supporting images and texts

Models:

- `Bert` (3D Parallelism)
- `GPT-2` (3D Parallelism)
- `ViT` (3D Parallelism)
- `Swin-Transformer` (Data Parallelism)
- Supports fine-tuning tasks in `projects/`
- Supports text classification tasks in `projects/`

9. flow-vision

The flowvision 0.1.0 stable version comes out with the following improvements over the previous version:

- Adds initialization method `trunc_normal_`
- Adds `DeiT` model, rebuilt `VisionTransformer` model
- Adds `ConvNeXt` model
- Adds `ReXNet` model
- Supports Learning Rate Schedule in `PolyLRScheduler` and `TanhLRScheduler`
- Fixes the use of `F.normalize` in SSD model
- Fixes bugs in `EfficientNet` and `Res2Net`
- Fixes weights problem in `vit_small_patch32_384` and `res2net50_48w_2s` models
- Rebuilds `model zoo` and runs more complete tests on existing models
- Rebuilds `load_state_dict_from_url` method to automatically save the downloaded weights in the cache folder
- Improves documents about `Getting Started` and `flowvision.models`

The 0.2.0 version of flowvision is already in progress. A large number of new models will be added based on the 0.1.0 version, and the documentation will be improved, so stay tuned.

0.6.0

> OneFlow has been open source for 528 days since July 31, 2020. Today OneFlow v0.6.0 came out. Welcome to use OneFlow v0.6.0. We would love to hear your feedback!

This version mainly updates three parts: framework, models, and OneFlow-ONNX. Highlights include:

- Performance optimization in static graphs, dynamic graphs, operators, memory occupation, etc
- A larger number of common operators
- Improvements in static graphs and ConsistentTensor
- Serving functionality as Nvidia Triton's backend
- Richer visual pre-training models similar to torchvision and timm
- Better OneFlow-ONNX conversion functionality

The following are the detailed release notes.

Framework

1. Performance Optimization of nn.Graph

- Compared to v0.5.0, nn.Graph in v0.6.0 delivers a 10% speedup in training on models such as ResNet AMP and WDL, etc
- Optimized nn.Graph's performance in high frequency iterative training scenarios
- Redesigned the scheduling instructions of nn.Graph and refactored the interaction logic between Actor Graph and Eager VM so that the runtime execution of the Graph is asynchronous and parallel to Python input/output Tensor as much as possible

2. Performance Optimization of Eager

- Compared to v0.5.0, v0.6.0 OneFlow Eager's training speed increases dramatically in small batch scenarios
- Optimized the scheduling logic for virtual machines
- Optimized get/set item
- Optimized tensor.numel()
- Optimized oneflow.Size()

3. Performance Optimization of Operators

- Optimized some operators that affect the performance of new models to significantly improve their training speed
- Added fused dropout operators
- Added CPU-version group deconv and optimized its performance
- Added inplace-version implementation for operators mul, hard_sigmoid, and sin
- Optimized the performance of linalg.vector_norm when ord=2.0, making it 4 times faster than before
- Deeply optimized the LayerNorm operator, making its performance greatly better than PyTorch and Apex implementation. For more information, refer to [How to Implement an Efficient LayerNorm CUDA Kernel — OneFlow Performance Optimization](https://oneflow2020.medium.com/how-to-implement-an-efficient-layernorm-cuda-kernel-oneflow-performance-optimization-731e91a285b8)
- Realized automatic type promotion of operators. For more information, refer to [Automatic Type Promotion of Operators in OneFlow](https://oneflow2020.medium.com/automatic-type-promotion-in-oneflow-9f8c6079b81)
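
As a small illustration of automatic type promotion (a sketch; the exact rules are described in the linked article):

```python
import oneflow as flow

a = flow.tensor([1, 2], dtype=flow.int32)
b = flow.tensor([0.5, 0.5], dtype=flow.float32)
# The int32 operand is promoted, so the result is float32.
print((a + b).dtype)  # oneflow.float32
```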

4. Performance Optimization of Eager's Memory Occupation

- Optimized some operators' memory occupation during net training, allowing the same computing device to run bigger models or data
- Optimized the backward memory occupation of broadcast binary operators
- Optimized the backward memory occupation of Slice operator
- Optimized the memory occupation of LayerNorm operator

5. More Useful Features to Static Computation Graph (nn.Graph)

- The newly added features are related to the efficiency, debugging, completeness, and usability of static graphs
- To help the debugging of static graphs, we added the following features:
- debug mode supports graph.debug(1) to print more information about the graph composition
- Provided the environment variable ONEFLOW_DEBUG_PASS to show the changes in the computed graph before and after compile-time optimization
- Added user-readable thread naming information to Nsight Profile for locating and retrieving target key thread locations
- Added many static graph test cases and added automatic nn.Graph tests that accompany Eager tests
- Provided graph.save() and load() interfaces to support the deployment of models (Serving) using nn.Graph
- To get AMP acceleration on GPUs with TensorCores, the environment variable ONEFLOW_ENABLE_NHWC is provided to make CNN-related operators compute in channels-last layout
- Enabled nn.Graph to support more usage scenarios:
- Supported for Sparse Update Optimizer for sparse update of parameters in WDL scenarios
- Supported for using the following nn.Module Containers with nn.Graph:
Sequential, ModuleList, ModuleDict, ParameterList, and ParameterDict
- Supported for creating Optimizer in the init function of nn.Graph
- Supported multiple parameters sharing the same Tensor with nn.Graph
- Supported for scenarios where the actual number of processes is greater than the number of GPU devices
- Supported more Inplace execution for Consistent SBP inference under nn.Graph

6. A Larger Number of Operators

- Newly added operators: cumsum, meshgrid, linspace, diagonal, movedim, roialign, nms, arccos, and roll
- Newly added operators: masked_fill, floordiv, glu, pool1d, pool2d, and pool3d
- Newly added unfold and fold operators: [Adding Unfold and Fold Ops into OneFlow](https://oneflow2020.medium.com/adding-unfold-and-fold-ops-into-oneflow-a4ae5f0ca328)
- Achieved automatic data type promotion of operators: [Automatic Type Promotion of Operators in OneFlow](https://oneflow2020.medium.com/automatic-type-promotion-in-oneflow-9f8c6079b81)
- Added expand and repeat operators: [Added Expand and Repeat Operators into OneFlow](https://oneflow2020.medium.com/add-expand-and-repeat-ops-into-oneflow-42c42be69429)
- Supported one-click switching for the current torchvision library models by the command `import oneflow as torch`
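
A sketch of the one-click switch: torch-style model code runs against OneFlow unchanged, assuming the APIs used are among those OneFlow has aligned.

```python
import oneflow as torch  # OneFlow stands in for torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3)
x = torch.randn(1, 3, 224, 224)
print(conv(x).shape)  # oneflow.Size([1, 8, 222, 222])
```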

7. User-Defined autograd.Function

Users can customize autograd.Function just like using Torch.
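
A minimal sketch of a custom autograd.Function, assuming the Torch-style staticmethod API with ctx.save_for_backward:

```python
import oneflow as flow

class Exp(flow.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = flow.exp(x)
        ctx.save_for_backward(y)  # stash the output for the backward pass
        return y

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        return grad_output * y    # d/dx exp(x) = exp(x)

x = flow.randn(3, requires_grad=True)
Exp.apply(x).sum().backward()
print(x.grad)
```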

8. Added Basic Serving Functionality

Serving functionality of models is provided by OneFlow as Nvidia Triton's backend.

9. Added Some Functionalities of Tensor (ConsistentTensor)

- Supported Tensor using 2-D SBP to represent arbitrary hybrid parallelism (such as a Linear operation that runs data parallelism in the row direction of the device matrix and model parallelism in the column)
- Supported Tensor's conversion from arbitrary 1-D SBP to 2-D SBP (the network consists of a mixture of 1-D parallel and 2-D parallel)
- Supported constructing ConsistentTensor from numpy
- oneflow.from_numpy()
- oneflow.numel()
- tensor.expand_as()
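
A short sketch of these interfaces together, assuming PyTorch-aligned behaviors:

```python
import numpy as np
import oneflow as flow

arr = np.arange(6, dtype=np.float32).reshape(2, 3)
t = flow.from_numpy(arr)       # construct a tensor from a numpy array
print(flow.numel(t))           # 6

row = flow.tensor([1.0, 2.0, 3.0])
print(row.expand_as(t).shape)  # oneflow.Size([2, 3])
```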

Model

[Released flowvision 0.0.54](https://github.com/Oneflow-Inc/vision).

1. Richer Visual Pre-training Models

Image Classification

- CNN series: `ResNet`, `DenseNet`, `VGG`, `ResNext`, `EfficientNet`, etc
- Vision Transformer series: `ViT`, `PVT`, `Swin-Transformer`, etc
- Vision MLP series: `Mlp-Mixer`, `Res-MLP`, `g-MLP`, etc

Object Detection

- SSD, SSDLite
- Faster R-CNN
- RetinaNet

Image Segmentation

- FCN
- DeepLabV3

Style Transfer

- StyleNet: supports styles `sketch`, `candy`, `mosaic`, `rain_princess`, and `undie`

2. Implemented Data Augmentation Operations Similar to torchvision

For data augmentation operations like `CenterCrop` and `ColorJitter` that mirror torchvision, developers can run `import flowvision as torchvision` and execute unchanged in most scenarios.
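
A sketch of the drop-in use, assuming flowvision mirrors torchvision's transforms module layout:

```python
import flowvision as torchvision  # flowvision stands in for torchvision

transform = torchvision.transforms.Compose([
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ColorJitter(brightness=0.4),
])
```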

3. Implemented Advanced Data Augmentation Operations Similar to timm

Advanced data augmentation operations implemented in flowvision.data:

- Mixup
- CutMix
- Random-Erasing
- AutoAugment
- RandAugment
- AugMix

4. Separated the Layers Module and Provided a Plug-and-play Block when Building a Model

flowvision.layers.attention

- Implemented plug-and-play attention models like `Non-Local`, `SELayer`, `CBAM`, `BAM`, `ECA`, etc

flowvision.layers.blocks

- Provided modules that might be used for model building like `PatchEmb`, `Pooler`, `ConvBnAct`, etc

flowvision.layers.regularization

- Provided regularization modules such as `drop-path`, `drop-block`, and `stochastic depth` to improve model generalization ability
- Provided separate files such as `activation` and `weight_init` to improve components like `activation function` and `initialization method`

OneFlow-ONNX Conversion

Updated OneFlow to ONNX toolkit:

- Supported OneFlow model converting to ONNX model in CPU or GPU mode
- Added test cases for operators and models to align all classification models in OneFlowVision library
- Fixed onnx-runtime bugs during PReLU conversion
- Compatible with v1.9.0 onnx-runtime library or later versions
- Released the v0.5.4 oneflow-onnx package; developers can run `pip install oneflow-onnx` to try it out
