[CPU] I64 native support. #18236
Conversation
/**
 * @brief Enables inference with INT64 data type in CPU plugin if it's presented in the original model.
 */
DECLARE_CONFIG_KEY(CPU_NATIVE_I64);
Should be discussed before merging.
@antonvor, could you please review?
Force-pushed from 30426ff to 98eeff9.
@@ -331,6 +354,9 @@ void Gather::prepareParams() {
        } else if (x64::mayiuse(x64::avx2)) {
            selectedPD->setImplementationType(jit_avx2);
        }
    } else {
        // TODO: Add tests
Have these tests been added? If yes, we can remove this comment
@@ -29,12 +29,12 @@ bool Reorder::isExecutable() const {
    return Node::isExecutable() && !isOptimized;
}

Reorder::Reorder(const std::shared_ptr<ngraph::Node>& op, const GraphContext::CPtr context) :
Reorder::Reorder(const std::shared_ptr<ngraph::Node>& op, const GraphContext::CPtr &context) :
Can we use a common code style?
Suggested change:
Reorder::Reorder(const std::shared_ptr<ngraph::Node>& op, const GraphContext::CPtr &context) :
Reorder::Reorder(const std::shared_ptr<ngraph::Node>& op, const GraphContext::CPtr& context) :
if (axisOp->get_element_type() == ov::element::i64) {
    axis = axisOp->get_data_ptr<int64_t>()[0];
} else {
    axis = axisOp->cast_vector<int64_t>()[0];
Why do we use different methods here? Can we just always cast the constant to an int64_t vector?
The first one is cheaper than the cast_vector function.
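For context, a minimal sketch of the trade-off the author is pointing at (identifiers taken from the quoted diff, everything else assumed):

// get_data_ptr<int64_t>() reinterprets the Constant's existing buffer, so it is a
// zero-copy read that is only valid when the stored element type is already i64.
// cast_vector<int64_t>() allocates a std::vector and converts every element, so it
// handles any stored type at the cost of a copy.
int64_t axis;
if (axisOp->get_element_type() == ov::element::i64) {
    axis = axisOp->get_data_ptr<int64_t>()[0];   // cheap: direct read of the raw i64 data
} else {
    axis = axisOp->cast_vector<int64_t>()[0];    // converting copy for other stored types
}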
IE_ASSERT(to->GetDataType() == memory::data_type::s32);
// IE_ASSERT(to->GetDataType() == memory::data_type::s32);
Should we remove this assert and correct the corresponding comment above?
src/plugins/intel_cpu/src/emitters/x64/jit_eltwise_emitters.cpp (outdated, resolved)
// TODO: Actually the Result is bool in U8 representation. 0x01 or 0xFF - is there a difference for real models?
// Remove all vpsrld instructions if there is no difference.
Should we check it within this PR?
jit_greater_equal_emitter::jit_greater_equal_emitter(x64::jit_generator *host, x64::cpu_isa_t host_isa, const std::shared_ptr<ov::Node>& node,
    Precision exec_prc)
    const Precision &exec_prc)
Different style in one method (for host, node, and exec_prc).
auto idxShape = (ov::as_type<ov::op::v0::Constant>(op->get_input_node_ptr(TARGET_SHAPE_IDX)))->get_vector<int64_t>();
targetShape.reserve(idxShape.size());
targetShape.assign(idxShape.begin(), idxShape.end());
As I understand it, here we copy idxShape values to targetShape with conversion to Dim. Maybe we can just use the cast_vector method to do the same?
Suggested change:
auto idxShape = (ov::as_type<ov::op::v0::Constant>(op->get_input_node_ptr(TARGET_SHAPE_IDX)))->get_vector<int64_t>();
targetShape.reserve(idxShape.size());
targetShape.assign(idxShape.begin(), idxShape.end());
targetShape = (ov::as_type<ov::op::v0::Constant>(op->get_input_node_ptr(TARGET_SHAPE_IDX)))->cast_vector<Dim>();
Second part of review comments. The emitters part is left unreviewed.
case element::i64: {
    value = ov::as_type_ptr<ov::op::v0::Constant>(n)->cast_vector<int64_t>()[0];
    break;
}
If I am not mistaken, value has int32_t type and the int64_t Constant's value is converted to int32_t. Maybe we can just instantiate cast_vector with int32_t to avoid the additional cast?
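A possible form of that suggestion (a sketch only, assuming value is indeed declared as int32_t; cast_vector then performs the i64-to-i32 conversion itself):

case element::i64: {
    // Hypothetical variant: instantiate cast_vector with int32_t so no extra narrowing cast is needed.
    value = ov::as_type_ptr<ov::op::v0::Constant>(n)->cast_vector<int32_t>()[0];
    break;
}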
@@ -109,9 +104,9 @@ void Convert::initSupportedPrimitiveDescriptors() {
        supportedPrimitiveDescriptors.emplace_back(config, impl_desc_type::unknown);
    } else if (inputShapes.size() == 1 && outputShapes.size() == 1) {
        const Shape& insShape = getInputShapeAtPort(0);
        auto insPrecision = getOriginalInputPrecisionAtPort(0);
        const auto &insPrecision = getOriginalInputPrecisionAtPort(0);
Minor typo:
Suggested change:
const auto &insPrecision = getOriginalInputPrecisionAtPort(0);
const auto &inPrecision = getOriginalInputPrecisionAtPort(0);
if (!supportedPrimitiveDescriptors.empty())
    return;
Why do we remove this check? It exists in all CPU nodes. I propose to leave it.
    IE_THROW() << "Can't create primitive descriptor for NonZero layer with name: " << getName() << " doesn't support "
               << inPrc.name() << " precision on 0 port";
}
auto outPrc = getOriginalOutputPrecisionAtPort(0);
if (!one_of(outPrc, /*Precision::I64,*/ Precision::I32)) {
Should we uncomment I64 precision?
src/plugins/intel_cpu/tests/functional/single_layer_tests/conversion.cpp (outdated, resolved)
if (maxBoxPrec == ElementType::i64) {
    params.push_back(std::make_shared<ov::opset1::Parameter>(element::Type_t::i64, inputDynamicShapes.back()));
} else {
    params.push_back(std::make_shared<ov::opset1::Parameter>(element::Type_t::i32, inputDynamicShapes.back()));
}
Can we just use maxBoxPrec as a parameter?
Suggested change:
if (maxBoxPrec == ElementType::i64) {
    params.push_back(std::make_shared<ov::opset1::Parameter>(element::Type_t::i64, inputDynamicShapes.back()));
} else {
    params.push_back(std::make_shared<ov::opset1::Parameter>(element::Type_t::i32, inputDynamicShapes.back()));
}
params.push_back(std::make_shared<ov::opset1::Parameter>(maxBoxPrec, inputDynamicShapes.back()));
maxOutBoxesPerClassNode = builder::makeConstant(maxBoxPrec, ngraph::Shape{}, std::vector<int32_t>{maxOutBoxesPerClass});
if (maxBoxPrec == ElementType::i64) {
    maxOutBoxesPerClassNode = builder::makeConstant(maxBoxPrec, ov::Shape{}, std::vector<int64_t>{maxOutBoxesPerClass});
} else {
    maxOutBoxesPerClassNode = builder::makeConstant(maxBoxPrec, ov::Shape{}, std::vector<int32_t>{maxOutBoxesPerClass});
}
maxOutBoxesPerClass has int32_t type. Do we really need to cast it to int64_t?
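If that reading is correct, one possible simplification (a sketch only, assuming builder::makeConstant converts the provided int32_t values to maxBoxPrec internally, as the existing else branch already relies on) is a single call without the branch:

// Hypothetical: let makeConstant handle the conversion to maxBoxPrec, including i64.
maxOutBoxesPerClassNode = builder::makeConstant(maxBoxPrec, ov::Shape{}, std::vector<int32_t>{maxOutBoxesPerClass});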
template <ov::element::Type_t ET>
bool evaluate(const std::shared_ptr<ov::op::v4::ReduceL1>& op, const ov::HostTensorVector& outputs, const ov::HostTensorVector& inputs) {
    using T = typename ov::element_type_traits<ET>::value_type;
    std::cout << "evaluate ReduceL1" << std::endl;
Please remove std::cout in this file.
Also, I would like to clarify one point about validation. Should we run accuracy validation (at least on subsets on a local machine using the validation scripts) to check that the I64-related changes don't introduce any regressions?
if (!supportedPrimitiveDescriptors.empty())
    return;
Why do we remove this check?
// parallel_for(IH, [&](size_t ih){
for (size_t ih = 0; ih < IH; ih++) {
    size_t oh = ih; GET_PTR_NCDH_PLN;
    reduce_kernel_process(in_ptr_ncdh, out_ptr_ncdh, IW, 1);
});
// });
Should we return parallel loop here?
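If the parallel loop is restored, it would presumably match the commented-out version (a sketch reconstructed from the quoted lines, assuming the loop body stays unchanged and parallel_for is the plugin's usual parallel helper):

parallel_for(IH, [&](size_t ih) {
    size_t oh = ih; GET_PTR_NCDH_PLN;
    reduce_kernel_process(in_ptr_ncdh, out_ptr_ncdh, IW, 1);
});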
if (dst_data_size == 8) {
    auto src_data = reinterpret_cast<const int64_t *>(proc_ptr);
    auto dst_data = reinterpret_cast<int64_t *>(out_ptr);
    parallel_for2d(DIM0, stride1, [&](size_t b, size_t j) {
        auto src_off = b * stride0 + j * DIM1;
        auto dst_off = b * stride0 + j;
        for (size_t dim1 = 0; dim1 < DIM1; dim1++) {
            dst_data[dst_off] = src_data[src_off];
            src_off++;
            dst_off += stride1;
        }
    });
} else if (dst_data_size == 4) {
Can we create a template function for this logic to avoid code duplication?
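One possible shape for such a helper (a sketch, not taken from the PR: the helper name and byte-pointer parameter types are hypothetical, the other identifiers mirror the quoted diff, and parallel_for2d is assumed to be the plugin's existing parallel utility):

// Hypothetical helper: the dst_data_size == 8 and == 4 branches differ only in the
// element type they reinterpret to, so the transposing copy can be templated.
template <typename T>
static void transposedCopy(const uint8_t* proc_ptr, uint8_t* out_ptr,
                           size_t DIM0, size_t DIM1, size_t stride0, size_t stride1) {
    auto src_data = reinterpret_cast<const T*>(proc_ptr);
    auto dst_data = reinterpret_cast<T*>(out_ptr);
    parallel_for2d(DIM0, stride1, [&](size_t b, size_t j) {
        auto src_off = b * stride0 + j * DIM1;
        auto dst_off = b * stride0 + j;
        for (size_t dim1 = 0; dim1 < DIM1; dim1++) {
            dst_data[dst_off] = src_data[src_off];
            src_off++;
            dst_off += stride1;
        }
    });
}

// The call site then reduces to a dispatch on dst_data_size:
if (dst_data_size == 8) {
    transposedCopy<int64_t>(proc_ptr, out_ptr, DIM0, DIM1, stride0, stride1);
} else if (dst_data_size == 4) {
    transposedCopy<int32_t>(proc_ptr, out_ptr, DIM0, DIM1, stride0, stride1);
}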
auto src_data = reinterpret_cast<const int64_t *>(proc_ptr);
auto dst_data = reinterpret_cast<int64_t *>(out_ptr);
parallel_for2d(DIM0, stride1, [&](size_t b, size_t j) {
    auto src_off = b * src_stride0 + j * blockLen;
    auto dst_off = b * dst_stride0 + j;
    for (size_t dim1 = 0; dim1 + blockLen <= DIM1; dim1 += blockLen) {
        for (size_t k = 0; k < blockLen; k++) {
            dst_data[dst_off] = src_data[src_off];
            src_off++;
            dst_off += stride1;
        }
        src_off += (stride1 - 1) * blockLen;
    }
    size_t tail = DIM1 % blockLen;
    for (size_t k = 0; k < tail; k++) {
        dst_data[dst_off] = src_data[src_off];
        src_off++;
        dst_off += stride1;
    }
});
} else if (dst_data_size == 4) {
The same: it would be great if we had some template function for these computations
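The same element-type parameterization would cover this blocked variant as well; a minimal dispatch sketch, assuming a hypothetical blockedTransposedCopy template holding the loop body quoted above:

// Hypothetical dispatch: both branches share one templated implementation.
if (dst_data_size == 8) {
    blockedTransposedCopy<int64_t>(proc_ptr, out_ptr, DIM0, DIM1, src_stride0, dst_stride0, stride1, blockLen);
} else if (dst_data_size == 4) {
    blockedTransposedCopy<int32_t>(proc_ptr, out_ptr, DIM0, DIM1, src_stride0, dst_stride0, stride1, blockLen);
}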
const auto input_prec = getOriginalInputPrecisionAtPort(REDUCE_DATA);
const auto output_prec = getOriginalOutputPrecisionAtPort(0);
Minor:
Suggested change:
const auto input_prec = getOriginalInputPrecisionAtPort(REDUCE_DATA);
const auto output_prec = getOriginalOutputPrecisionAtPort(0);
const auto& input_prec = getOriginalInputPrecisionAtPort(REDUCE_DATA);
const auto& output_prec = getOriginalOutputPrecisionAtPort(0);
// template <x64::cpu_isa_t isa>
// void JitReduceKernel<isa>::reduce_gather(const Vmm& vmm_dst, int64_t offset) {
//     switch (jcp.src_el_type) {
//         case ov::element::f64:
//         case ov::element::i64:
//             if (isa == x64::avx512_core) {
//                 auto ymm_idx = Ymm(v_idx.getIdx());
//                 auto zmm_src = Zmm(v_src.getIdx());
//
//                 kxnorq(k_mask, k_mask, k_mask);
//                 vgatherdpd(zmm_src | k_mask, ptr[reg_src + offset + ymm_idx]);
//                 if (jcp.src_el_type == ov::element::f64 && exec_el_type == ov::element::i64) {
//                     vcvtpd2qq(zmm_src, zmm_src);
//                 } else if (jcp.src_el_type == ov::element::i64 && exec_el_type == ov::element::f64) {
//                     vcvtqq2pd(zmm_src, zmm_src);
//                 }
Why is this code commented out? Should we remove or uncomment it?
// vcmppd(k_mask, vmm_src, v_zero, _cmp_neq_uq);
// vblendmps(vmm_src | k_mask, v_zero, v_ones);
Commented code
// reg_oc_off = getReg64();
// reg_post_ops_data = getReg64();
Commented code
this->postamble();

if (isa == x64::avx512_core) { // Only if bf16?
What does the comment mean? Should we check something before the merge?
    // TODO
}
uni_vdivpd(vmm_dst, vmm_dst, v_divider);
uni_vroundpd(vmm_dst, vmm_dst, 0x3); // Truncation
if (isa == x64::avx512_core) {
    vcvtpd2qq(vmm_dst, vmm_dst);
} else {
    // TODO
Please take a look at the TODO sections
This PR will be closed in a week because of 2 weeks of no activity.
This PR was closed because it has been stalled for 2 weeks with no activity.
Details:
Tickets: