[CPU] I64 native support. #18236
Conversation
/**
 * @brief Enables inference with INT64 data type in CPU plugin if it's presented in the original model.
 */
DECLARE_CONFIG_KEY(CPU_NATIVE_I64);
Should be discussed before merging.
@antonvor, could you please review?
Force-pushed from 30426ff to 98eeff9.
@@ -331,6 +354,9 @@ void Gather::prepareParams() {
        } else if (x64::mayiuse(x64::avx2)) {
            selectedPD->setImplementationType(jit_avx2);
        }
    } else {
        // TODO: Add tests
Have these tests been added? If yes, we can remove this comment
@@ -29,12 +29,12 @@ bool Reorder::isExecutable() const {
    return Node::isExecutable() && !isOptimized;
}

Reorder::Reorder(const std::shared_ptr<ngraph::Node>& op, const GraphContext::CPtr context) :
Reorder::Reorder(const std::shared_ptr<ngraph::Node>& op, const GraphContext::CPtr &context) :
Can we use a common code style?
Suggested change:
Reorder::Reorder(const std::shared_ptr<ngraph::Node>& op, const GraphContext::CPtr &context) :
Reorder::Reorder(const std::shared_ptr<ngraph::Node>& op, const GraphContext::CPtr& context) :
if (axisOp->get_element_type() == ov::element::i64) {
    axis = axisOp->get_data_ptr<int64_t>()[0];
} else {
    axis = axisOp->cast_vector<int64_t>()[0];
Why do we use different methods here? Can we just always cast the constant to an int64_t vector?
The first one is cheaper than the cast_vector function.
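For context, a minimal sketch of the trade-off the author is pointing at (identifiers taken from the quoted diff, everything else assumed):

// get_data_ptr<int64_t>() reinterprets the Constant's existing buffer, so it is a
// zero-copy read that is only valid when the stored element type is already i64.
// cast_vector<int64_t>() allocates a std::vector and converts every element, so it
// handles any stored type at the cost of a copy.
int64_t axis;
if (axisOp->get_element_type() == ov::element::i64) {
    axis = axisOp->get_data_ptr<int64_t>()[0];   // cheap: direct read of the raw i64 data
} else {
    axis = axisOp->cast_vector<int64_t>()[0];    // converting copy for other stored types
}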
IE_ASSERT(to->GetDataType() == memory::data_type::s32);
// IE_ASSERT(to->GetDataType() == memory::data_type::s32);
Should we remove this assert and correct the corresponding comment above?
src/plugins/intel_cpu/src/emitters/x64/jit_eltwise_emitters.cpp (outdated, resolved)
// TODO: Actually the Result is bool in U8 representation. 0x01 or 0xFF - is there a difference for real models?
// Remove all vpsrld instructions if there is no difference.
Should we check it within this PR?
jit_greater_equal_emitter::jit_greater_equal_emitter(x64::jit_generator *host, x64::cpu_isa_t host_isa, const std::shared_ptr<ov::Node>& node,
    Precision exec_prc)
    const Precision &exec_prc)
Different style in one method (for host, node, and exec_prc).
auto idxShape = (ov::as_type<ov::op::v0::Constant>(op->get_input_node_ptr(TARGET_SHAPE_IDX)))->get_vector<int64_t>();
targetShape.reserve(idxShape.size());
targetShape.assign(idxShape.begin(), idxShape.end());
As I understand it, here we copy idxShape values to targetShape with conversion to Dim. Maybe we can just use the cast_vector method to do the same?
Suggested change:
auto idxShape = (ov::as_type<ov::op::v0::Constant>(op->get_input_node_ptr(TARGET_SHAPE_IDX)))->get_vector<int64_t>();
targetShape.reserve(idxShape.size());
targetShape.assign(idxShape.begin(), idxShape.end());
targetShape = (ov::as_type<ov::op::v0::Constant>(op->get_input_node_ptr(TARGET_SHAPE_IDX)))->cast_vector<Dim>();
Second part of review comments. The emitters part is left unreviewed.
case element::i64: {
    value = ov::as_type_ptr<ov::op::v0::Constant>(n)->cast_vector<int64_t>()[0];
    break;
}
If I am not mistaken, value has int32_t type and the int64_t Constant's value is converted to int32_t. Maybe we can just instantiate cast_vector with int32_t to avoid the additional cast?
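A possible form of that suggestion (a sketch only, assuming value is indeed declared as int32_t; cast_vector then performs the i64-to-i32 conversion itself):

case element::i64: {
    // Hypothetical variant: instantiate cast_vector with int32_t so no extra narrowing cast is needed.
    value = ov::as_type_ptr<ov::op::v0::Constant>(n)->cast_vector<int32_t>()[0];
    break;
}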
@@ -109,9 +104,9 @@ void Convert::initSupportedPrimitiveDescriptors() {
        supportedPrimitiveDescriptors.emplace_back(config, impl_desc_type::unknown);
    } else if (inputShapes.size() == 1 && outputShapes.size() == 1) {
        const Shape& insShape = getInputShapeAtPort(0);
        auto insPrecision = getOriginalInputPrecisionAtPort(0);
        const auto &insPrecision = getOriginalInputPrecisionAtPort(0);
Minor typo:
Suggested change:
const auto &insPrecision = getOriginalInputPrecisionAtPort(0);
const auto &inPrecision = getOriginalInputPrecisionAtPort(0);
if (!supportedPrimitiveDescriptors.empty())
    return;
Why do we remove this check? It exists in all CPU nodes. I propose to leave it.
    IE_THROW() << "Can't create primitive descriptor for NonZero layer with name: " << getName() << " doesn't support "
               << inPrc.name() << " precision on 0 port";
}
auto outPrc = getOriginalOutputPrecisionAtPort(0);
if (!one_of(outPrc, /*Precision::I64,*/ Precision::I32)) {
Should we uncomment I64 precision?
src/plugins/intel_cpu/tests/functional/single_layer_tests/conversion.cpp (outdated, resolved)
if (maxBoxPrec == ElementType::i64) {
    params.push_back(std::make_shared<ov::opset1::Parameter>(element::Type_t::i64, inputDynamicShapes.back()));
} else {
    params.push_back(std::make_shared<ov::opset1::Parameter>(element::Type_t::i32, inputDynamicShapes.back()));
}
Can we just use maxBoxPrec as a parameter?
Suggested change:
if (maxBoxPrec == ElementType::i64) {
    params.push_back(std::make_shared<ov::opset1::Parameter>(element::Type_t::i64, inputDynamicShapes.back()));
} else {
    params.push_back(std::make_shared<ov::opset1::Parameter>(element::Type_t::i32, inputDynamicShapes.back()));
}
params.push_back(std::make_shared<ov::opset1::Parameter>(maxBoxPrec, inputDynamicShapes.back()));
maxOutBoxesPerClassNode = builder::makeConstant(maxBoxPrec, ngraph::Shape{}, std::vector<int32_t>{maxOutBoxesPerClass});
if (maxBoxPrec == ElementType::i64) {
    maxOutBoxesPerClassNode = builder::makeConstant(maxBoxPrec, ov::Shape{}, std::vector<int64_t>{maxOutBoxesPerClass});
} else {
    maxOutBoxesPerClassNode = builder::makeConstant(maxBoxPrec, ov::Shape{}, std::vector<int32_t>{maxOutBoxesPerClass});
}
maxOutBoxesPerClass has int32_t type. Do we really need to cast it to int64_t?
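If that reading is correct, one possible simplification (a sketch only, assuming builder::makeConstant converts the provided int32_t values to maxBoxPrec internally, as the existing else branch already relies on) is a single call without the branch:

// Hypothetical: let makeConstant handle the conversion to maxBoxPrec, including i64.
maxOutBoxesPerClassNode = builder::makeConstant(maxBoxPrec, ov::Shape{}, std::vector<int32_t>{maxOutBoxesPerClass});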
template <ov::element::Type_t ET>
bool evaluate(const std::shared_ptr<ov::op::v4::ReduceL1>& op, const ov::HostTensorVector& outputs, const ov::HostTensorVector& inputs) {
    using T = typename ov::element_type_traits<ET>::value_type;
    std::cout << "evaluate ReduceL1" << std::endl;
Please remove std::cout in this file.
Also, I would like to clarify one point about validation. Should we run accuracy validation (at least on subsets on a local machine using the validation scripts) to check that the I64-related changes don't introduce any regressions?
if (!supportedPrimitiveDescriptors.empty())
    return;
Why do we remove this check?
// parallel_for(IH, [&](size_t ih){
for (size_t ih = 0; ih < IH; ih++) {
    size_t oh = ih; GET_PTR_NCDH_PLN;
    reduce_kernel_process(in_ptr_ncdh, out_ptr_ncdh, IW, 1);
});
// });
Should we return parallel loop here?
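If the parallel loop is restored, it would presumably match the commented-out version (a sketch reconstructed from the quoted lines, assuming the loop body stays unchanged and parallel_for is the plugin's usual parallel helper):

parallel_for(IH, [&](size_t ih) {
    size_t oh = ih; GET_PTR_NCDH_PLN;
    reduce_kernel_process(in_ptr_ncdh, out_ptr_ncdh, IW, 1);
});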
if (dst_data_size == 8) {
    auto src_data = reinterpret_cast<const int64_t *>(proc_ptr);
    auto dst_data = reinterpret_cast<int64_t *>(out_ptr);
    parallel_for2d(DIM0, stride1, [&](size_t b, size_t j) {
        auto src_off = b * stride0 + j * DIM1;
        auto dst_off = b * stride0 + j;
        for (size_t dim1 = 0; dim1 < DIM1; dim1++) {
            dst_data[dst_off] = src_data[src_off];
            src_off++;
            dst_off += stride1;
        }
    });
} else if (dst_data_size == 4) {
Can we create a template function for this logic to avoid code duplication?
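One possible shape for such a helper (a sketch, not taken from the PR: the helper name and byte-pointer parameter types are hypothetical, the other identifiers mirror the quoted diff, and parallel_for2d is assumed to be the plugin's existing parallel utility):

// Hypothetical helper: the dst_data_size == 8 and == 4 branches differ only in the
// element type they reinterpret to, so the transposing copy can be templated.
template <typename T>
static void transposedCopy(const uint8_t* proc_ptr, uint8_t* out_ptr,
                           size_t DIM0, size_t DIM1, size_t stride0, size_t stride1) {
    auto src_data = reinterpret_cast<const T*>(proc_ptr);
    auto dst_data = reinterpret_cast<T*>(out_ptr);
    parallel_for2d(DIM0, stride1, [&](size_t b, size_t j) {
        auto src_off = b * stride0 + j * DIM1;
        auto dst_off = b * stride0 + j;
        for (size_t dim1 = 0; dim1 < DIM1; dim1++) {
            dst_data[dst_off] = src_data[src_off];
            src_off++;
            dst_off += stride1;
        }
    });
}

// The call site then reduces to a dispatch on dst_data_size:
if (dst_data_size == 8) {
    transposedCopy<int64_t>(proc_ptr, out_ptr, DIM0, DIM1, stride0, stride1);
} else if (dst_data_size == 4) {
    transposedCopy<int32_t>(proc_ptr, out_ptr, DIM0, DIM1, stride0, stride1);
}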
auto src_data = reinterpret_cast<const int64_t *>(proc_ptr);
auto dst_data = reinterpret_cast<int64_t *>(out_ptr);
parallel_for2d(DIM0, stride1, [&](size_t b, size_t j) {
    auto src_off = b * src_stride0 + j * blockLen;
    auto dst_off = b * dst_stride0 + j;
    for (size_t dim1 = 0; dim1 + blockLen <= DIM1; dim1 += blockLen) {
        for (size_t k = 0; k < blockLen; k++) {
            dst_data[dst_off] = src_data[src_off];
            src_off++;
            dst_off += stride1;
        }
        src_off += (stride1 - 1) * blockLen;
    }
    size_t tail = DIM1 % blockLen;
    for (size_t k = 0; k < tail; k++) {
        dst_data[dst_off] = src_data[src_off];
        src_off++;
        dst_off += stride1;
    }
});
} else if (dst_data_size == 4) {
The same: it would be great if we had some template function for these computations
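The same element-type parameterization would cover this blocked variant as well; a minimal dispatch sketch, assuming a hypothetical blockedTransposedCopy template holding the loop body quoted above:

// Hypothetical dispatch: both branches share one templated implementation.
if (dst_data_size == 8) {
    blockedTransposedCopy<int64_t>(proc_ptr, out_ptr, DIM0, DIM1, src_stride0, dst_stride0, stride1, blockLen);
} else if (dst_data_size == 4) {
    blockedTransposedCopy<int32_t>(proc_ptr, out_ptr, DIM0, DIM1, src_stride0, dst_stride0, stride1, blockLen);
}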
const auto input_prec = getOriginalInputPrecisionAtPort(REDUCE_DATA);
const auto output_prec = getOriginalOutputPrecisionAtPort(0);
Minor:
Suggested change:
const auto input_prec = getOriginalInputPrecisionAtPort(REDUCE_DATA);
const auto output_prec = getOriginalOutputPrecisionAtPort(0);
const auto& input_prec = getOriginalInputPrecisionAtPort(REDUCE_DATA);
const auto& output_prec = getOriginalOutputPrecisionAtPort(0);
// template <x64::cpu_isa_t isa>
// void JitReduceKernel<isa>::reduce_gather(const Vmm& vmm_dst, int64_t offset) {
//     switch (jcp.src_el_type) {
//         case ov::element::f64:
//         case ov::element::i64:
//             if (isa == x64::avx512_core) {
//                 auto ymm_idx = Ymm(v_idx.getIdx());
//                 auto zmm_src = Zmm(v_src.getIdx());
//
//                 kxnorq(k_mask, k_mask, k_mask);
//                 vgatherdpd(zmm_src | k_mask, ptr[reg_src + offset + ymm_idx]);
//                 if (jcp.src_el_type == ov::element::f64 && exec_el_type == ov::element::i64) {
//                     vcvtpd2qq(zmm_src, zmm_src);
//                 } else if (jcp.src_el_type == ov::element::i64 && exec_el_type == ov::element::f64) {
//                     vcvtqq2pd(zmm_src, zmm_src);
//                 }
Why is this code commented out? Should we remove or uncomment it?
// vcmppd(k_mask, vmm_src, v_zero, _cmp_neq_uq);
// vblendmps(vmm_src | k_mask, v_zero, v_ones);
Commented code
// reg_oc_off = getReg64();
// reg_post_ops_data = getReg64();
Commented code
this->postamble();

if (isa == x64::avx512_core) { // Only if bf16?
What does the comment mean? Should we check something before the merge?
    // TODO
}
uni_vdivpd(vmm_dst, vmm_dst, v_divider);
uni_vroundpd(vmm_dst, vmm_dst, 0x3); // Truncation
if (isa == x64::avx512_core) {
    vcvtpd2qq(vmm_dst, vmm_dst);
} else {
    // TODO
Please take a look at the TODO sections
This PR will be closed in a week because of 2 weeks of no activity.
This PR was closed because it has been stalled for 2 weeks with no activity.
Details:
Tickets: