Add additional information to errors when query execution failed #368

dkropachev · 2024-11-18T16:18:14Z

We have had an issue with java-driver reaching the end of the execution plan and throwing NoNodeAvailableException.
The problem is that when a user gets this error, it carries no information besides the fact that the end of the execution plan has been reached.

Most PROD environments have log-rate reduction techniques in place, such as log sampling, filtering, deduplication, and suppression.
Because of that, it is often impossible to figure out why exactly this exception was thrown just by looking at the error message and/or the logs.
This causes extra load on the customer's engineering team as well as on our support and engineering teams.
To mitigate this issue in all the drivers, it is proposed to enrich the error/exception with the following information (pick only what is relevant for the given error):

List of the nodes in the cluster (including their status, dc, rack)
List of connections to the replicas (including host, rack, dc, shard)
List of prior errors (if the query was attempted on one host and then switched to another because of an error, show all of these errors when the end of the execution plan is reached)
History of topology changes (nodes going UP/DOWN, with timestamps)
Replica set information source (tablet/vnode/other)
Node/connection overload status (the status itself, if present, or the number of queries in flight)
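To make the "list of prior errors" and "replica set information source" items more concrete, here is a minimal Java sketch of a context object that collects per-host failures and renders them into an error message. The class `QueryFailureContext`, its methods, and the message layout are all hypothetical and are not part of java-driver; it only illustrates the shape such enrichment could take (Java 16+ for records).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a java-driver type: accumulates diagnostic
// details while a query walks through its execution plan.
public class QueryFailureContext {
    // One failed attempt: which host was tried and what went wrong.
    public record Attempt(String host, String error) {}

    private final List<Attempt> priorErrors = new ArrayList<>();
    private String replicaSource = "unknown"; // e.g. tablet / vnode / other

    public QueryFailureContext recordAttempt(String host, Throwable error) {
        priorErrors.add(new Attempt(
                host, error.getClass().getSimpleName() + ": " + error.getMessage()));
        return this;
    }

    public QueryFailureContext replicaSource(String source) {
        this.replicaSource = source;
        return this;
    }

    // Builds the enriched message: base text, replica source, then one
    // line per prior attempt in the order the hosts were tried.
    public String describe(String baseMessage) {
        StringBuilder sb = new StringBuilder(baseMessage);
        sb.append(" [replica source: ").append(replicaSource).append(']');
        for (Attempt a : priorErrors) {
            sb.append("\n  tried ").append(a.host()).append(" -> ").append(a.error());
        }
        return sb.toString();
    }
}
```

The resulting string could be used as the message of the exception thrown at the end of the execution plan, so the details survive even when logs are sampled or suppressed.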

We can include that information in any query error, or only in specific errors, such as timeouts, empty execution plan errors, end of execution plan errors, or no connections available errors.

While doing that, we should be aware that clusters can potentially have many nodes (>60), so node status information should be reduced by the following logic:

Add status for nodes that are relevant to the query (based on replica set, dc, rack)
Group the status of the rest of the cluster by dc/rack/node status (UP/DOWN)
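The reduction described above could look roughly like the following Java sketch: nodes relevant to the query are listed individually, and everything else is collapsed into counts per dc/rack/status. `NodeStatusSummary`, `NodeInfo`, and the output format are assumptions made for illustration, not driver types (Java 16+ for records).

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch, not a java-driver API.
public class NodeStatusSummary {
    // Minimal node descriptor assumed for this example.
    public record NodeInfo(String host, String dc, String rack, boolean up) {}

    private static String status(NodeInfo n) {
        return n.up() ? "UP" : "DOWN";
    }

    public static String summarize(List<NodeInfo> nodes, Set<String> relevantHosts) {
        StringBuilder sb = new StringBuilder();
        // Nodes relevant to the query are reported one by one.
        for (NodeInfo n : nodes) {
            if (relevantHosts.contains(n.host())) {
                sb.append(n.host()).append(" (").append(n.dc()).append('/')
                  .append(n.rack()).append("): ").append(status(n)).append("; ");
            }
        }
        // The rest of the cluster is grouped by dc/rack/status and counted.
        // A TreeMap keeps the output order deterministic.
        Map<String, Long> grouped = nodes.stream()
                .filter(n -> !relevantHosts.contains(n.host()))
                .collect(Collectors.groupingBy(
                        n -> n.dc() + "/" + n.rack() + "/" + status(n),
                        TreeMap::new,
                        Collectors.counting()));
        grouped.forEach((key, count) ->
                sb.append(key).append(" x").append(count).append("; "));
        return sb.toString().trim();
    }
}
```

With this shape, the size of the summary grows with the number of dc/rack combinations rather than with the number of nodes.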

In order to avoid excessive load, we might want to add throttling logic, say, to include that information only once a minute.
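One possible way to implement the "once a minute" idea is a small lock-free gate that lets a single caller through per interval. The sketch below is an assumption about how this could be done, not existing driver code; `DiagnosticsThrottle` is a hypothetical name, and the clock is injected so the behavior can be tested without waiting.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch: allows at most one "detailed" error per interval.
public class DiagnosticsThrottle {
    private final long intervalNanos;
    private final LongSupplier clock;     // e.g. System::nanoTime
    private final AtomicLong nextAllowed; // earliest time the next pass is allowed

    public DiagnosticsThrottle(long intervalNanos, LongSupplier clock) {
        this.intervalNanos = intervalNanos;
        this.clock = clock;
        this.nextAllowed = new AtomicLong(clock.getAsLong());
    }

    // Returns true if the caller may attach full diagnostics to this error.
    // The compare-and-set ensures only one thread wins per interval.
    public boolean tryAcquire() {
        long now = clock.getAsLong();
        long next = nextAllowed.get();
        return now - next >= 0 && nextAllowed.compareAndSet(next, now + intervalNanos);
    }
}
```

In real use this might be created as `new DiagnosticsThrottle(TimeUnit.MINUTES.toNanos(1), System::nanoTime)`, with errors falling back to the short message whenever `tryAcquire()` returns false.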


roydahan · 2024-11-18T16:25:20Z

We can try to include it in cassandra-stress and enjoy the extra information in QA tests.

dkropachev mentioned this issue Nov 18, 2024

Add more information to NoNodeAvailableException and AllNodesFailedException #350

Draft

roydahan assigned dkropachev Nov 18, 2024
