Add additional information to errors when query execution failed #368

Open
dkropachev opened this issue Nov 18, 2024 · 1 comment
@dkropachev
Collaborator

dkropachev commented Nov 18, 2024

We have had an issue with java-driver reaching the end of the execution plan and throwing NoNodeAvailableException.
The problem is that when a user gets this error, it contains no information beyond the fact that the end of the execution plan has been reached.

Most PROD environments have log-rate reduction techniques in place, such as log sampling, filtering, deduplication, and suppression.
Because of that, it is often impossible to figure out exactly why this exception was thrown just by looking at the error message and/or the logs.
This causes extra load on the customer's engineering team as well as on our support and engineering teams.
To mitigate this issue in all the drivers, it is proposed to enrich the error/exception with the following information (pick only what is relevant for the given error):

  1. List of the nodes in the cluster (including their status, dc, rack)
  2. List of connections to the replicas (including host, rack, dc, shard)
  3. List of prior errors (if the query was attempted on one host and then switched to another because of an error, show all of these errors when the end of the execution plan is reached)
  4. History of topology changes (nodes going UP/DOWN, with timestamps)
  5. Replica set information source (tablet/vnode/other)
  6. Node/connection overload status (the status itself, if present, or queries in flight)
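As a rough illustration of items 3 and 5, here is a minimal sketch of an exception that carries the prior per-host errors and the replica set information source in its message. `AttemptError` and `EnrichedQueryException` are hypothetical names invented for this example; they are not existing driver classes, and the message layout is only an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical type: one failed attempt of the query on a given host.
final class AttemptError {
    final String host;   // node the attempt was sent to
    final String error;  // short description of why the attempt failed

    AttemptError(String host, String error) {
        this.host = host;
        this.error = error;
    }
}

// Hypothetical exception: NOT part of the driver API, just a sketch.
final class EnrichedQueryException extends RuntimeException {
    private final List<AttemptError> priorErrors;

    EnrichedQueryException(String message, List<AttemptError> priorErrors, String replicaSource) {
        super(buildMessage(message, priorErrors, replicaSource));
        this.priorErrors = new ArrayList<>(priorErrors);
    }

    // Puts the diagnostics into the message itself, so they survive
    // even when the surrounding log lines are sampled or suppressed.
    static String buildMessage(String message, List<AttemptError> priorErrors, String replicaSource) {
        StringBuilder sb = new StringBuilder(message);
        sb.append(" [replica source: ").append(replicaSource).append("]");
        for (AttemptError e : priorErrors) {
            sb.append("; tried ").append(e.host).append(": ").append(e.error);
        }
        return sb.toString();
    }

    List<AttemptError> getPriorErrors() {
        return Collections.unmodifiableList(priorErrors);
    }
}
```

Keeping the details in the exception message (rather than only in log lines) is the point of the proposal: the message is the one thing the user is guaranteed to see.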

We can include that information in any query error, or only in specific errors, such as timeouts, empty execution plan errors, end of execution plan errors, or no connections available errors.

While doing that, we should be aware that clusters could potentially have many nodes (>60), and therefore node status information should be reduced using the following logic:

  1. Add status for nodes that are relevant to the query (based on replica set, dc, rack)
  2. Group the status of the rest of the cluster by dc/rack/node-status (UP/DOWN)
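The two reduction rules above could be sketched roughly as follows. `NodeInfo` and `NodeStatusSummary` are hypothetical names for this example, and relevance is simplified here to "host is in a given set"; a real driver would derive that set from the replica set, dc, and rack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical, simplified view of a cluster node.
final class NodeInfo {
    final String host;
    final String dc;
    final String rack;
    final boolean up;

    NodeInfo(String host, String dc, String rack, boolean up) {
        this.host = host;
        this.dc = dc;
        this.rack = rack;
        this.up = up;
    }
}

// Hypothetical helper: NOT part of the driver API, just a sketch.
final class NodeStatusSummary {
    // Lists query-relevant nodes one by one and collapses all other
    // nodes into counts keyed by dc/rack/status.
    static String summarize(List<NodeInfo> nodes, Set<String> relevantHosts) {
        List<String> detailed = new ArrayList<>();
        Map<String, Integer> grouped = new TreeMap<>(); // sorted keys give stable output
        for (NodeInfo n : nodes) {
            String status = n.up ? "UP" : "DOWN";
            if (relevantHosts.contains(n.host)) {
                detailed.add(n.host + "(" + n.dc + "/" + n.rack + ")=" + status);
            } else {
                grouped.merge(n.dc + "/" + n.rack + "/" + status, 1, Integer::sum);
            }
        }
        return "replicas: " + detailed + "; others: " + grouped;
    }
}
```

With this shape, the size of the summary grows with the number of replicas and distinct dc/rack/status combinations, not with the total number of nodes in the cluster.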

In order to avoid excessive load, we might want to have throttling logic, say, to include that information only once a minute.
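One possible shape for such throttling is a small gate that lets the full diagnostics through at most once per interval. `DiagnosticsThrottle` is a hypothetical name for this sketch; the current time is passed in as a parameter (for example from `System.currentTimeMillis()`) so the logic is easy to test.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: NOT part of the driver API, just a sketch.
final class DiagnosticsThrottle {
    private final long intervalMillis;
    // Earliest time at which diagnostics may be attached again.
    // Long.MIN_VALUE means "never attached yet", so the first call passes.
    private final AtomicLong nextAllowedMillis = new AtomicLong(Long.MIN_VALUE);

    DiagnosticsThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Returns true if the caller should attach full diagnostics now.
    // The compare-and-set ensures only one of several concurrent
    // callers wins the slot for a given interval.
    boolean shouldAttach(long nowMillis) {
        long next = nextAllowedMillis.get();
        if (nowMillis < next) {
            return false;
        }
        return nextAllowedMillis.compareAndSet(next, nowMillis + intervalMillis);
    }
}
```

A caller would create one instance with `60_000` as the interval and fall back to the short error message whenever `shouldAttach` returns false.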

@roydahan
Collaborator

We can try to include it in cassandra-stress and enjoy the extra information in QA tests.
