Numerous test failures for Nav2 when using rmw_connextdds #21
Hi @ruffsl, thank you for this report; I will try to reproduce locally and start investigating. Meanwhile, a couple of comments:
These warnings are printed by a utility function in the RMW implementation. One solution would be to downgrade the warning there.
I wonder if this is not the case in the ROS 2 API, in which case there is probably some more work to be done to perform a proper conversion of the values involved.
This is something that I've seen show up before in some unit test logs, but I haven't been able to understand/resolve yet. I think it has to be related to this call in the implementation.
I personally think the error comes from here, and it's being triggered because an object being deleted (a "publisher" or a "subscriber", maybe owned by a "client" or "service", or maybe a "guard condition") is still marked as attached to a waitset that is currently in the midst of beginning a wait.

Now, if you think the deletion is legitimate (i.e. whatever object is being deleted is not currently associated with a waitset), then it might be that there is a problem in the current implementation of the WaitSet.
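To illustrate the invariant being enforced, here is a minimal user-level sketch using rclcpp's WaitSet API (a hypothetical example for illustration, not the rmw_connextdds internals): an entity must be detached from any waitset before it is destroyed, otherwise its deletion can race with a wait that is about to begin.

```cpp
// Minimal sketch (hypothetical, not rmw_connextdds code): detach an entity
// from the wait set before destroying it.
#include <chrono>
#include "rclcpp/rclcpp.hpp"
#include "rclcpp/wait_set.hpp"
#include "std_msgs/msg/string.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("waitset_demo");
  auto sub = node->create_subscription<std_msgs::msg::String>(
    "chatter", 10, [](const std_msgs::msg::String::SharedPtr) {});

  rclcpp::WaitSet wait_set;
  wait_set.add_subscription(sub);

  // Wait briefly for data (result intentionally ignored in this sketch).
  (void)wait_set.wait(std::chrono::milliseconds(100));

  // Detach first: destroying `sub` while it is still attached is the kind
  // of situation the error described above is guarding against.
  wait_set.remove_subscription(sub);
  sub.reset();

  rclcpp::shutdown();
  return 0;
}
```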
I think I may have found the reason for that error.
I'm still investigating, but one thing I've noticed concerns the default WaitSet implementation used by rmw_connextdds. I'm currently focusing on that code path.
I've been doing some extensive testing locally (by building a "minimal" Rolling tree and the nav2 stack in Debug mode), and created #22 to address some of the issues that I was able to identify. I have been running tests from the nav2 stack. In general, the failures seem to me to be mostly related to "slow discovery" and "slow reliability" (more on both below).
There is also the possibility for these tests to run into a discovery race condition that exists for all "volatile" ROS 2 clients: fundamentally, a client may detect a service as available and send a request before the service's request reader has matched the client's request writer, in which case the request is dropped and, with "volatile" durability, never repaired.

There is also another potential race condition, which may lead to a reply not being sent by a service because the local writer had not yet matched the client's reader when the reply was written. This is addressed by the DDS-RPC specification by prescribing that the "reply writer" always be matched by an implementation before its associated "request reader" when matching the endpoints of a remote client, but this requires the propagation of additional information via discovery, and, as far as I know, no DDS vendor implements this behavior yet.

The solution to the first race condition is, imho, simple: always use at least "transient local" durability for clients (a client-side mitigation is also sketched at the end of this comment). But maybe I'm not considering some adverse effect of that decision. Anyway, these race conditions should only be relevant for applications which create and delete clients very often, since that triggers discovery and matching. This shouldn't be the case for a "typical" DDS application (because endpoint creation/deletion is an expensive operation, with system-wide consequences because of discovery, and should not be carried out continuously if possible), but I'm not sure the same is true for a ROS 2 application making extensive use of actions, like the nav2 stack.

I was able to confirm that the issues with "slow discovery" and "slow reliability" can be mitigated by tuning the QoS via XML (since the required parameters are not exposed by ROS). Sending the initial participant announcements more frequently makes components able to discover each other more quickly, while shortening the "heartbeat period" will help make sure that writers are quicker to repair missed samples. I did some experiments running with the XML configuration below, saved as a QoS profile file loaded by Connext.

At this point, I'm validating the changes in #22 and look forward to having them merged. After that, I guess we'll run them through another set of your CI and see what comes out of it.
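As a concrete illustration of the client-side mitigation mentioned above, here is a minimal, hypothetical rclcpp sketch (not code from this repository; the service name and type are made up for the example). Waiting for the service to be discovered narrows, but does not by itself fully close, the matching race:

```cpp
// Hypothetical sketch: wait until the service is visible before sending a
// request, to reduce the volatile-client discovery race described above.
#include <chrono>
#include <memory>
#include "rclcpp/rclcpp.hpp"
#include "std_srvs/srv/trigger.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("client_node");
  auto client = node->create_client<std_srvs::srv::Trigger>("my_service");

  // Block until the service is discovered (or give up after 5 seconds).
  if (!client->wait_for_service(std::chrono::seconds(5))) {
    RCLCPP_ERROR(node->get_logger(), "service not available");
    rclcpp::shutdown();
    return 1;
  }

  auto request = std::make_shared<std_srvs::srv::Trigger::Request>();
  auto future = client->async_send_request(request);
  if (rclcpp::spin_until_future_complete(node, future) ==
    rclcpp::FutureReturnCode::SUCCESS)
  {
    RCLCPP_INFO(node->get_logger(), "got reply: %s",
      future.get()->message.c_str());
  }
  rclcpp::shutdown();
  return 0;
}
```

The XML configuration used in the experiments follows.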
```xml
<?xml version="1.0" encoding="UTF-8"?>
<dds xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://community.rti.com/schema/6.0.1/rti_dds_profiles.xsd" version="6.0.1">
<!-- Qos Library -->
<qos_library name="Issue21">
<qos_profile name="BaseProfile" base_name="BuiltinQosLib::Generic.Common" is_default_qos="true">
<participant_qos>
<discovery_config>
<initial_participant_announcements>10</initial_participant_announcements>
<min_initial_participant_announcement_period>
<sec>0</sec>
<nanosec>200000000</nanosec>
</min_initial_participant_announcement_period>
<max_initial_participant_announcement_period>
<sec>0</sec>
<nanosec>400000000</nanosec>
</max_initial_participant_announcement_period>
</discovery_config>
</participant_qos>
<datareader_qos>
<protocol>
<rtps_reliable_reader>
<min_heartbeat_response_delay>
<sec>0</sec>
<nanosec>0</nanosec>
</min_heartbeat_response_delay>
<max_heartbeat_response_delay>
<sec>0</sec>
<nanosec>0</nanosec>
</max_heartbeat_response_delay>
</rtps_reliable_reader>
</protocol>
</datareader_qos>
<datawriter_qos>
<protocol>
<rtps_reliable_writer>
<low_watermark>0</low_watermark>
<high_watermark>2</high_watermark>
<heartbeats_per_max_samples>100</heartbeats_per_max_samples>
<min_send_window_size>10</min_send_window_size>
<max_send_window_size>10</max_send_window_size>
<max_heartbeat_retries>500</max_heartbeat_retries>
<heartbeat_period>
<sec>0</sec>
<nanosec>10000000</nanosec>
</heartbeat_period>
<fast_heartbeat_period>
<sec>0</sec>
<nanosec>500000</nanosec>
</fast_heartbeat_period>
<late_joiner_heartbeat_period>
<sec>0</sec>
<nanosec>500000</nanosec>
</late_joiner_heartbeat_period>
<max_nack_response_delay>
<sec>0</sec>
<nanosec>0</nanosec>
</max_nack_response_delay>
</rtps_reliable_writer>
</protocol>
</datawriter_qos>
</qos_profile>
</qos_library>
</dds>
```
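For reference (this is standard RTI Connext behavior, not something stated in this thread): Connext loads such a file automatically if it is saved as USER_QOS_PROFILES.xml in the application's working directory, and the NDDS_QOS_PROFILES environment variable can be used to point it at a profile file in an arbitrary location.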
I forgot to mention that the issue with the "typesupport" error is caused by an error message in rosidl_typesupport. I had opened a PR to remove it (rosidl_typesupport#107) but closed it for now, because it seems like @clalancette already has one (draft) open to address the same issue (rosidl_typesupport#102).

The other one about "saturated time" will be removed by the alternative WaitSet implementation. It is nevertheless curious, because I kept seeing it pop up with the other implementation even after re-implementing the logic to convert time values.
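For context on what a "saturated" time conversion looks like, here is a hypothetical sketch (not the actual rmw_connextdds code; the struct only mirrors the shape of DDS_Duration_t, and the "infinite" sentinel values are assumptions for illustration): a ROS duration wider than what a DDS duration can represent is clamped instead of overflowing.

```cpp
// Hypothetical sketch of a saturating time conversion.
#include <cstdint>
#include <limits>

struct DdsDuration
{
  int32_t sec;      // mirrors DDS_Duration_t::sec
  uint32_t nanosec; // mirrors DDS_Duration_t::nanosec
};

DdsDuration to_dds_duration(uint64_t sec, uint64_t nsec)
{
  // Normalize nanoseconds into the [0, 1e9) range first.
  sec += nsec / 1000000000u;
  nsec = nsec % 1000000000u;
  if (sec > static_cast<uint64_t>(std::numeric_limits<int32_t>::max())) {
    // Out of range: saturate to an (assumed) "infinite" sentinel instead of
    // overflowing; real code would use the vendor's DDS_DURATION_INFINITE.
    return {std::numeric_limits<int32_t>::max(),
      static_cast<uint32_t>(std::numeric_limits<int32_t>::max())};
  }
  return {static_cast<int32_t>(sec), static_cast<uint32_t>(nsec)};
}
```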
* Resolve issues identified while investigating #21 (#22)
* Use C++ std WaitSets by default
* Use Rolling in README's Quick Start
* Improved implementation of client::is_service_available for Connext Pro
* Only add request header to typecode with Basic req/rep profile
* Remove commented/unused code
* Avoid topic name validation in get_info functions
* Reduce shutdown period to 10ms
* Pass HistoryQosPolicy to graph cache
* Reset error string after looking up type support
* Remove DDS-based WaitSet implementation
* Remove unavailable time helpers
* Remove unsupported QoS event
@ruffsl we merged #22 to address some of the issues that you reported. The commit should have been included in the latest nightly, but I'm not sure if it was picked up by the CI. I doubt things will ever work "out of the box" with the default reliability protocol settings, so I doubt the results will get much greener, but I would appreciate some feedback on the latest modifications. It would be helpful if you (or anyone with deeper knowledge of the test suite) could take another look at the failures. Looking forward to any input you may have on the matter. I'd be happy to continue to work towards getting all tests to pass, but at this point I'm not sure how to proceed to squash the remaining issues.
With the cron job cadence between ci.ros2.org and the building of the upstream images, it may take a little while for the changes to propagate.
Thanks for all the contributions thus far. Once the changes have propagated and we have some new nightly job results to inspect, I'll dig through the test logs again to see what may be performing differently between tests within the matrix of RMW jobs.
@ruffsl looks like there were some clear improvements ✌🏼 https://app.circleci.com/pipelines/github/ros-planning/navigation2/5182/workflows/6c36e959-ed6c-4c22-ae1d-e07ac1418e13/jobs/19097
Indeed, from 15 failures down to just 5; a great improvement. Skimming through the error logs, it seems like there may be some liveliness issues with bond, like missed heartbeats, as well as dropped messages, like those for configuring the costmaps.

@SteveMacenski, are there any current tickets that discuss these symptoms, or do they not show up with other RMWs? Any suggestions on narrowing down the cause?
I have never seen this with any other middleware. I think if we did, I'd be flooded with issues being submitted about it from users (as we have many using both Fast-DDS and Cyclone). Over the ~1 year the bond connections have been in Nav2, I have yet to see them ever trigger incorrectly.
These might have something to do with delays in discovery caused by the default participant announcement settings (5 initial announcements, sent once every second, and then one every 30s; see the API docs for the relevant QoS). They may also be related to the slow reliability protocol settings that Connext uses by default, which only send an RTPS Heartbeat every 3s and are definitely suboptimal (see #26). I suspect the latter to be more relevant to these issues, since it affects "user endpoints", and because I think (and please correct me if I'm wrong) that these "heartbeat" and "bring up" phases are application-level events (i.e. implemented on top of ROS 2 entities by nav2/bond).
I have seen this error occur even with faster reliability settings, and I haven't been able to determine the exact reason for it. I have also seen it fail with other RMW implementations on my local machine. I'm not sure what the symptom of "plan intersects with keepout zone" says about underlying issues related to middleware entities (e.g. a data-reader might not be receiving data in time? or some reply from a service is lost...). Also, there might be multiple processes responsible for avoiding that error, but I'm not sure which ones these would be among the ones created by the test, so it would help me focus my investigations if you could tell me a bit more about why this error would show up in general, and whether you have any more insight on this specific failure (e.g. if you can pinpoint some clearly misbehaving component from the logs).
Yes!! P.S.: |
Because I ported it: the DDS equivalents weren't exposed yet.
System Info
rti-connext-dds-5.3.1/focal,now 5.3.1-nc.x64Linux3gcc5.4.0+2 amd64
Bug Description
Numerous failures in Nav2 when testing with rmw_connextdds that don't occur with other RMWs. To list a few, see the test failure logs for the complete error messages:
https://app.circleci.com/pipelines/github/ros-planning/navigation2/5045/workflows/062fb8a4-bf43-4bbe-8dbe-4cf54a698abc/jobs/18610
Expected Behavior
No test failures encountered, just like with other alternative RMWs.
How to Reproduce
Workarounds
Switching to alternative RMWs such as rmw_fastrtps_cpp or rmw_cyclonedds_cpp.

Additional context
With the recent release announcement, the nav2 project has re-enabled rmw_connextdds in its RMW matrix of nightly tests.