CPU spikes of existing nodes when starting new node #741

Open
tonynajjar opened this issue Feb 16, 2024 · 18 comments

tonynajjar commented Feb 16, 2024

Bug report

Required Info:

  • Operating System:

    • Ubuntu 22.04
  • Installation type:

    • binaries
  • Version or commit hash:

    • Humble
    • ros-humble-fastrtps/now 2.6.7-1jammy.20240125.204216 amd64 [installed,local]
    • ros-humble-rmw-fastrtps-cpp/now 6.2.6-1jammy.20240125.215950 amd64 [installed,local]
  • DDS implementation:

    • FastDDS

Steps to reproduce issue

  1. Use the default XML configuration (FASTRTPS_DEFAULT_PROFILES_FILE not set)
  2. Have the robot bringup running with about 70 nodes, consuming a decent amount of CPU (across several Docker containers sharing network and IPC with the host, if that is relevant)
  3. Launch some node on the side (e.g. teleop, ros2 topic echo, etc.)
  4. Witness a CPU spike

Expected behavior

No considerable CPU spike for the existing nodes

Actual behavior

CPU spikes for a few seconds for all the nodes, to about double their consumption! I'm guessing it has to do with discovery?

Additional information

I quickly tried Cyclone DDS and did not see the CPU spike, but I would like to fix this with Fast DDS if possible (otherwise I will have to switch)

@tonynajjar changed the title from "CPU spikes when starting new node" to "CPU spikes of existing nodes when starting new node" on Feb 16, 2024
tonynajjar (Author) commented Feb 16, 2024

With some experimentation I also noticed that the more existing nodes there are, the higher the CPU rise when an extra node is added to the network.

@fujitatomoya (Collaborator)

@tonynajjar thanks for creating the issue. we have been running into a similar situation...

a couple of things,

  • Initial Announcements can be related to the CPU usage spike during the discovery process. Depending on network resources and reliability, and on your discovery-latency requirements, tuning this could mitigate the CPU spike during the initial discovery phase. (i believe this setting is also applied to Endpoint Discovery.)
  • Do you happen to use ROS 2 Security Enclaves? Enabling security adds more work, such as handshaking, during the discovery process.

I am not sure whether the ROS 2 Fast-DDS Discovery Server is an option for you, since it changes the architecture; but if it is acceptable, it will reduce the discovery cost significantly.
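
For reference, a minimal sketch of trying the Discovery Server on Humble (the server id, address, and port below are the documented defaults; adjust as needed):

# Terminal 1: start a discovery server (listens on port 11811 by default)
fastdds discovery --server-id 0

# Every other terminal/container: point participants at the server and
# restart the ROS 2 daemon so the CLI tools pick up the new setting
export ROS_DISCOVERY_SERVER=127.0.0.1:11811
ros2 daemon stop
ros2 run demo_nodes_cpp talker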

@fujitatomoya (Collaborator)

CC: @MiguelCompany @EduPonz

tonynajjar (Author) commented Feb 17, 2024

Thanks for your answer @fujitatomoya.

  • We do not use security enclaves.

  • I did quickly try to make the Discovery Server work, to confirm this is a discovery issue, but failed for some reason; maybe because of my Docker setup, I am not sure. In any case, in the long run I'd like to avoid the Discovery Server (no strong reason, but it feels like going back to the centralized ROS 1 approach, which was criticized and changed in ROS 2).

  • Regarding the Initial Announcements, are you proposing testing out something with the config? I'm really not a DDS configuration expert (like most roboticists), so you'll have to spell it out for me 😅

tonynajjar (Author) commented Feb 17, 2024

Depending on network resources and reliability, and on your discovery-latency requirements

Your comment reminded me to clarify that all the nodes are running on one machine, so I guess the issue can't be caused by a suboptimal network.

@fujitatomoya (Collaborator)

Regarding the Initial Announcements, are you proposing testing out something with the config?

i think you can create DEFAULT_FASTRTPS_PROFILES.xml in the running directory where you issue ros2 run xxx, and it should be loaded by Fast-DDS. (below, the initial announcement count is changed from 5 to 1 and the period from 100 msec to 500 msec.)

<?xml version="1.0" encoding="UTF-8" ?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <participant profile_name="participant_profile_simple_discovery" is_default_profile="true">
        <rtps>
            <builtin>
                <discovery_config>
                    <initialAnnouncements>
                        <count>1</count>
                        <period>
                            <nanosec>500000000</nanosec>
                        </period>
                    </initialAnnouncements>
                </discovery_config>
            </builtin>
        </rtps>
    </participant>
</profiles>
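
If the running-directory approach is inconvenient, pointing the FASTRTPS_DEFAULT_PROFILES_FILE environment variable (mentioned in the report above) at the file should also work; the path below is a placeholder:

export FASTRTPS_DEFAULT_PROFILES_FILE=/path/to/DEFAULT_FASTRTPS_PROFILES.xml
ros2 run demo_nodes_cpp talker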

my expectation here is,

  • with the existing 70 ROS 2 contexts (70 participants), a new ROS 2 node (context) sends its initial discovery packet 5 times at 100 msec intervals by default, and each of the 70 receivers answers every packet with its own participant information, i.e. roughly 5 x 70 = 350 exchanges within half a second. this could generate the CPU usage spike. (reliable, low-latency discovery, but expensive?)
  • if we have all nodes on localhost, the network is reliable enough, so we could send just a single-shot initial announcement for each participant?

anyway, i would like to have eProsima's opinion on this.
hopefully this helps,

EduPonz commented Feb 18, 2024

Thanks @fujitatomoya, this is indeed what I would have suggested to try out as well. Please @tonynajjar do let us know how it goes.

@tonynajjar (Author)

Thank you for your recommendation. Unfortunately it did not work; all the nodes on my localhost network are running with this configuration:

<?xml version="1.0" encoding="UTF-8" ?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <participant profile_name="participant_profile_simple_discovery" is_default_profile="true">
        <rtps>
            <builtin>
                <discovery_config>
                    <initialAnnouncements>
                        <count>1</count>
                        <period>
                            <nanosec>500000000</nanosec>
                        </period>
                    </initialAnnouncements>
                </discovery_config>
            </builtin>
        </rtps>
    </participant>
</profiles>

I still get the CPU spike

@fujitatomoya (Collaborator)

@tonynajjar i am curious, what command did you use for this verification? e.g. ros2 topic xxx without the daemon running?

@tonynajjar (Author)

@tonynajjar i am curious, what command did you use for this verification? e.g. ros2 topic xxx without the daemon running?

I just started some custom teleop node, but I think ros2 topic echo xxx would also cause the spike; it has in the past.

tonynajjar (Author) commented Feb 26, 2024

Any alternative solutions I could try? Could one of the maintainers try to reproduce this, so that we at least know for sure it is not a local/configuration issue? If we can confirm it, I think this bug deserves some high-priority attention: for applications already reaching the limits of CPU consumption, it would be a deal breaker for using Fast DDS.

@fujitatomoya (Collaborator)

@tonynajjar
CC: @EduPonz

I still get the CPU spike

i think there is still a spike after the configuration is applied, but i would expect the spike period to be mitigated and CPU consumption to come down quicker than before. if you are seeing no difference at all, maybe the configuration is not applied; make sure that DEFAULT_FASTRTPS_PROFILES.xml is in the running directory where you issue ros2 run xxx.

something else i would try is disabling the shared memory transport.
our experience tells us that shared memory transport provides good performance and latency, but uses more CPU resources in the application; if it is disabled, the load shifts to the network interface instead.

<?xml version="1.0" encoding="UTF-8" ?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <transport_descriptors>
        <transport_descriptor>
            <transport_id>udp_transport</transport_id>
            <type>UDPv4</type>
        </transport_descriptor>
    </transport_descriptors>

    <participant profile_name="UDPParticipant" is_default_profile="true">
        <rtps>
            <userTransports>
                <transport_id>udp_transport</transport_id>
            </userTransports>
            <useBuiltinTransports>false</useBuiltinTransports>
        </rtps>
    </participant>
</profiles>

if nothing above works, that is out of my league...

@tonynajjar
Copy link
Author

Thank you for your answer. I'm pretty sure the configuration was applied; I made sure by introducing a typo and seeing errors when launching the nodes.
I didn't see much difference. Maybe I didn't look in great detail, but even if the spike goes away quicker than before, having it in the first place is not acceptable for my application.

Regarding disabling shared memory, I think I already tried that, but I can't remember for sure; I'll give it another shot in the next few days.

I'd appreciate it if someone could try reproducing this. I'll try to create a minimal reproducible launch file, e.g. launching 40 talkers and 40 listeners.

tonynajjar (Author) commented Mar 6, 2024

from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    # Initialize an empty list to hold all the nodes
    nodes = []

    # Define the number of talkers and listeners
    num = 40

    # Create talker nodes
    for i in range(num):
        talker_node = Node(
            package='demo_nodes_cpp',
            executable='talker',
            namespace='talker_' + str(i),  # Use namespace to avoid conflicts
            name='talker_' + str(i)
        )
        nodes.append(talker_node)

    # Create listener nodes
    for i in range(num):
        listener_node = Node(
            package='demo_nodes_cpp',
            executable='listener',
            namespace='listener_' + str(i),  # Use namespace to avoid conflicts
            name='listener_' + str(i),
            remappings=[
                (f"/listener_{str(i)}/chatter", f"/talker_{str(i)}/chatter"),
            ],
        )
        nodes.append(listener_node)

    # Create the launch description with all the nodes
    return LaunchDescription(nodes)

Here is a launch file for you to reproduce the issue. After launching it, run ros2 run demo_nodes_cpp listener in another terminal and watch in htop as the CPU usage of all nodes gets multiplied by 2-3; a monitoring one-liner is sketched below.
Because the initial CPU usage of these nodes is small, the jump is not that noticeable here, but from what I tested earlier it scales when the initial CPU usage is already high.
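
One way to watch the per-process jump, as a sketch using standard Linux tools (assuming the demo binaries are the only processes matching the demo_nodes_cpp pattern):

# refresh a per-process CPU table once per second
watch -n 1 'ps -o pid,pcpu,comm -p "$(pgrep -d, -f demo_nodes_cpp)"'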

tonynajjar (Author) commented Apr 22, 2024

@fujitatomoya or @EduPonz, were you able to reproduce the issue with the example I provided? It would already be useful if I could confirm whether this is a bug or a suboptimal configuration on my side.

@fujitatomoya (Collaborator)

@tonynajjar sorry for taking so long to get back to you. we know about this situation. i did not use your example, but having more than 100 nodes generates a CPU spike for a few seconds; as we already know, this is because of participant discovery.

i am not sure any other configuration would mitigate this transient CPU load...

bochen87 commented Oct 2, 2024

We have the same issue. We use SHM since we have components in a container exchanging large point cloud data, and it seems to perform better and more efficiently that way. However, if we launch other nodes later on (for example debug tools, UI, etc.), it causes huge CPU spikes: timings go off, heartbeats die, and the software goes into an error state as a result. It would be good to have a solution for this.

Mario-DL (Contributor) commented Oct 8, 2024

Hi @tonynajjar,

We were wondering if the CPU usage spike could be related to synchronously waiting on sockets while sending the data buffers. Would it be possible for you to test with the following configuration?
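
(The configuration itself is not shown above; presumably it concerns Fast DDS's asynchronous publish mode, which moves the socket send off the user thread. The profile below is a sketch of that idea, an assumption rather than the actual suggestion:)

<?xml version="1.0" encoding="UTF-8" ?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <!-- assumption: asynchronous publishing hands the send off to a
         background thread instead of blocking the user thread on the socket -->
    <publisher profile_name="async_publisher_profile" is_default_profile="true">
        <qos>
            <publishMode>
                <kind>ASYNCHRONOUS</kind>
            </publishMode>
        </qos>
    </publisher>
</profiles>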

Thanks in advance
