Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows 10 shared memory related crash/hang #713

Open
headlee opened this issue Sep 21, 2023 · 6 comments
Open

Windows 10 shared memory related crash/hang #713

headlee opened this issue Sep 21, 2023 · 6 comments
Labels
more-information-needed Further information is required

Comments

@headlee
Copy link

headlee commented Sep 21, 2023

Bug report

Required Info:

  • Operating System: Windows 10

  • Installation type: Binaries

  • Version or commit hash: Humble (Patch Release 4, 43083e5)

  • DDS implementation: Fast-RTPS

  • Client library (if applicable): rclcpp

Steps to reproduce issue

Sometimes when we run our launchfile one or more of our nodes will immediately die or seemingly hang. It seems to be caused by using Fast-RTPS shared memory on Windows...we are able to get around some of these issues by doing fastdds.bat shm clean but today we got the error pasted below and aren't sure how to proceed.

If anyone has any advice to make this reproducible and/or knows how to fix it that would be great. We're hoping that a new Windows Humble patch release with Fast-RTPS/Fast-DDS version bumps (which should solve this) could also help us out, but any advice in the meantime is appreciated!

Additional information

Error message:

[robot_controller.EXE-2] A subdirectory or file C:\ProgramData\eprosima\fastrtps_interprocess already exists.
[robot_controller.EXE-2] 
[robot_controller.EXE-2] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[robot_controller.EXE-2] This error state is being overwritten:
[robot_controller.EXE-2] 
[robot_controller.EXE-2]   '__create_participant failed to create participant, at C:\ci\ws\src\ros2\rmw_fastrtps\rmw_fastrtps_shared_cpp\src\participant.cpp:105, at C:\ci\ws\src\ros2\rcl\rcl\src\rcl\node.c:264'
[robot_controller.EXE-2] 
[robot_controller.EXE-2] with this new error message:
[robot_controller.EXE-2] 
[robot_controller.EXE-2]   'rcl node's rmw handle is invalid, at C:\ci\ws\src\ros2\rcl\rcl\src\rcl\node.c:416'
[robot_controller.EXE-2] 
[robot_controller.EXE-2] rcutils_reset_error() should be called after error handling to avoid this.
[robot_controller.EXE-2] <<<
[robot_controller.EXE-2] [ERROR] [1695302197.677919300] [rcl]: Failed to fini publisher for node: 1
[ERROR] [robot_controller.EXE-2]: process has died [pid 2704, exit code 3221226505, cmd 'C:\ws\resources\ros2_backend\ros_nodes\install\lib\robodk_yaskawa\robot_controller.EXE --ros-args -r __node:=robot_controller --params-file C:\Users\daddy\AppData\Local\Temp\launch_params_a68uie0v --params-file C:\Users\daddy\AppData\Local\Temp\launch_params_c4v5b50e --params-file C:\Users\daddy\AppData\Local\Temp\launch_params_4v7q3bh_'].

Fast-RTPS profile:

<?xml version="1.0" encoding="UTF-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
  <profiles>
    
    <transport_descriptors>
      
      <!-- We still want the default UDP transport to be present -->
      <transport_descriptor>
        <transport_id>udp_transport</transport_id>
        <type>UDPv4</type>
      </transport_descriptor>
      
      <!-- We want a customized version of the SHM transport -->
      <transport_descriptor>
        <transport_id>shm_transport</transport_id>
        <type>SHM</type>
        <!--
            Configure the size in bytes of the shared memory segment.
        -->
        <segment_size>100000000</segment_size>
        <!--
            The following value should be lower than or equal to segment_size.
            The default value is shown here.
        -->
        <maxMessageSize>100000000</maxMessageSize>
      </transport_descriptor>

    </transport_descriptors>

    
    <!-- Default publisher profile -->
    <publisher profile_name="default publisher profile" is_default_profile="true">
      <qos>
        <publishMode>
          <kind>ASYNCHRONOUS</kind>
        </publishMode>
      </qos>
      

      
      <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
    </publisher>
    <participant profile_name="default publisher profile" is_default_profile="true">
      <rtps>
        <!--
            As customized versions of the default transports are used, the
            built-in ones should be disabled
        -->
        <useBuiltinTransports>false</useBuiltinTransports>

        <!-- Link the Transport Layer to the Participant -->
        <userTransports>
          <transport_id>shm_transport</transport_id>
        </userTransports>
      </rtps>
    </participant>
    

    <!-- Default subscriber profile -->
    <subscriber profile_name="default subscriber profile" is_default_profile="true">
      <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
    </subscriber>
    <participant profile_name="default subscriber profile" is_default_profile="true">
      <rtps>
        <!--
            As customized versions of the default transports are used, the
            built-in ones should be disabled
        -->
        <useBuiltinTransports>false</useBuiltinTransports>

        <!-- Link the Transport Layer to the Participant -->
        <userTransports>
          <transport_id>shm_transport</transport_id>
        </userTransports>
      </rtps>
    </participant>

    <!-- Request subscriber profile for services -->
    <subscriber profile_name="service">
      <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
    </subscriber>
    <participant profile_name="service">
      <rtps>
        <!--
            As customized versions of the default transports are used, the
            built-in ones should be disabled
        -->
        <useBuiltinTransports>false</useBuiltinTransports>

        <!-- Link the Transport Layer to the Participant -->
        <userTransports>
          <transport_id>shm_transport</transport_id>
        </userTransports>
      </rtps>
    </participant>

    <!-- Request publisher profile for clients -->
    <publisher profile_name="client">
      <qos>
        <publishMode>
          <kind>ASYNCHRONOUS</kind>
        </publishMode>
      </qos>
      <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
    </publisher>
    <participant profile_name="client">
      <rtps>
        <!--
            As customized versions of the default transports are used, the
            built-in ones should be disabled
        -->
        <useBuiltinTransports>false</useBuiltinTransports>

        <!-- Link the Transport Layer to the Participant -->
        <userTransports>
          <transport_id>shm_transport</transport_id>
        </userTransports>
      </rtps>
    </participant>
  </profiles>
</dds>
@MiguelCompany
Copy link
Collaborator

There is a humble patch release pending to update Fast DDS to v2.6.6, which has several improvements related to SHM and resilience to crashing nodes.

@clalancette
Copy link
Contributor

@headlee We've now released Humble Patch release 5: https://github.com/ros2/ros2/releases/tag/release-humble-20230925 . Can you check to see if the problem you reported is fixed? Thanks.

@clalancette clalancette added the more-information-needed Further information is required label Sep 28, 2023
@headlee
Copy link
Author

headlee commented Oct 26, 2023

The new patch release has definitely helped. We still see some intermittent issues but it's unclear at this point if they are due to fastrtps, ros daemon on Windows, or something else.

Is there any downside to us deleting the contents of the fastrtps_interprocess folder (C:\ProgramData\eprosima\fastrtps_interprocess) before every launch of our app? That has fixed issues in the past for us.

@MiguelCompany
Copy link
Collaborator

Is there any downside to us deleting the contents of the fastrtps_interprocess folder (C:\ProgramData\eprosima\fastrtps_interprocess) before every launch of our app? That has fixed issues in the past for us.

@headlee No downsides. You could still do that.

@headlee
Copy link
Author

headlee commented Oct 27, 2023

Is there any downside to us deleting the contents of the fastrtps_interprocess folder (C:\ProgramData\eprosima\fastrtps_interprocess) before every launch of our app? That has fixed issues in the past for us.

@headlee No downsides. You could still do that.

Great, thanks!

One more sort of related question: do you know if it's possible to change the directory that eprosima uses to store the shm files on Windows? I found datasharingqospolicy in the docs but I don't know if that's for normal shm or just the DataReader/DataWriter delivery. Haven't had a chance to test it out yet.

@MiguelCompany
Copy link
Collaborator

do you know if it's possible to change the directory that eprosima uses to store the shm files on Windows?

@headlee I'm afraid it cannot be changed. It is fixed here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
more-information-needed Further information is required
Projects
None yet
Development

No branches or pull requests

3 participants