Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🧑‍🌾 test_rosbag2_play_end_to_end flaky on Windows #1437

Open
Crola1702 opened this issue Aug 8, 2023 · 10 comments
Open

🧑‍🌾 test_rosbag2_play_end_to_end flaky on Windows #1437

Crola1702 opened this issue Aug 8, 2023 · 10 comments
Assignees
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@Crola1702
Copy link
Contributor

Crola1702 commented Aug 8, 2023

Description

Flaky test test_rosbag2_play_end_to_end in windows CI (debug, release and repeated)

Test regressions:

Expected Behavior

Test should pass

Actual Behavior

Test failing because of a timeout

To Reproduce

  1. Run a build in nightly_win_deb, nightly_win_rel or nightly_win_rep
  2. See projectroot.test_rosbag2_play_end_to_end fail

System (please complete the following information)

  • OS: Windows
  • ROS 2 Distro: Rolling
  • Install Method: Source
  • Version: Rolling latest

Additional context

Reference build: https://ci.ros2.org/view/nightly/job/nightly_win_deb/2813/

Test regressions:

Test is failing because of a timeout:

Log output:

[WARN] [1691491268.193064100] [subscription_manager_1691491266065392000]: SubscriptionManager::continue_spinning(..) finished by timeout
C:\ci\ws\src\ros2\rosbag2\rosbag2_tests\test\rosbag2_tests\test_rosbag2_play_end_to_end.cpp(80): error: Expected equality of these values:
  future.wait_for(service_call_timeout_)
    Which is: 4-byte object <01-00 00-00>
  std::future_status::ready
    Which is: 4-byte object <00-00 00-00>
 1/11 Test  #2: test_rosbag2_play_end_to_end .....***Timeout  60.00 sec

Test gets stuck when the error pops up (normally takes 15 seconds to run)


Flakiness ratio (last 15 days)

job_name last_time first_time build_count failure_count failure_percentage
nightly_win_rep 2023-08-16 2023-08-02 15 13 86.67
nightly_win_rel 2023-08-16 2023-08-02 15 14 93.33
nightly_win_deb 2023-08-16 2023-08-02 15 13 86.67

Updated 17-08-2023

Running a diff between ros2 repos in nightly_win_deb#2806 and nightly_win_deb#2807:

@@ -36,7 +36,7 @@ repositories:
   eProsima\Fast-DDS:
     type: git
     url: https://github.com/eProsima/Fast-DDS.git
-    version: 7cf43a62cabc3124721258f02c9257f451dd1971
+    version: 9ae27f174c1a33bf539c589ce2f1d6630052d1b0
   eProsima\foonathan_memory_vendor:
     type: git
     url: https://github.com/eProsima/foonathan_memory_vendor.git
@@ -208,7 +208,7 @@ repositories:
   ros2\mimick_vendor:
     type: git
     url: https://github.com/ros2/mimick_vendor.git
-    version: 6fcd465251c1e62b1dddabf6607712da8a141a2c
+    version: 24f0be689e525dbcf18cce910ac12bae26c07e3c
   ros2\orocos_kdl_vendor:
     type: git
     url: https://github.com/ros2/orocos_kdl_vendor.git
@@ -244,7 +244,7 @@ repositories:
   ros2\rclpy:
     type: git
     url: https://github.com/ros2/rclpy.git
-    version: 913afa019b0d60e7255bd84247dbeab4735d96ea
+    version: 5367703c9812680a563fb904b8c9d187da310bd3
   ros2\rcpputils:
     type: git
     url: https://github.com/ros2/rcpputils.git
@@ -264,7 +264,7 @@ repositories:
   ros2\rmw_connextdds:
     type: git
     url: https://github.com/ros2/rmw_connextdds.git
-    version: 1206113b6eff6d2a6557911866548ff4415c6852
+    version: b57d032cad10a68f0cb0d349a715b9e3c4475cf4
   ros2\rmw_cyclonedds:
     type: git
     url: https://github.com/ros2/rmw_cyclonedds.git
@@ -300,7 +300,7 @@ repositories:
   ros2\rosbag2:
     type: git
     url: https://github.com/ros2/rosbag2.git
-    version: ba199d05954d6e51975c47acd3725cac6267778f
+    version: fda5aea264b4183d94352a7c7caf266ae104aa52

First time this issue was seen:

ros2/rmw_connextdds#26 (comment)

@Crola1702 Crola1702 added the bug Something isn't working label Aug 8, 2023
@Crola1702
Copy link
Contributor Author

@clalancette, this is happening almost consistently in Windows builds. Do you think it's a good time to disable this test while #1342 is resolved? (it seems it would be a long time until we have updates on the fix PR)

@Crola1702 Crola1702 moved this to Blocked 🛑 in Buildfarm Issues Aug 21, 2023
@MichaelOrlov
Copy link
Contributor

@Crola1702 @clalancette I guess it hasn't been fully fixed for Windows after my overhaul in player tests per #1297.
I will try to take a look at it this week.
If I will not find a solution we can disable it for Windows again.

@MichaelOrlov
Copy link
Contributor

BTW. I have come across with this failure in one of my recent PR #1423 (comment).
My preliminary analysis was:

Failed PlayEndToEndTestFixture.play_filters_by_topic by timeout since was not able to receive confirmation about service call for player resume in 60 seconds.

@Crola1702
Copy link
Contributor Author

BTW. I have come across with this failure in one of my recent PR #1423 (comment).

I haven't seen this failure in the buildfarm 🤔

@Crola1702
Copy link
Contributor Author

I'm disabling the test in: #1452

@MichaelOrlov MichaelOrlov self-assigned this Aug 28, 2023
@Crola1702 Crola1702 moved this from Blocked 🛑 to Disabled ☑ in Buildfarm Issues Aug 28, 2023
@MichaelOrlov
Copy link
Contributor

Preliminary analysis:
There is a race condition between when service became available and when we start spining player node to be able to process service callbacks. i.e. it turns out that resume service created and on the other side we check in the test that service is available via graph and sending service request. However the rosbag2 player may not be ready to process callbacks from the services since we haven't yet even created executor for spinning player node.

The solution would be to move play/pause service creation from player constructor to the roabg2_palyer->play method.

cc: @clalancette

@clalancette
Copy link
Contributor

@MichaelOrlov Could this be another case similar to #1796 (review) ?

@MichaelOrlov
Copy link
Contributor

MichaelOrlov commented Sep 12, 2024

@clalancette Not the same. It is completely different.
Here, the problem is that we first constructing a Rosbag2 Player class in the rosbag2_py::transport_.cpp in the constructor we creating service for play/pause and this service became visible via node_graph for the test wich is waiting for it in another process. The test start sending service request for resume playback since it see that service became available. Although on another side (Player) we haven't yet started Player::Play() and(or) haven't yet created executor to spin it in the rosbag2_py::_transport.cpp.

auto player = std::make_shared<rosbag2_transport::Player>(
std::move(reader), std::move(keyboard_handler), storage_options, play_options);
rclcpp::executors::SingleThreadedExecutor exec;
exec.add_node(player);
auto spin_thread = std::thread(
[&exec]() {
exec.spin();
});
player->play();

Update:
Even if we will add wait for exec::is_spinning() before calling Player::play() it will not help much since the race also happened between constructing Player and further exec::spin() call.

@MichaelOrlov
Copy link
Contributor

It seems I found a workaround for tests.
In test before sending successful_service_request<Resume>(cli_resume_); request

// Send resume service call for player
ASSERT_TRUE(cli_resume_->wait_for_service(service_call_timeout_));
successful_service_request<Resume>(cli_resume_);

Need repeatedely request for another service IsPaused, but ignoring the failure if we will not get response. When we will get response for IsPaused from Player - it means that Rosbag2 Player is fully started and ready. It is sort of current ststus of the Player.
Will need to write some helper function similar to the

/// Send a service request, and expect it to successfully return within a reasonable timeout
template<typename Srv>
typename Srv::Response::SharedPtr successful_service_request(
typename rclcpp::Client<Srv>::SharedPtr cli,
typename Srv::Request::SharedPtr request)
{
auto future = cli->async_send_request(request);
EXPECT_EQ(future.wait_for(service_call_timeout_), std::future_status::ready);
EXPECT_TRUE(future.valid());
auto result = std::make_shared<typename Srv::Response>();
EXPECT_NO_THROW({result = future.get();});
EXPECT_TRUE(result);
return result;
}

@MichaelOrlov
Copy link
Contributor

@r7vme Help here will be appreciated.

@MichaelOrlov MichaelOrlov added the help wanted Extra attention is needed label Sep 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants