Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop server to reduce flakiness of UNIT_Server_TEST #2740

Open
wants to merge 3 commits into
base: gz-sim9
Choose a base branch
from

Conversation

scpeters
Copy link
Member

@scpeters scpeters commented Jan 28, 2025

🦟 Bug fix

Fixes #2609.

Summary

The ServerRepeat/ServerFixture.ServerConfigSensorPlugin/0 test was timing out flakily. This adds print statements and a server->Stop(); call to aid in shutdown.

Checklist

  • Signed all commits for DCO
  • Added tests
  • Updated documentation (as needed)
  • Updated migration guide (as needed)
  • Consider updating Python bindings (if the library has them)
  • codecheck passed (See contributing)
  • All tests passed (See test coverage)
  • While waiting for a review on your PR, please help review another open pull request to support the maintainers

Note to maintainers: Remember to use Squash-Merge and edit the commit message to match the pull request summary while retaining Signed-off-by messages.

@github-actions github-actions bot added the 🏛️ ionic Gazebo Ionic label Jan 28, 2025
@scpeters
Copy link
Member Author

Passing test console output (from macOS)

37: [ RUN      ] ServerRepeat/ServerFixture.ServerConfigSensorPlugin/0
37: (2025-01-28 16:05:32.108) [debug] [Server_TEST.cc:318] Create server
37: (2025-01-28 16:05:32.108) [info] [Server.cc:145] Loading SDF world file[/Users/jenkins/jenkins-agent/workspace/gz_sim-ci-pr_any-homebrew-amd64/gz-sim/test/worlds/air_pressure.sdf].
37: (2025-01-28 16:05:32.159) [info] [SystemManager.cc:54] Serving entity system service on [/entity/system/add]
37: (2025-01-28 16:05:33.314) [debug] [SystemManager.cc:80] Loaded system [gz::sim::systems::AirPressure] for entity [1]
37: (2025-01-28 16:05:33.323) [debug] [SystemManager.cc:80] Loaded system [gz::sim::systems::SceneBroadcaster] for entity [1]
37: (2025-01-28 16:05:33.323) [info] [LevelManager.cc:641] Loaded level [default]
37: (2025-01-28 16:05:33.662) [debug] [TestSensorSystem.hh:49] Constructing TestSensorSystem
37: (2025-01-28 16:05:33.662) [debug] [TestSensorSystem.hh:69] Configuring TestSensorSystem
37: (2025-01-28 16:05:33.662) [debug] [SystemManager.cc:80] Loaded system [gz::sim::TestSensorSystem] for entity [8]
37: (2025-01-28 16:05:33.662) [info] [SimulationRunner.cc:260] Serving world controls on [/world/air_pressure_sensor/control], [/world/air_pressure_sensor/control/state] and [/world/air_pressure_sensor/playback/control]
37: (2025-01-28 16:05:33.662) [info] [SimulationRunner.cc:286] Serving GUI information on [/world/air_pressure_sensor/gui/info]
37: (2025-01-28 16:05:33.662) [info] [SimulationRunner.cc:289] World [air_pressure_sensor] initialized with [fast] physics profile.
37: (2025-01-28 16:05:33.662) [info] [SimulationRunner.cc:296] Serving world SDF generation service on [/world/air_pressure_sensor/generate_world_sdf]
37: (2025-01-28 16:05:33.662) [info] [ServerPrivate.cc:334] Serving world names on [/gazebo/worlds]
37: (2025-01-28 16:05:33.662) [info] [ServerPrivate.cc:347] Resource path add service on [/gazebo/resource_paths/add].
37: (2025-01-28 16:05:33.662) [info] [ServerPrivate.cc:360] Resource path get service on [/gazebo/resource_paths/get].
37: (2025-01-28 16:05:33.663) [info] [ServerPrivate.cc:375] Resource path resolve service on [/gazebo/resource_paths/resolve].
37: (2025-01-28 16:05:33.663) [info] [ServerPrivate.cc:389] Resource paths published on [/gazebo/resource_paths].
37: (2025-01-28 16:05:33.663) [info] [ServerPrivate.cc:402] Server control service on [/server_control].
37: (2025-01-28 16:05:33.663) [debug] [Server_TEST.cc:326] Run server
37: (2025-01-28 16:05:33.663) [debug] [Server_TEST.cc:332] Request service
37: (2025-01-28 16:05:33.663) [debug] [Server_TEST.cc:339] Requesting /test/service/sensor
37: (2025-01-28 16:05:33.663) [debug] [TestSensorSystem.hh:59] TestSensorSystem service called
37: (2025-01-28 16:05:33.663) [debug] [Server_TEST.cc:341] Response: {executed: 1, result: 1}
37: (2025-01-28 16:05:33.663) [debug] [Server_TEST.cc:346] Shutting down
37: (2025-01-28 16:05:33.663) [info] [SimulationRunner.cc:743] Found no publishers on /stats, adding root stats topic
37: (2025-01-28 16:05:33.663) [info] [SimulationRunner.cc:777] Found no publishers on /clock, adding root clock topic
37: (2025-01-28 16:05:33.676) [debug] [TestSensorSystem.hh:54] Destroying TestSensorSystem
37: [       OK ] ServerRepeat/ServerFixture.ServerConfigSensorPlugin/0 (1579 ms)

@scpeters
Copy link
Member Author

Failing test console output

37: [ RUN      ] ServerRepeat/ServerFixture.ServerConfigSensorPlugin/0
37: (2025-01-29 00:22:19.143) [debug] [Server_TEST.cc:318] Create server
37: (2025-01-29 00:22:19.150) [info] [Server.cc:145] Loading SDF world file[/home/jenkins/workspace/gz_sim-ci-pr_any-noble-amd64/gz-sim/test/worlds/air_pressure.sdf].
37: (2025-01-29 00:22:19.210) [info] [SystemManager.cc:54] Serving entity system service on [/entity/system/add]
37: (2025-01-29 00:22:19.213) [debug] [SystemManager.cc:80] Loaded system [gz::sim::systems::AirPressure] for entity [1]
37: (2025-01-29 00:22:19.213) [debug] [SystemManager.cc:80] Loaded system [gz::sim::systems::SceneBroadcaster] for entity [1]
37: (2025-01-29 00:22:19.214) [info] [LevelManager.cc:641] Loaded level [default]
37: (2025-01-29 00:22:19.214) [debug] [TestSensorSystem.hh:49] Constructing TestSensorSystem
37: (2025-01-29 00:22:19.214) [debug] [TestSensorSystem.hh:69] Configuring TestSensorSystem
37: (2025-01-29 00:22:19.215) [debug] [SystemManager.cc:80] Loaded system [gz::sim::TestSensorSystem] for entity [8]
37: (2025-01-29 00:22:19.215) [info] [SimulationRunner.cc:260] Serving world controls on [/world/air_pressure_sensor/control], [/world/air_pressure_sensor/control/state] and [/world/air_pressure_sensor/playback/control]
37: (2025-01-29 00:22:19.215) [info] [SimulationRunner.cc:286] Serving GUI information on [/world/air_pressure_sensor/gui/info]
37: (2025-01-29 00:22:19.215) [info] [SimulationRunner.cc:289] World [air_pressure_sensor] initialized with [fast] physics profile.
37: (2025-01-29 00:22:19.215) [info] [SimulationRunner.cc:296] Serving world SDF generation service on [/world/air_pressure_sensor/generate_world_sdf]
37: (2025-01-29 00:22:19.215) [info] [ServerPrivate.cc:334] Serving world names on [/gazebo/worlds]
37: (2025-01-29 00:22:19.215) [info] [ServerPrivate.cc:347] Resource path add service on [/gazebo/resource_paths/add].
37: (2025-01-29 00:22:19.215) [info] [ServerPrivate.cc:360] Resource path get service on [/gazebo/resource_paths/get].
37: (2025-01-29 00:22:19.215) [info] [ServerPrivate.cc:375] Resource path resolve service on [/gazebo/resource_paths/resolve].
37: (2025-01-29 00:22:19.215) [info] [ServerPrivate.cc:389] Resource paths published on [/gazebo/resource_paths].
37: (2025-01-29 00:22:19.216) [info] [ServerPrivate.cc:402] Server control service on [/server_control].
37: (2025-01-29 00:22:19.216) [debug] [Server_TEST.cc:326] Run server
37: (2025-01-29 00:22:19.216) [debug] [Server_TEST.cc:332] Request service
37: (2025-01-29 00:22:19.216) [debug] [Server_TEST.cc:339] Requesting /test/service/sensor
37: (2025-01-29 00:22:19.216) [debug] [TestSensorSystem.hh:59] TestSensorSystem service called
37: (2025-01-29 00:22:19.216) [debug] [Server_TEST.cc:341] Response: {executed: 1, result: 1}
37: (2025-01-29 00:22:19.216) [debug] [Server_TEST.cc:346] Shutting down
37: (2025-01-29 00:22:19.216) [info] [SimulationRunner.cc:743] Found no publishers on /stats, adding root stats topic
37: (2025-01-29 00:22:19.216) [info] [SimulationRunner.cc:777] Found no publishers on /clock, adding root clock topic
37: (2025-01-29 00:22:19.216) [debug] [SimulationRunner.cc:542] Creating PostUpdate worker threads: 3
37: (2025-01-29 00:22:19.216) [debug] [SimulationRunner.cc:553] Creating postupdate worker thread (0)
37: (2025-01-29 00:22:19.216) [debug] [SimulationRunner.cc:553] Creating postupdate worker thread (1)
37: (2025-01-29 00:22:19.217) [info] [SceneBroadcaster.cc:634] Serving scene information on [/world/air_pressure_sensor/scene/info]
37: (2025-01-29 00:22:19.217) [info] [SceneBroadcaster.cc:643] Serving graph information on [/world/air_pressure_sensor/scene/graph]
37: (2025-01-29 00:22:19.217) [info] [SceneBroadcaster.cc:654] Serving full state on [/world/air_pressure_sensor/state]
37: (2025-01-29 00:22:19.217) [info] [SceneBroadcaster.cc:663] Serving full state (async) on [/world/air_pressure_sensor/state_async]
37: (2025-01-29 00:22:19.217) [info] [SceneBroadcaster.cc:671] Publishing scene information on [/world/air_pressure_sensor/scene/info]
37: (2025-01-29 00:22:19.217) [info] [SceneBroadcaster.cc:680] Publishing entity deletions on [/world/air_pressure_sensor/scene/deletion]
37: (2025-01-29 00:22:19.217) [info] [SceneBroadcaster.cc:689] Publishing state changes on [/world/air_pressure_sensor/state]
37: (2025-01-29 00:22:19.218) [info] [SceneBroadcaster.cc:700] Publishing pose messages on [/world/air_pressure_sensor/pose/info]
37: (2025-01-29 00:22:19.217) [debug] [AirPressureSensor.cc:126] Air pressure for [air_pressure_model::link::air_pressure_sensor] advertised on [world/air_pressure_sensor/model/air_pressure_model/link/link/sensor/air_pressure_sensor/air_pressure]
37: (2025-01-29 00:22:19.218) [info] [SceneBroadcaster.cc:711] Publishing dynamic pose messages on [/world/air_pressure_sensor/dynamic_pose/info]
 37/324 Test  #37: UNIT_Server_TEST ........................................***Timeout 240.10 sec

@iche033
Copy link
Contributor

iche033 commented Jan 30, 2025

looks like a race condition. If the systems' PostUpdate thread started running, the test gets stuck shutting down. Maybe one thing to try is not load the scene_broadcaster system as it does not look like it's needed for this test (and probably need to change the expectation of system count to 2u)

@azeey
Copy link
Contributor

azeey commented Jan 30, 2025

looks like a race condition. If the systems' PostUpdate thread started running, the test gets stuck shutting down. Maybe one thing to try is not load the scene_broadcaster system as it does not look like it's needed for this test (and probably need to change the expectation of system count to 2u)

It would be nice if we could could fix the race condition though.

Signed-off-by: Steve Peters <[email protected]>
Copy link

codecov bot commented Mar 5, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (gz-sim9@06e16bc). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             gz-sim9    #2740   +/-   ##
==========================================
  Coverage           ?   68.94%           
==========================================
  Files              ?      345           
  Lines              ?    33349           
  Branches           ?        0           
==========================================
  Hits               ?    22994           
  Misses             ?    10355           
  Partials           ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@scpeters scpeters changed the title Add debug console messages to Server_TEST.cc Stop server to reduce flakiness of UNIT_Server_TEST Mar 5, 2025
@scpeters scpeters marked this pull request as ready for review March 5, 2025 16:59
@scpeters scpeters requested a review from mjcarroll as a code owner March 5, 2025 16:59
@ntfshard
Copy link
Contributor

ntfshard commented Mar 5, 2025

looks like a race condition. If the systems' PostUpdate thread started running, the test gets stuck shutting down. Maybe one thing to try is not load the scene_broadcaster system as it does not look like it's needed for this test (and probably need to change the expectation of system count to 2u)

It would be nice if we could could fix the race condition though.

Running test with thread sanitizer, like valgrind --tool=drd can show quite a lot of problems in a different places.
I tried to figure out how to fix it and seems it or should include re-design or quite big bunch of mutexes. (don't have too much free time actually, sorry)

According to my observations, threads are stuck on a barrier, while rest of the code are starting destroying in the end of test and eventually it stuck in destructor, on thread join operation.

#4  ___pthread_cond_wait (cond=0x5f3cb3736418, mutex=0x5f3cb37363f0) at ./nptl/pthread_cond_wait.c:627
#5 in gz::sim::v9::Barrier::Wait (this=0x5f3cb3735e10) at /usr/include/c++/13/bits/unique_ptr.h:199
#6 in gz::sim::v9::SimulationRunner::UpdateSystems (this=0x5f3cb20b90b0) at /gz-sim/src/SimulationRunner.cc:626

Line number may be skewed(this log is a couple weeks old), but it's inside SimulationRunner::UpdateSystems(), in post-update section

Still this bug has not the best reproducibility, sometimes executable should be restarted more than 100 times to reproduce hang

EXPECT_TRUE(executed);
EXPECT_TRUE(result);
EXPECT_EQ("TestSensorSystem", rep.data());
server.Stop();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess ~ServerPrivate is already calling Stop()

I'm not sure if this actually fixes anything

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏛️ ionic Gazebo Ionic
Projects
Status: In progress
4 participants