
fix: lower packet loss to near-zero levels #178

Closed
mojomex wants to merge 5 commits from the fix-queue-packet-loss branch

Conversation

@mojomex (Collaborator) commented on Jul 22, 2024

PR Type

  • Bug Fix

Related Links

  • [TIER IV internal link] -- internal ticket

Description

This PR reduces the chance of packet loss for all supported LiDARs by minimizing the number of mutex acquisitions in the UDP receiver thread.

Previously, the UDP receiver thread was losing packets mainly because of scheduling:

  1. The decoder is currently reading from the queue and holds the lock
  2. The UDP receiver cannot acquire the lock, so its thread gets suspended
  3. N UDP packets arrive (and are dropped) while the thread is not scheduled
  4. By the time the thread is finally scheduled, those packets have already been lost (see the sketch below)
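
To make the contention concrete, below is a minimal sketch of a bounded, mutex-protected queue in the spirit of `mt_queue`. The class name, API, and capacity handling are illustrative assumptions, not Nebula's actual implementation; the point is only where the receiver thread can block:

```cpp
// Minimal sketch (not Nebula's actual mt_queue API): a bounded, mutex-protected
// queue like the one that used to sit between the UDP receiver and the decoder.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

template <typename T>
class BoundedQueue
{
public:
  explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

  // Producer side. At up to ~36 kHz (OT128), even short waits are costly:
  // while this call is blocked on the mutex, the receiver thread is not
  // calling recv(), so the kernel socket buffer can overflow and drop packets.
  void push(T item)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
    queue_.push(std::move(item));
    not_empty_.notify_one();
  }

  // Consumer side. While the lock is held here, push() cannot make progress.
  T pop()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return !queue_.empty(); });
    T item = std::move(queue_.front());
    queue_.pop();
    not_full_.notify_one();
    return item;
  }

private:
  std::size_t capacity_;
  std::mutex mutex_;
  std::condition_variable not_full_, not_empty_;
  std::queue<T> queue_;
};
```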

The loss described above is mostly mitigated by moving the mt_queue buffer from between the hardware interface and the decoder to between the decoder and the pointcloud/packets publishers. In other words, the queue is moved from the high-frequency part of Nebula (up to 36 kHz for the OT128) to the low-frequency part (10 Hz, or whatever the sensor frame rate is set to).
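
Conceptually, the change looks like the following sketch, which reuses the `BoundedQueue` template from above. All other names (`receive_udp_packet`, `decode`, `publish`, the loop functions) are hypothetical stand-ins rather than Nebula's actual hardware interface, decoder, and ROS wrapper classes:

```cpp
// Sketch only: assumes the BoundedQueue template from the previous snippet.
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

using Packet = std::vector<std::uint8_t>;
struct PointCloud { /* decoded points, header, ... */ };

// Hypothetical stand-ins for the real socket, decoder, and publisher code.
Packet receive_udp_packet() { return Packet(1500); }
std::optional<PointCloud> decode(const Packet &) { return std::nullopt; }
void publish(const PointCloud &) {}

// BEFORE: every packet (up to ~36 kHz on the OT128) was pushed into a shared
// queue; any contention with the decoder stalled the receiver thread, so
// recv() was not called in time.
void receive_loop_before(BoundedQueue<Packet> & packet_queue)
{
  while (true) {
    packet_queue.push(receive_udp_packet());  // can block behind the decoder
  }
}

// AFTER: packets are decoded on the receiver side; only completed frames
// (10 Hz, or whatever the sensor frame rate is) cross the queue, so the hot
// path never waits on a lock held during publishing.
void receive_loop_after(BoundedQueue<PointCloud> & cloud_queue)
{
  while (true) {
    if (auto cloud = decode(receive_udp_packet())) {
      cloud_queue.push(std::move(*cloud));  // low frequency: contention is cheap
    }
  }
}

// Publisher thread: consumes completed frames at the sensor frame rate.
void publish_loop(BoundedQueue<PointCloud> & cloud_queue)
{
  while (true) {
    publish(cloud_queue.pop());
  }
}
```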

The figure below shows the number of points per unfiltered pointcloud (as a proxy for the number of packets received) for the baseline and fixed ("queue_only") implementations. The sensor under test was an OT128 in dual-return, high-resolution mode.
The benchmarking setup was as follows (every run was repeated 3 times):

  • `taskset` Nebula to cores 0-7
  • `taskset` `ros2 bag record -s mcap --max-cache-size 5000000000 /pandar_points` to cores 8-11
  • run 8 instances of `ros2 topic echo /pandar_points --field header`, `taskset` to core `12 + 2i` for instance `i`
  • stop all processes after 60 s

The plot was generated from the concatenated rosbags.

[Figure: nebula_queue — points per unfiltered pointcloud, baseline vs. fix]

|  | messages | mean | std | min | 0.1% | 1% | 10% | 50% |
|---|---|---|---|---|---|---|---|---|
| baseline | 1715 | 920577 | 2436.64 | 865024 | 895010 | 911908 | 919040 | 921600 |
| fix | 1717 | 921585 | 582.054 | 897536 | 920503 | 921600 | 921600 | 921600 |
| baseline (% lost) | | 0.11 | | 6.14 | 2.89 | 1.05 | 0.28 | 0 |
| fix (% lost) | | 0.00 | | 2.61 | 0.12 | 0 | 0 | 0 |

(Columns 0.1% to 50% are percentiles of points per pointcloud message; the "% lost" rows give the corresponding loss relative to a full 921600-point frame.)

ℹ️ This PR applies to all Hesai, Velodyne, and Robosense sensors.
ℹ️ #169 (Aeva Aeries II) already uses a similar approach and should be immune to packet loss.
⚠️ I am not sure yet whether this change would also benefit Continental sensors (the HW monitor and other functions are handled in the decoder, and I do not know the packet rate), so I skipped them for now. @knzo25 have you observed packet loss on the radars so far?

Pre-Review Checklist for the PR Author

PR Author should check the checkboxes below when creating the PR.

  • Assign PR to reviewer

Checklist for the PR Reviewer

Reviewers should check the checkboxes below before approval.

  • Commits are properly organized and commit messages follow the guidelines
  • (Optional) Unit tests have been written for new behavior
  • PR title describes the changes

Post-Review Checklist for the PR Author

PR Author should check the checkboxes below before merging.

  • All open points are addressed and tracked via issues or tickets

CI Checks

  • Build and test for PR: Required to pass before the merge.

@mojomex mojomex self-assigned this Jul 22, 2024
@mojomex mojomex force-pushed the fix-queue-packet-loss branch from ef994c7 to 70de7e8 on July 22, 2024 13:33
@mojomex mojomex force-pushed the fix-queue-packet-loss branch 3 times, most recently from d80dcf4 to db305df on July 23, 2024 05:26
@mojomex mojomex force-pushed the fix-queue-packet-loss branch from db305df to f77ee3b on July 23, 2024 05:28
@mojomex mojomex requested a review from drwnz July 23, 2024 06:27
@mojomex mojomex marked this pull request as ready for review July 23, 2024 06:28
@mojomex mojomex force-pushed the fix-queue-packet-loss branch from 7d153d8 to 3e2239f on July 23, 2024 06:31
@mojomex (Collaborator, Author) commented on Aug 2, 2024

In the end this change was not needed: the same effect can be achieved by allocating larger kernel network receive buffers (`net.core.rmem_default`).
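
For reference, a larger receive buffer can also be requested per socket via `SO_RCVBUF`; the kernel clamps the value to `net.core.rmem_max`, while `net.core.rmem_default` is what sockets get when they do not ask. A minimal sketch (illustrative values, not Nebula's actual socket setup):

```cpp
// Minimal sketch: request a larger kernel receive buffer for a UDP socket.
// The size is illustrative; the kernel clamps it to net.core.rmem_max, and
// Linux typically reports back roughly double the requested value.
#include <cstdio>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
  int sock = socket(AF_INET, SOCK_DGRAM, 0);
  if (sock < 0) {
    std::perror("socket");
    return 1;
  }

  int requested = 16 * 1024 * 1024;  // 16 MiB, illustrative
  if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0) {
    std::perror("setsockopt(SO_RCVBUF)");
  }

  int effective = 0;
  socklen_t len = sizeof(effective);
  getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &effective, &len);
  std::printf("effective receive buffer: %d bytes\n", effective);

  close(sock);
  return 0;
}
```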

@mojomex mojomex closed this Aug 2, 2024