Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the way call-IDs are tracked in the SIP plugin (fixes #3404) #3443

Merged
merged 2 commits into from
Oct 24, 2024

Conversation

lminiero
Copy link
Member

@lminiero lminiero commented Oct 4, 2024

This PR is an attempt to address the problem @ycherniavskyi reported in issue #3404. We have to thank Yurii for this fix as well, as his company, @WebTrit, sponsored the effort to investigate the issue and come up with a potential solution. Thanks, guys!

Coming to the core of the issue, Yurii identified an inconsistency in how Call-IDs were used as ways to address sessions in the SIP plugin, that would occur specifically when caller and callee were handled by the same Janus instance, and so the map would be overwritten. This could cause problems in two specific cases:

  1. When using event handlers to notify incoming/outgoing SIP messages to an external backend (e.g., Homer) and mapping them to the plugin session (and so Janus handle) they belonged to.
  2. When doing attended SIP transfers to fill in the Replaces header properly.

This PR introduces a new kind of mapping where we don't map the call-ID to a single session, but instead map it to a new structure that can keep track of both a caller and a callee, by keeping track of caller tags too. That structure is only updated with info on sessions handled by the Janus instance, and allows us to track both parties in a call from the same call-ID. This allowed us to solve the two problems above in two separate ways:

  1. For event handlers, we find the mapping from call-ID, and then check From/To headers to find the right session looking at tags in the request.
  2. For attended transfers, we limit the scope of lookups to sessions handled by the session implementing the transfer: in fact, transfers in Janus rely on the concept of master/helper sessions to implement multiple lines, and limiting the scope to those lines allows us to avoid ambiguities.

As usual, while this PR is currently for master, I'll backport it to 0.x once it's merged. I tested the PR briefly and it seems to work, but of course my ability to test SIP deployments (and those use cases in particular) is limited. I'm looking forward to feedback, to ensure that first of all this fixes the two problems we mentioned, but also that we're not introducing regressions with these changes. If you use SIP extensively in your deployments, please make sure to test this and give us all the feedback we need to make sure it's merged as soon as possible.

@lminiero
Copy link
Member Author

lminiero commented Oct 4, 2024

Pinging @lmangani as well, since before this fix SIP messages sent to Homer may have been misattributed by the core.

@ycherniavskyi
Copy link
Contributor

@lminiero We've successfully built Janus with this PR and are beginning our testing process. Please expect a detailed feedback report within several days.

@lminiero
Copy link
Member Author

@ycherniavskyi any feedback on this?

@ycherniavskyi
Copy link
Contributor

@lminiero, unfortunately, we encountered two problematic cases that appear to be related to the same root issue. In both instances, Janus crashed (in our environment, the Janus container exited with code 139).

First case

Two accounts, Client A and Client B, are registered to the same Janus instance via our prospect softswitch.

Steps to reproduce:

  1. Client A initiates a call to Client B.
  2. Client B receives the incoming call from Client A.
  3. Client B answers the call from Client A.
  4. The call is successfully connected and proceeds normally.
  5. Client A drops the call with Client B.
  6. Upon receiving the BYE message for Client B, Janus crashes.

Here is the call flow and detailed SIP message log for this case: Link to Gist (IP addresses, numbers, and names have been anonymized).

It’s also worth noting that this issue does not occur with all softswitches we’ve tested.

Second case

In this scenario, three accounts are registered on our prospect softswitch:

  • Account 1 (Linphone)
  • Account 2 (Janus)
  • Account 3 (any other device)

Steps to reproduce:

  1. Account 1 calls Account 2.
  2. Account 1 transfers (either unattended or attended) the call to Account 3.
  3. Account 2 and Account 3 are successfully connected and the call proceeds.
  4. When either Account 2 or Account 3 ends the call, Janus crashes.

The only requirement for this crash is that a non-Janus initiates the transfer.

Since both cases seem related to the same issue, I didn’t collect a separate SIP dump for the second case. However, if needed, I can gather it ASAP.

Let me know if you need additional information or further logs.

@lminiero
Copy link
Member Author

If there's crashes, I'll need a gdb stacktrace or, even better, a libasan dump, to see where it crashes. Just the SIP messages (which is what I see in that gist) won't help unfortunately.

@lminiero
Copy link
Member Author

@ycherniavskyi while we wait for those crash logs I mentioned, the only thing I noticed in your log is that there seem to be many SIP BYEs that are never answered, I guess because, as you pointed out, Janus seems to be crashing when it gets the first one. That said, we don't do anything in nua_i_bye except save the reason for the BYE, and that wasn't changed in this PR. I personally can't replicate this crash, so I'll need debugging info from you to look into it.

@lminiero
Copy link
Member Author

@ycherniavskyi I should have fixed the crash in the commit above: it was a double free, caused by the fact we were not strdup-ing the call-id when adding it as a key to the callids table in some cases, and so the string was freed twice (once by us, once by the table itself on removal).

@lminiero
Copy link
Member Author

Merging ✌️

@lminiero lminiero merged commit cb10621 into master Oct 24, 2024
8 checks passed
@lminiero lminiero deleted the sip-callids branch October 24, 2024 12:08
lminiero added a commit that referenced this pull request Oct 24, 2024
adnan-mujagic pushed a commit to adnan-mujagic/janus-gateway that referenced this pull request Oct 25, 2024
natikaltura pushed a commit to natikaltura/janus-gateway that referenced this pull request Nov 7, 2024
mwalbeck pushed a commit to mwalbeck/docker-janus-gateway that referenced this pull request Nov 28, 2024
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [meetecho/janus-gateway](https://github.com/meetecho/janus-gateway) | minor | `v1.2.4` -> `v1.3.0` |

---

### Release Notes

<details>
<summary>meetecho/janus-gateway (meetecho/janus-gateway)</summary>

### [`v1.3.0`](https://github.com/meetecho/janus-gateway/blob/HEAD/CHANGELOG.md#v130---2024-11-25)

[Compare Source](meetecho/janus-gateway@v1.2.4...v1.3.0)

-   Refactored logging internals \[[PR-3428](meetecho/janus-gateway#3428)]
-   Use strtok to parse SDPs \[[PR-3424](meetecho/janus-gateway#3424)]
-   Fixed rare condition that could lead to a deadlock in the VideoRoom \[[PR-3446](meetecho/janus-gateway#3446)]
-   Fixed broken switch when using remote publishers in VideoRoom \[[PR-3447](meetecho/janus-gateway#3447)]
-   Added SRTP support to VideoRoom remote publishers (thanks [@&#8203;spscream](https://github.com/spscream)!) \[[PR-3449](meetecho/janus-gateway#3449)]
-   Added support for generic JSON metadata to VideoRoom publishers (thanks [@&#8203;spscream](https://github.com/spscream)!) \[[PR-3467](meetecho/janus-gateway#3467)]
-   Fixed deadlock in VideoRoom when failing to open a socket for a new RTP forwarder (thanks [@&#8203;spscream](https://github.com/spscream)!) \[[PR-3468](meetecho/janus-gateway#3468)]
-   Fixed deadlock in VideoRoom caused by reverse ordering of mutex locks \[[PR-3474](meetecho/janus-gateway#3474)]
-   Fixed memory leaks when using remote publishers in VideoRoom \[[PR-3475](meetecho/janus-gateway#3475)]
-   Diluted frequency of PLI in the VideoRoom (thanks [@&#8203;natikaltura](https://github.com/natikaltura)!) \[[PR-3423](meetecho/janus-gateway#3423)]
-   Better cleanup after failed mountpoint creations in Streaming plugin \[[PR-3465](meetecho/janus-gateway#3465)]
-   Fixed compilation of AudioBridge in case libogg isn't available (thanks [@&#8203;tmatth](https://github.com/tmatth)!) \[[PR-3438](meetecho/janus-gateway#3438)]
-   Better management of call cleanup in SIP plugin \[[Issue-3430](meetecho/janus-gateway#3430)]
-   Change the way call-IDs are tracked in the SIP plugin (thanks WebTrit!) \[[PR-3443](meetecho/janus-gateway#3443)]
-   Increased maximum size of custom SIP headers \[[Issue-3459](meetecho/janus-gateway#3459)]
-   Other smaller fixes and improvements (thanks to all who contributed pull requests and reported issues!)

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOS4yOC4wIiwidXBkYXRlZEluVmVyIjoiMzkuMjguMCIsInRhcmdldEJyYW5jaCI6Im1hc3RlciIsImxhYmVscyI6W119-->

Reviewed-on: https://git.walbeck.it/walbeck-it/docker-janus-gateway/pulls/157
Co-authored-by: renovate-bot <[email protected]>
Co-committed-by: renovate-bot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants