Smartswitch Dash PA Validation offload to NPU #1717

Open · wants to merge 6 commits into base: master

Conversation

@kperumalbfn kperumalbfn commented Jun 17, 2024

Offload DASH PA Validation rules to NPU

- [schema] add a set of SmartSwitch related tables - sonic-net/sonic-swss-common#947
- [zmq] add proxy mode to the ZmqServer - sonic-net/sonic-swss-common#948
- [DASH] add DASH offload manager and PA validation offload - sonic-net/sonic-swss#3358
- [dash] add zmq_dpu_proxy_address_base parameter to telemetry.go - sonic-net/sonic-gnmi#324
- [DASH] enable offload manager on Nvidia SmartSwitch - sonic-net/sonic-buildimage#20714

# 2 Modules Design

## 2.1 STATE_DB changes (per-DPU)
A new table, DASH_OFFLOAD_STATE_TABLE, is added to STATE_DB to inform the NPU about the DPU SAI capability and whether NPU offload is required for a given feature. During SAI initialization, DashOrchagent queries the SAI API (SAI_API_DASH_PA_VALIDATION) and sets the STATE_DB table with the NPU offload state.
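For illustration only, here is a minimal sketch of how this state could be published with the sonic-swss-common Python bindings; the key (`pa_validation`) and field (`offload_required`) names are assumptions, not the finalized schema:

```python
# Hypothetical sketch: publish the PA validation offload state to STATE_DB.
# Key and field names are illustrative; the real schema is defined by this HLD.
from swsscommon import swsscommon

state_db = swsscommon.DBConnector("STATE_DB", 0)
table = swsscommon.Table(state_db, "DASH_OFFLOAD_STATE_TABLE")

# Suppose the SAI capability query showed that the DPU does not support
# SAI_API_DASH_PA_VALIDATION, so the NPU has to offload it.
fvs = swsscommon.FieldValuePairs([("offload_required", "true")])
table.set("pa_validation", fvs)
```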
Can we put this table in DPU_APPL_STATE_DB? Is this on NPU or DPU?

| | | | "false" - NPU offload not required|


## 2.2 APPL_STATE_DB
This database should be DPU_APPL_STATE_DB.

@kperumalbfn (Contributor, Author):
@qiluo-msft @ganglyu @liuh-80 Please review the GNMI/ZMQ changes for Dash offload

@qiluo-msft requested a review from liuh-80 on November 4, 2024.

## 2.1 Dash Offload Manager
The new orchagent application DashOffloadManager will be responsible for DASH offloading logic. It will collect all the needed information for offloading and perform all the relevant configurations.
To get the DASH configuration that should be offloaded, the DashOffloadManager will act as a transparent ZMQ proxy between the GNMI server and the DPU swss, forwarding all the configuration and intercepting the tables that should be offloaded.
Why is the ZMQ proxy necessary? Could you subscribe to DPU_APPL_DB instead?

The idea is to have control over what is sent to the DPU, which is a more flexible tool for implementing different kinds of offload-related functionality.

In the case of PA validation offload, we don't forward the PA Validation table entries to the DPU (since the DPU has no use for them). This saves some NPU<->DPU bandwidth and keeps the DPU side simple (no entries, nothing to create).

In the future, we can also extend this infrastructure to alter the content of the configuration (if such a need arises for some other offload feature).
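As a rough illustration of the interception idea, here is a minimal pyzmq sketch, assuming a simple PULL-to-PUSH hop and a framing where the first message part carries the table name; the endpoints, socket types, and framing are assumptions for illustration only (the actual proxy mode is implemented in the sonic-swss-common ZmqServer):

```python
# Illustrative transparent ZMQ proxy: forwards everything to the DPU swss
# except the tables selected for NPU offload. Endpoints and message framing
# are hypothetical; the real implementation lives in sonic-swss-common.
import zmq

OFFLOADED_TABLES = {b"DASH_PA_VALIDATION_TABLE"}


def handle_offload(parts):
    """Placeholder: hand the intercepted entry to the offload orch (e.g. program NPU ACLs)."""
    pass


ctx = zmq.Context()
frontend = ctx.socket(zmq.PULL)              # receives config coming from the GNMI server
frontend.bind("tcp://127.0.0.1:8100")        # hypothetical proxy endpoint
backend = ctx.socket(zmq.PUSH)               # forwards config toward the DPU swss
backend.connect("tcp://169.254.200.1:8100")  # hypothetical DPU endpoint

while True:
    parts = frontend.recv_multipart()        # [table, key, operation, field-value blob, ...]
    if parts and parts[0] in OFFLOADED_TABLES:
        handle_offload(parts)                # intercepted: not forwarded to the DPU
    else:
        backend.send_multipart(parts)        # everything else passes through unchanged
```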

@ganglyu (Contributor) commented Nov 7, 2024:
GNMI client --> GNMI server --> ZMQ --> DPU orchagent --> NPU DPU_APPL_STATE_DB

The DPU orchagent should update the results to the DPU_APPL_STATE_DB on the NPU. If we don't forward the PA Validation table entries to the DPU, the client won't know whether this configuration failed or was intercepted.

Are you referring to the flow described in #1759?
As I understand it, the feedback design is not finalized yet. Once it is, I'll have to add the same logic to the offload flow to also fill the result into the DPU_APPL_STATE_DB for the offloaded (intercepted) PA validation entries. Is that ok?

Could you please include this in the design?

Added.

I am having the same concern here... This seems complicated to implement, because of a few things:

  1. The DASH offload manager or the NPU-side DASH orchs need to provide exactly the same feedback loop as the swss. If anything changes in swss, they need to be changed as well, which can easily be missed and cause problems.
  2. In order to support independent DPU upgrades, each DPU will need to have its own DASH offload manager and all the DASH orchs on the NPU side.
  3. Dependency and object handling can also be a problem. This solution is trying to provide a generic way to handle all DASH object offloading in the future, but I feel it doesn't really do the job. Explicit PA validation rules might be the simplest case, where the only thing we need to do is redirect the rules to the NPU side. However, other DASH objects can have dependencies; e.g., implicit PA validation rules come from VNET + CA-PA mappings. In that case, we cannot simply redirect the rules to the NPU side, but have to copy them, because they are also used in the outbound pipeline.

Overall, I feel Gang is correct. The other approach that Gang proposed is actually much cleaner and more maintainable. All we need is an if case in swss, and everything else (such as the feedback loops) can be reused.

@ganglyu, please let us know your thoughts.

I'm not sure if this design is necessary. If the DPU doesn't need the PA validation table and only the NPU requires it, we can configure this table directly for the NPU.

@ganglyu requested a review from qiluo-msft on November 12, 2024.

@ganglyu (Contributor) commented Nov 12, 2024:

@Yakiv-Huryk
This design change is expected to impact our ZMQ performance. Could you please evaluate the ZMQ performance before and after the change?

@Yakiv-Huryk (Contributor):

> @Yakiv-Huryk This design change is expected to impact our ZMQ performance. Could you please evaluate the ZMQ performance before and after the change?

I did the following test:

Sending 10k DASH_PREFIX_TAG_TABLE entries (each having 100 IPs).
The DASH_PREFIX_TAG_TABLE has the fastest processing on the DPU (the entries are just saved into memory). This way we make sure that the test is impacted as little as possible by the end consumer of the config (DPU swss).
I used the py_gnmicli.py from https://github.com/lguohan/gnxi as a client.

I measured the following:

  • the time it takes to send the config (to see if there is any back pressure from the gnmi server)
  • the time between the first and the last entry being processed on the DPU
  • the CPU/RAM usage of the gnmi process (telemetry.go) and the zmq-proxy (dashoffloadmanager)

Test without proxy

  • GNMI client config apply time: ~16sec (+/-1sec)
  • DPU difference between first and last entry processed: ~16sec (+/-1sec)

Telemetry.go CPU/RAM:
[screenshot: telemetry-no-proxy2]

Test with proxy

  • GNMI client config apply time: ~16sec (+/-1sec) (!)
  • DPU difference between first and last entry processed: ~16sec (+/-1sec) (!)

Telemetry.go CPU/RAM:
[screenshot: telemetry-with-proxy2]

DashOffloadManager(zmq-proxy) CPU/RAM:
[screenshot: proxy2]

The bottom line is that there is no measurable difference (at least with this test methodology).
The main difference is that with the proxy there is another process (another stage in the gnmi_client -> gnmi_server -> zmq_proxy -> DPU swss pipeline), which doesn't affect the overall bandwidth of the config we can process.
The proxy does, however, use some CPU; in the extreme case (during a burst of incoming config) it will use 100% of a single CPU core.

To explain the behavior: the config goes through the pipeline (gnmi_client -> gnmi_server -> zmq_proxy -> DPU swss), and the bandwidth of the pipeline is limited by its slowest stage. The proxy is the fastest stage here since it doesn't process the data (it simply sends the raw data into the zmq socket).

If you have any testing ideas/scenarios you want me to do, please share.

@Yakiv-Huryk (Contributor):
@ganglyu, can you please review the performance test results in the comment above?

@kperumalbfn (Contributor, Author):
@Yakiv-Huryk, could you update all the SONiC swss PRs in this HLD description?

@ganglyu requested a review from r12f on November 19, 2024.

@ganglyu (Contributor) commented Nov 19, 2024:

@r12f would you please review this PR?


# Definitions/Abbreviations

| | |
missing table headers


<img src="images/DashOffloadManagerWithConsumer.svg">

The Dash Offload Manager is disabled by default and only enabled for specific platforms that require its functionality.
Do we have any logic to update the GNMI server to point to the proxy or the DPU?


<img src="images/DashOffloadManager.svg">

Once offload is required, the DashOffloadManager will start a designated orch (e.g. PaValidationOffloadOrch) that subscribes to the configuration and performs the offload logic.
Where is the data stored on the NPU side? Is it in the NPU database or the DPU database? If it is in the NPU database, the schema will need to be changed, but I think that is not covered in the spec.

For each PA validation entry processed, the PaValidationOffloadOrch creates the following entry:

```
DASH_PA_VALIDATION_TABLE:{{vni}}
```
Does this mean the NPU-side PaValidationOffloadOrch will write into the DPU database?

Please refer to https://github.com/sonic-net/SONiC/pull/1759 for more details regarding GNMI feedback requirements and behavior.

## 2.3 DPU Shut/Restart
When the DPU goes down or restarts, the ACL configuration should be cleaned up. This is done by the Dash Offload Manager, which listens to the ChassisStateDB DPU_STATE table. When it detects that the DPU is down (dpu_control_plane_state is down), the PaValidationOffloadOrch is deinitialized, leading to ACL configuration cleanup and removal of the ZMQ proxy subscription.
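A minimal sketch of this cleanup trigger, assuming the sonic-swss-common Python bindings and the DPU_STATE table in CHASSIS_STATE_DB; the field name dpu_control_plane_state comes from the text above, while the cleanup helper is a hypothetical placeholder:

```python
# Illustrative watcher: deinitialize the PA validation offload when the DPU
# control plane goes down. Database/table names follow the HLD text; the
# cleanup helper below is a placeholder, not the actual orch code.
from swsscommon import swsscommon


def cleanup_pa_validation_offload(dpu_key):
    """Placeholder: remove NPU ACL configuration and the ZMQ proxy subscription."""
    pass


chassis_state_db = swsscommon.DBConnector("CHASSIS_STATE_DB", 0)
dpu_state = swsscommon.SubscriberStateTable(chassis_state_db, "DPU_STATE")

sel = swsscommon.Select()
sel.addSelectable(dpu_state)

while True:
    state, _ = sel.select(1000)  # 1s timeout
    if state != swsscommon.Select.OBJECT:
        continue
    key, op, fvs = dpu_state.pop()
    if dict(fvs).get("dpu_control_plane_state") == "down":
        cleanup_pa_validation_offload(key)
```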
We might also need to cover the DPU upgrade case, because when the chassis DB is updated, this orch might not be running at all.

@ganglyu to confirm.

Theoretically, we need to upgrade this orch when we upgrade the DPU.

"MATCHES": [
"TUNNEL_VNI",
"SRC_IP",
"SRC_IPV6"
It would be better to add the destination IP here as well. The reason is that the SmartSwitch lives in a T1 and can receive other traffic in the same VNET that is not sent to the DPU but to other VMs. Adding the SmartSwitch data plane VIP as the destination IP will be safer and more future-proof.

@r12f (Contributor) commented Dec 3, 2024:

and.... this design looks already implemented....

To get the DASH configuration that should be offloaded, the DashOffloadManager will act as a transparent ZMQ proxy between the GNMI server and the DPU swss, forwarding all the configuration and intercepting the tables that should be offloaded.
To simplify the management of the DPU offload and achieve optimal performance, every DPU is handled by a separate instance of a ZMQ proxy (a pair of ZMQ server and client).

<img src="images/DashOffloadManager.svg">
I assume this is after the GNMI splitter; it would be better to make it clearer how this works with the independent DPU upgrade changes.
