
SWBus (Switch Bus)

Riff edited this page Oct 31, 2024 · 2 revisions

Overview

SWBus is a high-performance, scalable message channel for SONiC internal services. It is designed to provide an easy-to-use interface for intra- and inter-switch communication between internal services.

At its core, it provides a mesh network between all switches and helps with message routing and forwarding between them.

SWBus is built on top of the gRPC framework and does not impose any serialization format on its message payload.

Top level concepts

The SWBus is composed of the following components:

  • SWitch Bus Core (swbus-core): This is the core implementation of swbus, which provides the protocol, connection management for the mesh network, and message routing, but does not provide the actual message handling or any knowledge of the specific network being created.
  • SWitch Bus Daemon (swbusd): This is the service that runs on each switch. It runs SWitch Bus Core inside, provides the gRPC interfaces, parses the network configuration files, and calls SWitch Bus Core to create the network.
  • SWitch Bus API (swbus-api): This is the API for any service that wants to use SWitch Bus to communicate with other services. It is essentially a wrapper around the gRPC interfaces provided by SWitch Bus Daemon. It also provides a message filter layer for:
    • Local message routing, so that only the messages sent to other services will be forwarded to the SWitch Bus Daemon.
    • Incoming message dispatching, so it can forward the messages to the correct message handler.

The 3 components are designed to be used together as the graph below shows:

```mermaid
graph TB
subgraph NodeA
    subgraph ServiceA
        service-logic-a0("Service Logic")
        service-logic-a1("Service Logic")
        swbus-api-a("Switch Bus API (swbus-api)")

        service-logic-a0 <--> swbus-api-a
        service-logic-a1 <--> swbus-api-a
    end

    subgraph ServiceB
        service-logic-b("Service Logic")
        swbus-api-b("Switch Bus API (swbus-api)")

        service-logic-b <--> swbus-api-b
    end

    subgraph swbusd-a["Switch Bus Daemon (swbusd)"]
        grpc-server-a("gRPC Server")
        swbus-core-a

        swbus-api-a <--> grpc-server-a
        swbus-api-b <--> grpc-server-a
        grpc-server-a <--> swbus-core-a
    end
end

subgraph NodeB
    subgraph ServiceC
        service-logic-c0("Service Logic")
        service-logic-c1("Service Logic")
        swbus-api-c("Switch Bus API (swbus-api)")

        service-logic-c0 <--> swbus-api-c
        service-logic-c1 <--> swbus-api-c
    end

    subgraph swbusd-c["Switch Bus Daemon (swbusd)"]
        grpc-server-c("gRPC Server")
        swbus-core-c

        swbus-api-c <--> grpc-server-c
        grpc-server-c <--> swbus-core-c
    end
end

swbus-core-a <--> swbus-core-c
```

Service Logic (Resource) as Endpoint

In the SWBus network, all service logic units (not the services themselves) are defined as the endpoints of the network, i.e., entities that send and receive messages. Each endpoint has its own unique address in the system, called a Service Path, which has the following structure:

The path is made of three locator groups: the Node Locator (Region ID, Cluster ID, Node ID), the Service Locator (Service Type, Service ID), and the Resource Locator (Resource Type, Resource ID).

| Region ID | Cluster ID | Node ID | Service Type | Service ID | Resource Type | Resource ID |
| --- | --- | --- | --- | --- | --- | --- |
| region-a | switch-cluster-a | 10.0.0.1-dpu0 | hamgrd | 0 | hascope | eni-0a1b2c3d4e5f6 |

The service path can be represented as a string: region-a/switch-cluster-a/10.0.0.1-dpu0/hamgrd/0/hascope/eni-0a1b2c3d4e5f6.
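To make the string form concrete, here is a minimal Python sketch of a service path value that round-trips between the structured form and the seven-segment string (the class and method names are illustrative, not the actual implementation):

```python
from dataclasses import dataclass


@dataclass
class ServicePath:
    """Illustrative service path: node, service, and resource locators."""
    region_id: str
    cluster_id: str
    node_id: str
    service_type: str
    service_id: str
    resource_type: str
    resource_id: str

    def to_string(self) -> str:
        # Join the seven segments in locator order.
        return "/".join([self.region_id, self.cluster_id, self.node_id,
                         self.service_type, self.service_id,
                         self.resource_type, self.resource_id])

    @classmethod
    def from_string(cls, s: str) -> "ServicePath":
        # Split into exactly seven segments.
        return cls(*s.split("/", 6))


sp = ServicePath.from_string(
    "region-a/switch-cluster-a/10.0.0.1-dpu0/hamgrd/0/hascope/eni-0a1b2c3d4e5f6")
print(sp.node_id)  # 10.0.0.1-dpu0
```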

We can also add dedicated message filters in swbus-api in order to implement endpoints that are only available within the service itself, such as a Redis DB. This simplifies and unifies communication with other existing services that use the swss-common Redis communication channel.

| Region ID | Cluster ID | Node ID | Service Type | Service ID | Resource Type | Resource ID |
| --- | --- | --- | --- | --- | --- | --- |
| region-b | switch-cluster-b | 10.0.0.2-dpu1 | redis | 0 | APPL.SOME_TABLE | some_key:some_subkey |

The service path can be represented as this string: region-b/switch-cluster-b/10.0.0.2-dpu1/redis/0/APPL.SOME_TABLE/some_key:some_subkey.

In the network request, the service path is defined in the protobuf message as below:

```protobuf
message ServicePath {
  // Server location
  string region_id = 10;
  string cluster_id = 20;
  string node_id = 30;

  // Service info
  string service_type = 110;
  string service_id = 120;

  // Resource info
  string resource_type = 210;
  string resource_id = 220;
}
```

Network topology

Once the network is set up, it will form a mesh network between all the switches. Each switch will maintain a route table pointing to the others, so that messages can be routed to the correct destination.

As an example, the network topology is shown below.

swbus-topo-full

Message routing

All messages in SWBus are unicast messages and routed based on the service path.

Messages are routed in a manner similar to longest-prefix matching:

  • In swbus-api:
    • Use the full service path to find the exact match. If there is a match, route the message there.
    • If not, try again with only the service location to find the match.
    • If not, forward to the SWBus Daemon to find the match.
  • In swbusd:
    • Use the service location to find the match.
    • If not, try again with only region id, cluster id, node id, to find the match.
    • If not, try again with only region id and cluster id to find the match.
    • If not, try again with only region id to find the match.
    • If still not, a NO_ROUTE error is returned.
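The swbusd fallback order above can be sketched as a simple prefix lookup (illustrative Python; the real route table and next-hop objects are richer than plain strings):

```python
def swbusd_route(routes: dict, dest: str) -> str:
    """Look up the next hop for a destination service path, swbusd-style.

    `routes` maps a path prefix string to a next-hop name. Fallback order:
    service location (5 segments), then node (3), cluster (2), region (1).
    """
    seg = dest.split("/")
    for n in (5, 3, 2, 1):
        key = "/".join(seg[:n])
        if key in routes:
            return routes[key]
    return "NO_ROUTE"


# Only a node-level route exists, so the service-location lookup misses
# and the node-level lookup matches.
routes = {"region-a/switch-cluster-a/10.0.0.2-dpu1": "conn-to-10.0.0.2"}
print(swbusd_route(
    routes,
    "region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0/hascope/eni-0a1b2c3d4e5f6"))
# conn-to-10.0.0.2
```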

Life of a packet

With this architecture, let's walk through how a message gets sent and handled.

Let's say we have 2 services in the network - HAMgrD on DPU0 and DPU1, and they would like to communicate with each other for the same ENI resource - 0a1b2c3d4e5f6.

  • Sender: region-a/switch-cluster-a/10.0.0.1-dpu0/hamgrd/0/hascope/eni-0a1b2c3d4e5f6
  • Receiver: region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0/hascope/eni-0a1b2c3d4e5f6

The life of a packet is as follows:

  • HAMgrD in DPU0 sends a message to the receiver with:
    • Destination = region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0/hascope/eni-0a1b2c3d4e5f6
    • Source = region-a/switch-cluster-a/10.0.0.1-dpu0/hamgrd/0/hascope/eni-0a1b2c3d4e5f6
  • In swbus-api:
    • First, use the full service path to find the next hop: region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0/hascope/eni-0a1b2c3d4e5f6
    • Since not found, use the service location to find the next hop: region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0
    • Still not found, forward to the SWBus Daemon to find the next hop.
  • In swbusd:
    • Use the service location to find the next hop: region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0
    • Since not found, use the node id to find the next hop: region-a/switch-cluster-a/10.0.0.2-dpu1
    • This time, the next hop will be found as a gRPC connection to the swbusd running on 10.0.0.2 for DPU1, so we will forward the message there.
  • On 10.0.0.2, in swbusd:
    • Use the service location to find the next hop: region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0
    • We will find the next hop as the hamgrd service on DPU1, so we will forward the message there.
  • On 10.0.0.2, in HAMgrD swbus-api:
    • In the message filter layer, use the full service path to find the next hop: region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0/hascope/eni-0a1b2c3d4e5f6
    • We will find the message handler for ENI resource 0a1b2c3d4e5f6, so we will forward the message there.
  • Upon receiving the message, the receiver can optionally respond with an ACK message back to the sender after delivery, which goes through a similar routing process.

Network management

Connection store

The core of the network management is the connection store and route table inside the swbus-core.

Whenever a new connection is established, we will add it to the connection store:

  • Each connection will be a bi-directional gRPC stream.
  • A swbus connection will be created for each connection, which maintains:
    • The connection metadata
    • A worker for reading and writing messages
    • A proxy factory so that anyone can create a proxy to send messages to the connection

With this, we will update the route table so we can route messages to this connection.

Multiplexer and route table

The route table and message relay functionality is implemented by the SwbusMultiplexer in the swbus-core.

The route table is essentially a hash map with a string (service path) as the key and a next hop object as the value. The next hop object contains:

  • The connection proxy that can relay the message to the connection.
  • The hop count that will be used for route updates, so we only keep the shortest path in the route table.

```mermaid
classDiagram
    class ServiceHost

    class SwbusConnStore
    class SwbusConn
    class SwbusConnInfo
    class SwbusConnWorker
    class SwbusConnProxy

    class SwbusMultiplexer
    class SwbusNextHop

    ServiceHost "1" *-- "1" SwbusConnStore
    ServiceHost "1" *-- "1" SwbusMultiplexer

    SwbusConnStore "1" *-- "0..n" SwbusConn
    SwbusConn "1" *-- "1" SwbusConnWorker
    SwbusConn --> SwbusConnInfo

    SwbusMultiplexer "1" *-- "0..n" SwbusNextHop
    SwbusNextHop --> SwbusConnInfo
    SwbusNextHop "1" *-- "1" SwbusConnProxy
    SwbusConnProxy .. SwbusConnWorker : Queue message to worker\nvia mpsc channel
    SwbusConnWorker --> SwbusMultiplexer : Forward message to Mux\nfor message forwarding
```

The route table will be updated in several situations:

  • When a new connection gets established, the initial route entries will be added. Whether the connection is created locally or remotely, the hop count is considered to be 1.
  • Route registration query and response: Every second, swbusd sends a route registration query to all peers, and each peer responds with its full route table.
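The shortest-path rule above can be sketched as follows (illustrative Python; `update_route` and its data shapes are hypothetical, not the actual swbus-core API):

```python
def update_route(table: dict, path: str, next_hop: str, hop_count: int) -> None:
    """Install a route only if it is new or strictly shorter than the
    currently known one, so the table always holds the shortest path."""
    current = table.get(path)
    if current is None or hop_count < current[1]:
        table[path] = (next_hop, hop_count)


table = {}
# A 2-hop route learned from a peer's route table...
update_route(table, "region-a/cluster-a/10.0.0.2-dpu0", "conn-b", 2)
# ...is replaced when a direct 1-hop connection shows up.
update_route(table, "region-a/cluster-a/10.0.0.2-dpu0", "conn-c", 1)
print(table["region-a/cluster-a/10.0.0.2-dpu0"])  # ('conn-c', 1)
```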

Route scope and connection type

As we can see from the service path and message routing defined above, in swbusd a route / service path can be at the regional level or down to the service level, so not all routes need to be sent to every peer.

To control how widely a route will be broadcast, we first define a type for each connection, which describes the purpose or origin of the connection and determines what kind of routes it will receive:

  • Client: This connection comes from a CLI implementation.
  • Node: This connection is established for intra-node communication for all services on that node.
  • Cluster: This connection is established for cross-node communication within the same cluster.
  • Region: This connection is established for cross-cluster communication within the same region.
  • Global: This connection is established for cross-region communication.

When a route is added in swbusd, e.g., a new peer is connected or new routes are received from other peers, the route will be broadcast to the peers based on the route scope and connection type as below:

| Route Scope \ Peer Connection Type | Client | Node | Cluster | Region | Global |
| --- | --- | --- | --- | --- | --- |
| Client | No | No | No | No | No |
| Node | No | No | No | No | No |
| Cluster | No | No | Yes | No | No |
| Region | No | No | Yes | Yes | No |
| Global | No | No | Yes | Yes | Yes |
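The matrix above is small enough to transcribe directly into code; a sketch (illustrative Python, not the actual implementation):

```python
# Direct transcription of the broadcast matrix: route scope -> the set of
# peer connection types that should receive routes of that scope.
BROADCAST = {
    "client": set(),
    "node": set(),
    "cluster": {"cluster"},
    "region": {"cluster", "region"},
    "global": {"cluster", "region", "global"},
}


def should_broadcast(route_scope: str, peer_conn_type: str) -> bool:
    """Return True if a route with this scope goes to a peer of this type."""
    return peer_conn_type in BROADCAST[route_scope]


print(should_broadcast("region", "cluster"))  # True
print(should_broadcast("node", "cluster"))    # False
```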

Network setup

Initial network setup

When swbusd is launched, it reads the network configuration, such as the routes to announce and all the peers it needs to connect to.

Each peer will contain the following key information:

  • IP endpoint of the peer swbusd gRPC server
  • Service Path of the peer, with the region id, cluster id, and node id. No service locator or resource locator will be provided.

The configuration looks as below:

```yaml
# Routes that will be advertised to all peers
routes:
- key: region-a/cluster-a/10.0.0.1-dpu0
  scope: cluster
peers:
- id: region-a/cluster-a/10.0.0.2-dpu0
  endpoint: 10.0.0.2:8000
  type: cluster
- id: region-a/cluster-a/10.0.0.3-dpu0
  endpoint: 10.0.0.3:8000
  type: cluster
...
```

In our case, since all nodes are in the same cluster and all initial routes are cluster scope, the network will soon converge to the state below:

swbus-topo-route-only

Service locator announcement

Once the endpoints are connected and the initial routes are set up, all services will start to connect to swbusd and announce their service locators.

Since the network is structured, we don't need to announce the service locators to any peer, so all service locators can be added with Node scope. The end result is shown below:

swbus-topo-with-service

Resource location announcement

After a service is up, it will start to load its managed resources. Each resource can also be represented as a service path, but it only needs to exist in the swbus-api of the service itself. Hence, the resource locators are added as message filters in the swbus-api.

The end result will be the same as the example shown in the "Network topology" section.

Failure handling

Whenever a connection is broken, it will be detected by the SwbusConnWorker due to the gRPC stream being closed. The worker will then break out of its loop and unregister itself from the Multiplexer, which updates the route table accordingly.

TODO: Aggressive retry and backoff mechanism for swbusd-initiated connections.

Debug infra

To debug message routing issues, SWBus supports 2 types of messages: Ping and TraceRoute, which act similarly to ICMP ping and traceroute, tools frequently used to debug regular network issues.

Both messages are handled in the same way - whenever the Multiplexer or swbus-api receives a message, it checks for these infra messages and handles them accordingly.

Ping

  • A Ping message contains the same header as regular messages, which contains the source and destination service path as well as the TTL.
  • When a Ping message is received by the endpoint (swbus-api), it will respond with an ACK message, which serves as the Pong.
  • If the ttl of the Ping message reaches 0 in the multiplexer, it will respond with a SWBUS_ERROR_CODE_UNREACHABLE error.
  • If no route is found, it will respond with a SWBUS_ERROR_CODE_NO_ROUTE error.
  • Otherwise, Multiplexer will forward the message to the next hop just like the regular message.

Trace Route

  • A TraceRouteRequest message contains the same header as regular messages, which contains the source and destination service path as well as the TTL.
  • If the ttl of the TraceRouteRequest message reaches 0 in the multiplexer, it will respond with a SWBUS_ERROR_CODE_UNREACHABLE error.
  • If no route is found, it will respond with a SWBUS_ERROR_CODE_NO_ROUTE error.
  • Otherwise, the Multiplexer will first respond with a TraceRouteResponse message with the same trace id, then forward the message to the next hop just like a regular message.
  • When a TraceRouteRequest message is received by the endpoint (swbus-api), it will respond with an ACK message, which serves as the completion signal of the trace route.
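The per-hop rules for both message types can be summarized in one sketch (illustrative Python; the message and route representations here are hypothetical, not the actual wire format):

```python
def handle_infra_message(msg: dict, routes: dict, is_endpoint: bool):
    """Per-hop handling of Ping / TraceRouteRequest, following the rules above.

    `msg` is a dict with "type", "ttl", and "dest"; `routes` maps a
    destination to its next hop; `is_endpoint` marks the swbus-api side.
    """
    if is_endpoint:
        # The endpoint answers with an ACK (the "Pong" / completion signal).
        return "ACK"
    msg["ttl"] -= 1
    if msg["ttl"] <= 0:
        return "SWBUS_ERROR_CODE_UNREACHABLE"
    if msg["dest"] not in routes:
        return "SWBUS_ERROR_CODE_NO_ROUTE"
    if msg["type"] == "TraceRouteRequest":
        # Each hop first answers with a TraceRouteResponse, then forwards.
        return ("TraceRouteResponse", routes[msg["dest"]])
    return ("forward", routes[msg["dest"]])


print(handle_infra_message({"type": "Ping", "ttl": 1, "dest": "x"}, {}, False))
# SWBUS_ERROR_CODE_UNREACHABLE
print(handle_infra_message({"type": "Ping", "ttl": 8, "dest": "x"},
                           {"x": "hop1"}, False))
# ('forward', 'hop1')
```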