# SWBus (Switch Bus)
SWBus is a high-performance and scalable message channel for SONiC internal services. It is designed to provide an easy-to-use interface for intra- and inter-switch communication between the internal services.
At its core, it provides a mesh network between all switches and helps with message routing and forwarding between them.
SWBus is built on top of the gRPC framework and does not impose any limitation on the message serialization used for its payload.
SWBus is composed of the following components:

- **SWitch Bus Core** (`swbus-core`): The core implementation of SWBus. It provides the protocol, the connection management for the mesh network, and the message routing, but it does not provide the actual message handling, nor does it have any knowledge of whatever network we are trying to create.
- **SWitch Bus Daemon** (`swbusd`): The service that runs on each switch. It runs SWitch Bus Core inside, provides the gRPC interfaces, parses the network configuration files, and calls SWitch Bus Core to create the network.
- **SWitch Bus API** (`swbus-api`): The API for any service that wants to use SWitch Bus to communicate with other services. It is essentially a wrapper around the gRPC interfaces provided by SWitch Bus Daemon. It also provides a message filter layer for:
  - Local message routing, so that only the messages sent to other services are forwarded to the SWitch Bus Daemon.
  - Incoming message dispatching, so it can forward messages to the correct message handler.
The three components are designed to be used together as the graph below shows:
```mermaid
graph TB
subgraph NodeA
subgraph ServiceA
service-logic-a0("Service Logic")
service-logic-a1("Service Logic")
swbus-api-a("Switch Bus API (swbus-api)")
service-logic-a0 <--> swbus-api-a
service-logic-a1 <--> swbus-api-a
end
subgraph ServiceB
service-logic-b("Service Logic")
swbus-api-b("Switch Bus API (swbus-api)")
service-logic-b <--> swbus-api-b
end
subgraph swbusd-a["Switch Bus Daemon (swbusd)"]
grpc-server-a("gRPC Server")
swbus-core-a
swbus-api-a <--> grpc-server-a
swbus-api-b <--> grpc-server-a
grpc-server-a <--> swbus-core-a
end
end
subgraph NodeB
subgraph ServiceC
service-logic-c0("Service Logic")
service-logic-c1("Service Logic")
swbus-api-c("Switch Bus API (swbus-api)")
service-logic-c0 <--> swbus-api-c
service-logic-c1 <--> swbus-api-c
end
subgraph swbusd-c["Switch Bus Daemon (swbusd)"]
grpc-server-c("gRPC Server")
swbus-core-c
swbus-api-c <--> grpc-server-c
grpc-server-c <--> swbus-core-c
end
end
swbus-core-a <--> swbus-core-c
```
In the SWBus network, each piece of service logic (not each service) is defined as an endpoint of the network, i.e., an entity that sends and receives messages. Each endpoint has its own unique address in the system, called a Service Path. Its first three fields form the Node Locator, the next two the Service Locator, and the last two the Resource Locator:

| Region ID | Cluster ID | Node ID | Service Type | Service ID | Resource Type | Resource ID |
|---|---|---|---|---|---|---|
| region-a | switch-cluster-a | 10.0.0.1-dpu0 | hamgrd | 0 | hascope | eni-0a1b2c3d4e5f6 |
The service path can be represented as a string: `region-a/switch-cluster-a/10.0.0.1-dpu0/hamgrd/0/hascope/eni-0a1b2c3d4e5f6`.
We can also add dedicated message filters in `swbus-api` to implement endpoints that are only available within the service itself, such as a Redis DB. This simplifies and unifies communication with other existing services that use the swss-common Redis communication channel. For example:

| Region ID | Cluster ID | Node ID | Service Type | Service ID | Resource Type | Resource ID |
|---|---|---|---|---|---|---|
| region-b | switch-cluster-b | 10.0.0.2-dpu1 | redis | APPL.SOME_TABLE | data | some_key:some_subkey |

This service path can be represented as the string: `region-b/switch-cluster-b/10.0.0.2-dpu1/redis/APPL.SOME_TABLE/data/some_key:some_subkey`.
In the network request, the service path is defined in the protobuf message as below:

```protobuf
message ServicePath {
  // Server location
  string region_id = 10;
  string cluster_id = 20;
  string node_id = 30;

  // Service info
  string service_type = 110;
  string service_id = 120;

  // Resource info
  string resource_type = 210;
  string resource_id = 220;
}
```
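As an illustration of how these fields map to the string form, here is a minimal sketch in Rust; the struct below is hypothetical and only mirrors the protobuf fields above, not the actual generated types in swbus-core.

```rust
use std::fmt;

// Hypothetical mirror of the ServicePath protobuf message above.
struct ServicePath {
    region_id: String,
    cluster_id: String,
    node_id: String,
    service_type: String,
    service_id: String,
    resource_type: String,
    resource_id: String,
}

impl fmt::Display for ServicePath {
    // Renders e.g. "region-a/switch-cluster-a/10.0.0.1-dpu0/hamgrd/0/hascope/eni-0a1b2c3d4e5f6".
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{}/{}/{}/{}/{}/{}/{}",
            self.region_id, self.cluster_id, self.node_id,
            self.service_type, self.service_id,
            self.resource_type, self.resource_id
        )
    }
}
```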
Once the network is set up, it forms a mesh between all the switches. Each switch maintains a list of routes pointing to the other switches, so that messages can be routed to the correct destination.
As an example, the network topology is shown below.
All messages in SWBus are unicast messages and are routed based on the service path.
A message is routed in a way similar to longest-prefix matching (a lookup sketch follows the list):

- In `swbus-api`:
  - Use the full service path to find an exact match. If there is a match, route the message there.
  - If not, try again with only the service location to find a match.
  - If not, forward the message to the SWBus Daemon to find a match.
- In `swbusd`:
  - Use the service location to find a match.
  - If not, try again with only the region id, cluster id, and node id.
  - If not, try again with only the region id and cluster id.
  - If not, try again with only the region id.
  - If still no match is found, return a NO_ROUTE error.
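As a sketch, the fallback lookup in `swbusd` can be thought of as trying successively shorter prefixes of the destination path; the types and function below are illustrative, not the actual swbus-core API.

```rust
use std::collections::HashMap;

// Illustrative stand-in for the real next-hop type.
struct NextHop;

/// Try successively shorter prefixes of the destination service path:
/// service location (5 segments), then node (3), cluster (2), region (1).
fn find_next_hop<'a>(
    routes: &'a HashMap<String, NextHop>,
    dest: &str,
) -> Option<&'a NextHop> {
    let parts: Vec<&str> = dest.split('/').collect();
    for len in [5, 3, 2, 1] {
        if parts.len() >= len {
            if let Some(hop) = routes.get(&parts[..len].join("/")) {
                return Some(hop);
            }
        }
    }
    None // caller maps this to a NO_ROUTE error
}
```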
With this architecture, let's walk through how a message gets sent and handled.
Say we have two services in the network, HAMgrD on DPU0 and HAMgrD on DPU1, and they would like to communicate with each other about the same ENI resource, 0a1b2c3d4e5f6:

- Sender: `region-a/switch-cluster-a/10.0.0.1-dpu0/hamgrd/0/hascope/eni-0a1b2c3d4e5f6`
- Receiver: `region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0/hascope/eni-0a1b2c3d4e5f6`
The life of a packet is as follows:

1. HAMgrD on DPU0 sends a message to the receiver with:
   - Destination = `region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0/hascope/eni-0a1b2c3d4e5f6`
   - Source = `region-a/switch-cluster-a/10.0.0.1-dpu0/hamgrd/0/hascope/eni-0a1b2c3d4e5f6`
2. In `swbus-api`:
   - First, use the full service path to find the next hop: `region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0/hascope/eni-0a1b2c3d4e5f6`.
   - Since it is not found, use the service location to find the next hop: `region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0`.
   - Still not found, so forward the message to the SWBus Daemon to find the next hop.
3. In `swbusd`:
   - Use the service location to find the next hop: `region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0`.
   - Since it is not found, use the node id to find the next hop: `region-a/switch-cluster-a/10.0.0.2-dpu1`.
   - This time, the next hop is found as a gRPC connection to the `swbusd` running on 10.0.0.2 for DPU1, so the message is forwarded there.
4. On 10.0.0.2, in `swbusd`:
   - Use the service location to find the next hop: `region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0`.
   - The next hop is found as the hamgrd service on DPU1, so the message is forwarded there.
5. On 10.0.0.2, in HAMgrD's `swbus-api`:
   - In the message filter layer, use the full service path to find the next hop: `region-a/switch-cluster-a/10.0.0.2-dpu1/hamgrd/0/hascope/eni-0a1b2c3d4e5f6`.
   - The message handler for ENI resource 0a1b2c3d4e5f6 is found, so the message is delivered there.
6. Upon receiving the message, the receiver can optionally respond with an ACK message back to the sender, which goes through a similar routing process.
The core of the network management is the connection store and the route table inside `swbus-core`.

Whenever a new connection is established, it is added to the connection store:

- Each connection is a bi-directional gRPC stream.
- A swbus connection object is created for each connection, which maintains:
  - The connection metadata.
  - A worker for reading and writing messages.
  - A proxy factory, so that anyone can create a proxy to send messages to the connection.

With this, the route table is updated so messages can be routed to this connection.
The route table and message relay functionality are implemented by the `SwbusMultiplexer` in `swbus-core`.
The route table is essentially a hash map with a string (the service path) as the key and a next-hop object as the value. The next-hop object contains (see the sketch after this list):

- The connection proxy that can relay the message to the connection.
- The hop count, used during route updates so that only the shortest path is kept in the route table.
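A simplified sketch of that shape, with illustrative names (the real `SwbusMultiplexer` is more involved):

```rust
use std::collections::HashMap;

// Illustrative stand-in for the connection proxy.
struct SwbusConnProxy;

// Next-hop entry: the proxy to relay through, plus the hop count used
// to keep only the shortest path during route updates.
struct SwbusNextHop {
    conn_proxy: SwbusConnProxy,
    hop_count: u32,
}

struct RouteTable {
    routes: HashMap<String, SwbusNextHop>, // key: service path string
}

impl RouteTable {
    /// Install a route only if it is new or shorter than the one we hold.
    fn update(&mut self, path: String, candidate: SwbusNextHop) {
        match self.routes.get(&path) {
            Some(existing) if existing.hop_count <= candidate.hop_count => (),
            _ => {
                self.routes.insert(path, candidate);
            }
        }
    }
}
```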
```mermaid
classDiagram
class ServiceHost
class SwbusConnStore
class SwbusConn
class SwbusConnInfo
class SwbusConnWorker
class SwbusConnProxy
class SwbusMultiplexer
class SwbusNextHop
ServiceHost "1" *-- "1" SwbusConnStore
ServiceHost "1" *-- "1" SwbusMultiplexer
SwbusConnStore "1" *-- "0..n" SwbusConn
SwbusConn "1" *-- "1" SwbusConnWorker
SwbusConn --> SwbusConnInfo
SwbusMultiplexer "1" *-- "0..n" SwbusNextHop
SwbusNextHop --> SwbusConnInfo
SwbusNextHop "1" *-- "1" SwbusConnProxy
SwbusConnProxy .. SwbusConnWorker : Queue message to worker\nvia mpsc channel
SwbusConnWorker --> SwbusMultiplexer : Forward message to Mux\nfor message forwarding
```
The route table is updated in several situations:

- When a new connection gets established, the initial route entries are added. Whether the connection was created locally or remotely, its hop count is considered to be 1.
- Route registration query and response: every 1s, `swbusd` sends a route registration query to all peers (a sketch follows this list), and each peer responds with its route table in full.
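A hedged sketch of this periodic exchange using tokio; the proxy type and query function are placeholders, not the actual swbusd internals.

```rust
use std::time::Duration;

// Placeholder for a handle that can reach one peer's gRPC stream.
struct PeerProxy;

async fn send_route_query(_peer: &PeerProxy) {
    // In the real daemon this would queue a route registration query
    // onto the peer's connection worker.
}

/// Every second, ask each peer for its route table in full.
async fn route_sync_loop(peers: Vec<PeerProxy>) {
    let mut ticker = tokio::time::interval(Duration::from_secs(1));
    loop {
        ticker.tick().await;
        for peer in &peers {
            send_route_query(peer).await;
        }
    }
}
```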
As the service path and message routing definitions above show, a route (service path) in `swbusd` can sit at the region level or all the way down to the service level, so not all routes need to be sent to every peer.
To control how widely a route is broadcast, we first define a type for each connection, which describes the purpose or origin of the connection and determines what kind of routes it will receive:

- Client: The connection comes from a CLI implementation.
- Node: The connection is established for intra-node communication for all services on that node.
- Cluster: The connection is established for cross-node communication within the same cluster.
- Region: The connection is established for cross-cluster communication within the same region.
- Global: The connection is established for cross-region communication.
When a route is added in `swbusd`, e.g., when a new peer connects or new routes are received from other peers, the route is broadcast to each peer based on the route scope and connection type, as shown in the table below and the predicate sketched after it:

| Route Scope \ Peer Connection Type | Client | Node | Cluster | Region | Global |
|---|---|---|---|---|---|
| Client | No | No | No | No | No |
| Node | No | No | No | No | No |
| Cluster | No | No | Yes | No | No |
| Region | No | No | Yes | Yes | No |
| Global | No | No | Yes | Yes | Yes |
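The table translates directly into a small predicate; a sketch (the enum ordering follows the scope hierarchy above):

```rust
// Ordered from narrowest to widest scope; the derived ordering follows
// the declaration order.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Scope {
    Client,
    Node,
    Cluster,
    Region,
    Global,
}

/// Returns whether a route of `route_scope` is broadcast to a peer whose
/// connection is of `conn_type`, matching the table above.
fn should_broadcast(route_scope: Scope, conn_type: Scope) -> bool {
    // Client- and Node-scoped routes never propagate, and nothing is
    // broadcast over Client or Node connections.
    if route_scope <= Scope::Node || conn_type <= Scope::Node {
        return false;
    }
    // A route reaches peers whose connection type is at or below its scope.
    conn_type <= route_scope
}
```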
When `swbusd` is launched, it reads the network configuration, such as the routes that need to be announced and all the peers it needs to connect to.
Each peer entry contains the following key information:

- The IP endpoint of the peer's `swbusd` gRPC server.
- The Service Path of the peer, with the region id, cluster id, and node id. No service locator or resource locator is provided.
The configuration looks as below:

```yaml
# Routes that will be advertised to all peers
routes:
  - key: region-a/cluster-a/10.0.0.1-dpu0
    scope: cluster

peers:
  - id: region-a/cluster-a/10.0.0.2-dpu0
    endpoint: 10.0.0.2:8000
    type: cluster
  - id: region-a/cluster-a/10.0.0.3-dpu0
    endpoint: 10.0.0.3:8000
    type: cluster
  ...
```
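A sketch of deserializing this file with serde; the struct and field names follow the YAML above but are otherwise assumptions, not the actual swbusd config types.

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct SwbusConfig {
    routes: Vec<RouteConfig>,
    peers: Vec<PeerConfig>,
}

#[derive(Deserialize)]
struct RouteConfig {
    key: String,   // e.g. "region-a/cluster-a/10.0.0.1-dpu0"
    scope: String, // e.g. "cluster"
}

#[derive(Deserialize)]
struct PeerConfig {
    id: String,       // peer service path, down to the node id
    endpoint: String, // IP endpoint of the peer's swbusd gRPC server
    #[serde(rename = "type")]
    conn_type: String, // connection type, e.g. "cluster"
}

fn load_config(yaml: &str) -> Result<SwbusConfig, serde_yaml::Error> {
    serde_yaml::from_str(yaml)
}
```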
In our case, since all nodes are in the same cluster and all initial routes are cluster-scoped, the network will soon converge into the state below:
Once the endpoints are connected and the initial routes are set up, all services start to connect to `swbusd` and announce their service locators.
Since the network is structured, we don't need to announce the service locators to any peers; hence, all service locators can be added with `Node` scope. The end result is shown below:
After a service is up, it starts to load its managed resources. Each resource can also be represented as a service path, but it only needs to exist in the `swbus-api` of the service itself. Hence, the resource locators are added as message filters in `swbus-api`.
The end result is the same as the example shown in the "Network topology" section.
Whenever a connection is broken, it is detected by the `SwbusConnWorker` via the gRPC stream being closed. The worker then breaks out of its worker loop and unregisters itself from the Multiplexer, which updates the route table accordingly.

TODO: Aggressive retry and backoff mechanism for `swbusd`-initiated connections.
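For the TODO above, one possible shape for the backoff policy, shown purely as an illustration of the intent (retry aggressively at first, then back off to a cap):

```rust
use std::time::Duration;

/// Purely illustrative: exponential backoff with a 30s cap for
/// swbusd-initiated reconnects.
fn backoff_delay(attempt: u32) -> Duration {
    const BASE_MS: u64 = 100;
    const CAP_MS: u64 = 30_000;
    let shift = attempt.min(16); // keep the shift well within u64 range
    Duration::from_millis((BASE_MS << shift).min(CAP_MS))
}
```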
To debug message routing issues, SWBus supports two types of messages, `Ping` and `TraceRoute`, which act like the ICMP ping and traceroute tools frequently used to debug regular network issues.
Both messages are handled in the same way: whenever the Multiplexer or `swbus-api` receives a message, it checks for these infra messages and handles them accordingly.
For `Ping`:

- A `Ping` message contains the same header as regular messages, which carries the source and destination service paths as well as the TTL.
- When a `Ping` message is received by the endpoint (`swbus-api`), it responds with an ACK message, which serves as the `Pong`.
- If the TTL of the `Ping` message reaches 0 in the Multiplexer, it responds with a `SWBUS_ERROR_CODE_UNREACHABLE` error.
- If no route is found, it responds with a `SWBUS_ERROR_CODE_NO_ROUTE` error.
- Otherwise, the Multiplexer forwards the message to the next hop just like a regular message.
For `TraceRoute`:

- A `TraceRouteRequest` message contains the same header as regular messages, which carries the source and destination service paths as well as the TTL.
- If the TTL of the `TraceRouteRequest` message reaches 0 in the Multiplexer, it responds with a `SWBUS_ERROR_CODE_UNREACHABLE` error.
- If no route is found, it responds with a `SWBUS_ERROR_CODE_NO_ROUTE` error.
- Otherwise, the Multiplexer first responds with a `TraceRouteResponse` message carrying the same trace id, then forwards the message to the next hop just like a regular message.
- When a `TraceRouteRequest` message is received by the endpoint (`swbus-api`), it responds with an ACK message, which serves as the completion signal of the trace route.
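The common TTL and routing checks in the Multiplexer for both message types can be sketched as below; the error enum simply mirrors the error codes named above, and the surrounding types are illustrative.

```rust
// Mirrors the error codes named above.
enum SwbusError {
    Unreachable, // SWBUS_ERROR_CODE_UNREACHABLE
    NoRoute,     // SWBUS_ERROR_CODE_NO_ROUTE
}

struct NextHop; // illustrative stand-in

/// Shared checks for Ping and TraceRouteRequest in the Multiplexer:
/// check the TTL, then the route lookup, then forward like any message.
fn check_infra_message(
    ttl: u32,
    next_hop: Option<&NextHop>,
) -> Result<&NextHop, SwbusError> {
    if ttl == 0 {
        return Err(SwbusError::Unreachable);
    }
    next_hop.ok_or(SwbusError::NoRoute)
}
```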