Active-Active Dual ToR
Active-active dual ToR link manager is an evolution of the active-standby dual ToR link manager. Both ToRs are expected to handle traffic in normal scenarios. For consistency, we will keep using the term "standby" to refer to inactive links or ToRs.
Revision
Rev | Date | Author | Change Description |
---|---|---|---|
0.1 | 05/23/22 | Jing Zhang | Initial version |
0.2 | 12/02/22 | Longxiang Lyu | Add Traffic Forwarding section |
0.3 | 12/08/22 | Longxiang Lyu | Add BGP update delay section |
0.4 | 12/13/22 | Longxiang Lyu | Add skip ACL section |
0.5 | 04/10/23 | Longxiang Lyu | Add command line section |
Scope
This document provides the high-level design of the SONiC dual ToR solution supporting the active-active setup.
1 Cluster Topology
A row contains a number of racks; each rack has 2 ToRs, and each row has 8 Tier 1 (T1) network devices. Each server has a NIC connected to both ToRs with 100 Gbps DAC cables.
In this design:
- Both the upper ToR (labeled as UT0) and the lower ToR (labeled as LT0) will advertise the same IP to upstream T1s; each T1 will see 2 available next hops for the VLAN.
- Both UT0 and LT0 are expected to carry traffic in normal scenarios.
- The software stack on the server host will see a 200 Gbps NIC.
2 Requirement Overview
2.1 Server Requirements
In our cluster setup, since the smart Y-cable is replaced, some complexity is transferred to the server NIC.
Note that this complexity can be handled by active-active smart cables, or any other deployment, as long as it meets the requirements below.
- Server NIC is responsible for delivering southbound (tier 0 device to server) traffic from either uplink to applications running on the server host.
  - ToRs present the same IP and the same MAC to the server on both links.
- Server NIC is responsible for dispensing northbound (server to tier 0) traffic between the two active links at the IO stream (5-tuple) level. Each stream will be dispatched to one of the 2 uplinks until a link state changes.
- Server should provide support for the ToR to control traffic forwarding, and follow this control when dispensing traffic.
  - gRPC is introduced for this requirement.
  - Each ToR will have a well-known IP. The server NIC should dispatch gRPC replies towards these IPs to the corresponding uplinks.
- Server NIC should avoid sending traffic through unhealthy links when detecting a link state down.
- Server should replicate the following northbound traffic to both ToRs:
  - Specified ICMP replies (for probing link health status)
  - ARP propagation
  - IPv6 router solicitation, neighbor solicitation and neighbor advertisements

See the sketch below for an illustration of the IO scheduling contract.
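As a rough illustration of the IO scheduling contract implied by the requirements above, here is a minimal Python sketch; all names (Uplink, pick_uplink, etc.) are hypothetical and do not correspond to an actual NIC implementation.

```python
# Hypothetical sketch of the NIC-side IO scheduling contract described above.
# Names are illustrative only; the real NIC/SoC logic is implementation specific.
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class Uplink:
    name: str
    admin_state: str   # "active" / "standby", set by the ToR over gRPC
    oper_up: bool      # link state detected locally by the NIC

    def usable(self) -> bool:
        # Never pick a standby link; never pick a link that is operationally down.
        return self.admin_state == "active" and self.oper_up

def pick_uplink(five_tuple: tuple, uplinks: list) -> Optional[Uplink]:
    """Dispatch one IO stream (5-tuple) to a single uplink until link state changes."""
    usable = [u for u in uplinks if u.usable()]
    if not usable:
        return None  # no healthy active uplink: northbound traffic cannot be sent
    # A stable hash keeps a stream pinned to one uplink while the usable set is unchanged.
    digest = hashlib.md5(repr(five_tuple).encode()).digest()
    return usable[digest[0] % len(usable)]

def should_replicate_to_both(pkt_kind: str) -> bool:
    # Specified ICMP replies, ARP, and IPv6 RS/NS/NA are replicated to both ToRs.
    return pkt_kind in {"icmp_heartbeat_reply", "arp", "ipv6_nd"}
```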
2.2 SONiC Requirements
- Introduce active-active mode into the MUX state machine.
- Probe to determine whether a link is healthy or not.
- Signal the NIC when the ToR is switching to active or standby.
- Rescue when a peer ToR failure occurs.
- Unblock traffic when the cable control channel is unreachable.
3 SONiC ToR Controlled Solution
3.1 IP Routing
3.1.1 Normal Scenario
Both T0s are up and functioning, and both of the server NIC connections are up and functioning.
- Control Plane: UT0 and LT0 will advertise the same VLAN (IPv4 and IPv6) to upstream T1s. Each T1 will see 2 available next hops for the VLAN. T1s advertise to T2 as normal.
- Data Plane:
  - Traffic to the server
  - Traffic from the server to outside the cluster
  - Traffic from the server to within the cluster
3.1.2 Server Uplink Issue
Both T0s are up and functioning, and some server NICs are only connected to 1 ToR (due to a cable issue, or the cable is taken out for maintenance).
- Control Plane: No change from the normal case.
- Data Plane
3.1.3 ToR Failure
Only 1 T0 is up and functioning.
- Control Plane: Only 1 T0 will advertise the VLAN (IPv4 and IPv6) to upstream T1s.
- Data Plane
3.1.4 Comparison to Active-Standby
Highlights of the commonalities and differences with active-standby:
Aspect | Active-Standby | Active-Active | Implication |
---|---|---|---|
Server uplink view | Single IP, single MAC | Single IP, single MAC | |
Standby side receives traffic | Forwards it to the active ToR through an IPinIP tunnel via T1 | Forwards it to the active ToR through an IPinIP tunnel via T1 | |
T0 to T1 control plane | Advertise the same set of routes | Advertise the same set of routes | |
T1 to T0 traffic | ECMP | ECMP | |
Southbound traffic | From either side | From either side | |
Northbound traffic | All is duplicated to both ToRs | NIC determines which side to forward the traffic | Orchagent doesn't need to drop packets on the standby side |
Bandwidth | Up to 1 link | Up to 2 links | T1 and above devices see more throughput from the server |
Cable control | I2C | gRPC over DAC cables | Control plane and data plane now share the same link |
3.2 DB Schema Changes
3.2.1 Config DB
- New field in MUX_CABLE table to determine cable type

  MUX_CABLE|PORTNAME:
      cable_type: active-standby|active-active
3.2.2 App DB
- New table to invoke the transceiver daemon to query the server side forwarding state

  FORWARDING_STATE_COMMAND|PORTNAME:
      command: probe|set_active_self|set_standby_self|set_standby_peer

  FORWARDING_STATE_RESPONSE|PORTNAME:
      response: active|standby|unknown|error
      response_peer: active|standby|unknown|error

- New table for the transceiver daemon to write the peer link state to linkmgrd

  PORT_TABLE_PEER|PORTNAME:
      oper_status: up|down

- New table to invoke the transceiver daemon to set the peer's server side forwarding state

  HW_FORWARDING_STATE_PEER|PORTNAME:
      state: active|standby|unknown
3.2.3 State DB
- New table for the transceiver daemon to write the peer's server side forwarding state to linkmgrd

  HW_MUX_CABLE_TABLE_PEER|PORTNAME:
      state: active|standby|unknown
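For illustration only, these tables can be inspected or exercised with sonic-db-cli using the schema written above; the port name and values below are hypothetical, and in practice these entries are written by minigraph-generated configuration, linkmgrd, and the transceiver daemon rather than by hand.

$ sonic-db-cli CONFIG_DB HSET "MUX_CABLE|Ethernet4" cable_type active-active
$ sonic-db-cli APPL_DB HSET "FORWARDING_STATE_COMMAND|Ethernet4" command probe
$ sonic-db-cli APPL_DB HGETALL "FORWARDING_STATE_RESPONSE|Ethernet4"
$ sonic-db-cli STATE_DB HGETALL "HW_MUX_CABLE_TABLE_PEER|Ethernet4"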
3.3 Linkmgrd
Linkmgrd determines a ToR's / link's readiness for use.
3.3.1 Link Prober
Linkmgrd will keep the link prober design from active-standby mode for monitoring link health status. The link prober will send ICMP packets and listen for ICMP response packets. The ICMP packets will contain payload information about the ToR. ICMP replies will be duplicated to both ToRs by the server, hence a ToR can monitor the health status of its peer ToR as well.
The link prober will report 4 possible states:
- LinkProberUnknown: Serves as the initial state. This state is also reached when no ICMP reply is received.
- LinkProberActive: Indicates that LinkMgr receives ICMP replies containing the ID of the current ToR.
- LinkProberPeerUnknown: Indicates that LinkMgr did not receive ICMP replies containing the ID of the peer ToR. Hence, there is a chance that the peer ToR's link is currently down.
- LinkProberPeerActive: Indicates that LinkMgr receives ICMP replies containing the ID of the peer ToR, or in other words, the peer ToR's links appear to be active.
By default, the heartbeat probing interval is 100 ms. It takes 3 lost link prober packets to determine that a link is unhealthy. A server issue can also cause link prober packet loss, but the ToR won't distinguish it from a link issue.
ICMP Probing Format
The source MAC will be the ToR's SVI MAC address, and the Ethernet destination will be the well-known MAC address. The source IP will be the ToR's Loopback IP, and the destination IP will be the SoC's IP address, which will be introduced as a field in the minigraph.
Linkmgrd also adopts TLV (Type-Length-Value) as the encoding scheme in the payload for additional information elements, including cookie, version, ToR GUID, etc.
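As an illustration of such a TLV layout, here is a small Python sketch; the type codes and field widths are hypothetical and do not reflect the actual linkmgrd wire format.

```python
# Hypothetical TLV encoding of the ICMP heartbeat payload: each information
# element is packed as a 1-byte type, a 2-byte length, and the value bytes.
import struct

TLV_TYPE_COOKIE = 1     # illustrative type codes only
TLV_TYPE_VERSION = 2
TLV_TYPE_TOR_GUID = 3

def pack_tlv(tlv_type: int, value: bytes) -> bytes:
    return struct.pack("!BH", tlv_type, len(value)) + value

def build_heartbeat_payload(cookie: int, version: int, tor_guid: bytes) -> bytes:
    return (pack_tlv(TLV_TYPE_COOKIE, struct.pack("!I", cookie))
            + pack_tlv(TLV_TYPE_VERSION, struct.pack("!H", version))
            + pack_tlv(TLV_TYPE_TOR_GUID, tor_guid))

def parse_tlvs(payload: bytes) -> dict:
    """Walk the TLVs so that unknown types can be skipped for forward compatibility."""
    tlvs, offset = {}, 0
    while offset + 3 <= len(payload):
        tlv_type, length = struct.unpack_from("!BH", payload, offset)
        offset += 3
        tlvs[tlv_type] = payload[offset:offset + length]
        offset += length
    return tlvs
```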
3.3.2 Link State
When a link goes down, linkmgrd will receive a notification from SWSS based on the kernel netlink message. This notification will be used to determine whether the ToR is healthy.
3.3.3 Forwarding State
Admin Forwarding State: ToRs will signal the NIC whether the link is active / standby; we call this active / standby state the admin forwarding state. It's up to the NIC to determine which link to use if both are active, but it should never choose to use a standby link. This logic gives the ToR more control over traffic forwarding.
Operational Forwarding State: The server side should maintain an operational forwarding state as well. When a link is down, the admin forwarding state will eventually be updated to standby. But before that, if the server side detects the link down, it should stop sending traffic through this link even though the admin state is active. In this way, we ensure the ToRs have control over traffic forwarding, and also guarantee an immediate reaction when the link state is down.
3.3.4 Active-Active State Machine
Active-active state transition logic is simplified compared to active-standby. In active-standby, linkmgrd makes mux toggle decisions based on the y-cable direction, while in active-active the two links are more independent. Linkmgrd will only make state transition decisions based on health indicators.
To be more specific, if the link prober indicates active AND the link state appears to be up, linkmgrd should determine the link's forwarding state as active; otherwise, it should be standby.
Linkmgrd also provides a rescue mechanism for when the peer can't switch to standby for some reason, e.g. link failures. If the link prober doesn't receive the peer's heartbeat response AND the self ToR is in a healthy active state, linkmgrd should determine the peer link to be standby.
When the control channel is unreachable, the ToR won't block traffic forwarding, but it will periodically check the gRPC server's health. It will make sure the server side's admin forwarding state aligns with linkmgrd's decision.
3.3.5 Default route to T1
If the default route to T1 is missing, the dual ToR system can suffer from northbound packet loss, hence linkmgrd also monitors the default route state. If the default route is missing, linkmgrd will stop sending ICMP probing requests and fake an unhealthy status. This functionality can be disabled as well; the details are included in default_route.
The state transition decisions described above, and the corresponding gRPC actions to take, can be summarized as follows:
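The sketch below illustrates the decision logic described in 3.3.4 and 3.3.5; function and state names are hypothetical, and the actual linkmgrd decision table covers more cases.

```python
# Hypothetical sketch of the linkmgrd active-active decision logic described in
# 3.3.4 and 3.3.5. Names are illustrative only, not the actual linkmgrd code.

def decide_self_state(link_prober: str, link_state_up: bool, default_route_ok: bool) -> str:
    """Decide this ToR's forwarding state for a port."""
    if not default_route_ok:
        # A missing default route to T1 is treated as unhealthy (probing is paused).
        return "standby"
    if link_prober == "LinkProberActive" and link_state_up:
        return "active"
    return "standby"

def decide_peer_action(peer_prober: str, self_state: str, self_healthy: bool):
    """Rescue: force the peer to standby when its heartbeats stop and we are healthy active."""
    if peer_prober == "LinkProberPeerUnknown" and self_state == "active" and self_healthy:
        return "set_standby_peer"   # issued to the SoC gRPC server via the transceiver daemon
    return None
```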
3.3.6 Incremental Features
- Link Prober Packet Loss Statistics: The link prober will by default send a heartbeat packet every 100 ms, so the packet loss statistics can be a good measurement of system health. An incremental feature is to collect the packet loss counts, start time, and end time. The collected data is stored and updated in the state DB. Users can check and reset it through the CLI.
- Support for Detachment: Users can configure linkmgrd to a certain mode so it won't switch to active / standby based on health indicators. Users can also configure linkmgrd to a mode so it won't modify the peer's forwarding state. This support will be useful for maintenance, upgrade, and testing scenarios.
3.4 Orchagent
3.4.1 IPinIP tunnel
Orchagent will create the tunnel at initialization and add / remove routes to forward traffic to the peer ToR via this tunnel when linkmgrd switches the state to standby / active.
Check below for an example of the config DB entry and the tunnel utilization when LT0's link is having an issue.
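As a hypothetical sketch only, the config DB entries for such a tunnel could look like the following (the field names follow the common SONiC dual-ToR tunnel schema; the addresses and hostname are made up, and the exact fields may differ by platform):

  TUNNEL|MuxTunnel0:
      tunnel_type: IPINIP
      dst_ip: 10.1.0.32
      dscp_mode: uniform
      ecn_mode: copy_from_outer
      ttl_mode: pipe

  PEER_SWITCH|peer-tor-hostname:
      address_ipv4: 10.1.0.33

When linkmgrd moves a port to standby, MuxOrch programs the server's routes with this tunnel as the nexthop so that traffic reaches the peer ToR via the T1s.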
3.4.2 Flow Diagram and Orch Components
Major components of Orchagent for this IPinIP tunnel are MuxCfgOrch, TunnelOrch, and MuxOrch.
- MuxCfgOrch: Listens to config DB entries to populate the port to server IP mapping to MuxOrch.
- TunnelOrch: Subscribes to the MUX_TUNNEL table and creates the tunnel, tunnel termination, and decap entry. This tunnel object is created at initialization and is used as the nexthop object by MuxOrch for programming routes via SAI_NEXT_HOP_TYPE_TUNNEL_ENCAP.
- MuxOrch: Listens to state changes from linkmgrd and does the following at a high level:
  - Enable / disable the neighbor entry.
  - Add / remove tunnel routes.
3.5 Transceiver Daemon
3.5.1 Cable Control through gRPC
In the active-active design, we will use gRPC to do the cable control and signal the NIC whether the ToR is active. The SoC will run a gRPC server. Linkmgrd will determine the server side forwarding state based on the link prober status and the link state. Then linkmgrd can invoke the transceiver daemon to update the NIC on whether the ToRs are active or not through gRPC calls.
The currently defined gRPC services between the SoC and the ToRs related to linkmgrd cable control are listed below (see the sketch after this list):
- DualToRActive
- GracefulRestart
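A hypothetical protobuf-style sketch of these services is shown below; the RPC and message names are illustrative only and are not the actual SONiC proto definitions.

```protobuf
// Illustrative sketch only -- not the actual SONiC proto file.
syntax = "proto3";

message AdminRequest {
    repeated int32 portid = 1;  // which uplink(s) the request refers to
    repeated bool state = 2;    // desired admin forwarding state (true = active)
}

message AdminReply {
    repeated int32 portid = 1;
    repeated bool state = 2;    // admin forwarding state as seen by the SoC
}

service DualToRActive {
    // Linkmgrd (via the transceiver daemon) queries or sets the server-side
    // admin forwarding state of its own link or of the peer link.
    rpc QueryAdminForwardingPortState(AdminRequest) returns (AdminReply) {}
    rpc SetAdminForwardingPortState(AdminRequest) returns (AdminReply) {}
}

service GracefulRestart {
    // The SoC notifies the ToR about ongoing server servicing (see 3.8.2).
    rpc NotifyGracefulRestart(AdminRequest) returns (AdminReply) {}
}
```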
3.6 State Transition Flow
The following UML sequence illustrates the state transition when the linkmgrd state moves to active. The flow will be similar for moving to standby.
3.7 Traffic Forwarding
The following shows the traffic forwarding behaviors when:
- Both ToRs are active.
- One ToR is active while the other ToR is standby.
3.7.1 Special Cases of Traffic Forwarding
3.7.1.1 gRPC Traffic to the NIC IP
Consider the scenario where the upper ToR enters standby when its peer (the lower ToR) is already in the standby state: all downstream I/O from the upper ToR will be forwarded through the tunnel to the peer ToR (the lower ToR), and so will the control plane gRPC traffic from the transceiver daemon. As the lower ToR is in standby, the tunneled I/O will be blackholed, and the NIC will never know that the upper ToR has entered standby in this case.
To solve this issue, we want the control plane gRPC traffic from the transceiver daemon to be forwarded directly via the local interfaces. This differentiates the control plane traffic to the NIC IPs from dataplane traffic, whose forwarding behavior honors the mux state and is forwarded to the active peer ToR via the tunnel when the port goes to standby.
The following shows the traffic forwarding behavior when the lower ToR is active while the upper ToR is standby. Now, gRPC traffic from the standby ToR (the upper ToR) is forwarded to the NIC directly. The downstream dataplane traffic to the upper ToR is directed to the tunnel towards the active lower ToR.
When orchagent is notified to change to standby, it will re-program both the ASIC and the kernel to let both control plane and data plane traffic be forwarded via the tunnel. To achieve the design proposed above, MuxOrch will now be changed to skip notifying Tunnelmgrd if the neighbor address is the NIC IP address, so Tunnelmgrd will not re-program the kernel route in this case and the gRPC traffic to the NIC IP address from the transceiver daemon will be forwarded directly.
The following UML diagram shows this change when the linkmgrd state moves to standby:
3.8 Enhancements
3.8.1 Advertise updated routes to T1
The current failover strategy can smoothly handle link failure cases, but if one of the ToRs crashes and T1 still sends traffic to the crashed ToR, we will see packet loss.
A further improvement to the rescue scenario is that, when detecting the peer's unhealthy status, the local ToR advertises more specific routes (i.e. longer prefixes), so that traffic from the T1s doesn't go to the crashed ToR at all.
3.8.2 Server Servicing & ToR Upgrade
For server graceful restart, we already have a gRPC service defined in 3.5.1. An indicator of ongoing server servicing should be defined based on that notification, so the ToR can avoid upgrades in the meantime. Vice versa, we can also define gRPC APIs to notify the server when a ToR upgrade is ongoing.
3.8.3 BGP update delay
When the BGP neighbors are started on an active-active T0 switch, the T0 will try to establish BGP sessions with its connected T1 switches. After the BGP sessions are established, the T0 will exchange routes with those T1s. T1 switches usually have more routes than the T0, so the T1 switches take more time to process outbound routes before sending updates. The consequence is that, after BGP session establishment, the T1 switches could receive BGP updates from the T0 before the T0 receives any BGP updates from the T1s. There will be a period in which those T1s have routes learnt from the T0 but the T0 has no routes learnt from the T1s (the T0 has no default routes). In this period, those T1s could send downstream traffic to this T0; as stated in 3.3.5, the T0 is still in the standby state, so it will try to forward the traffic via the tunnel. As the T0 has no default route in this period, that traffic will be blackholed.
So for the active-active T0s, a BGP update delay of 10 seconds is introduced into the BGP configuration to postpone sending BGP updates after BGP session establishment. In this case, the T0 can learn routes from the T1s before the T1s learn any routes from the T0, so by the time the T1s send any downstream traffic to the T0, the T0 will have default routes ready.
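As an illustration of how such a delay could be expressed in FRR (the exact keyword and placement depend on the FRR version and on SONiC's bgpd configuration template, so treat this as a sketch rather than the actual change):

```
router bgp 65100
  ! Hold route advertisements for up to 10 seconds after startup so that
  ! routes from the T1s can be learnt first.
  bgp update-delay 10
```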
3.8.4 Skip adding ingress drop ACL
Previously, at a high level, when the mux port goes to standby, MuxOrch adds an ingress ACL to drop packets on the mux port, and when the mux port goes to active, MuxOrch removes the ingress ACL. As described in 3.6, MuxOrch acts as an intermediate agent between linkmgrd and the transceiver daemon. Before the NIC receives the gRPC request to toggle to standby, the ingress drop ACL has already been programmed by MuxOrch. In this period, the server NIC still regards this ToR as active and could send upstream traffic to this ToR, but that upstream traffic will be dropped by the installed ingress drop ACL rule.
A change to skip the installation of the ingress drop ACL rule when toggling to standby is introduced to forward the upstream traffic on a best-effort basis. Though the mux port is already in the standby state in this period, skipping the ingress drop ACL allows the upstream traffic to reach the ToR and possibly be forwarded by the ToR.
3.9 Command Line
This part only covers the command lines and options for active-active dualtor.
3.9.1 Show mux status
show mux status returns the mux status for mux ports:
- PORT: mux port name
- STATUS: current mux status, could be either active or standby
- SERVER_STATUS: the mux status read from the mux server as the result of the last toggle
- HEALTH: mux port health
- HWSTATUS: check if the current mux status matches the server status
- LAST_SWITCHOVER_TIME: last switchover timestamp
$ show mux status
PORT STATUS SERVER_STATUS HEALTH HWSTATUS LAST_SWITCHOVER_TIME
---------- -------- --------------- -------- ---------- ---------------------------
Ethernet4 active active healthy consistent 2023-Mar-27 07:57:43.314674
Ethernet8 active active healthy consistent 2023-Mar-27 07:59:33.227819
3.9.2 Show mux config
show mux config returns the mux configurations:
- SWITCH_NAME: peer switch hostname
- PEER_TOR: peer switch loopback address
- PORT: mux port name
- state: mux mode configuration
- ipv4: mux server IPv4 address
- ipv6: mux server IPv6 address
- cable_type: mux cable type, active-active for active-active dualtor
- soc_ipv4: SoC IPv4 address
$ show mux config
SWITCH_NAME PEER_TOR
----------------- ----------
lab-switch-2 10.1.0.33
port state ipv4 ipv6 cable_type soc_ipv4
---------- ------- --------------- ----------------- ------------- ---------------
Ethernet4 auto 192.168.0.2/32 fc02:1000::2/128 active-active 192.168.0.3/32
Ethernet8 auto 192.168.0.4/32 fc02:1000::4/128 active-active 192.168.0.5/32
3.9.3 Show mux tunnel-route
show mux tunnel-route returns tunnel routes that have been created for mux ports.
For each mux port, there can be 3 entries: server_ipv4, server_ipv6, and soc_ipv4. For each entry, if the tunnel route is created in kernel or asic, you will see added in the command output; if not, you will see -. If no tunnel route is created for any of the 3 entries, the mux port won't show in the command output.
- Usage:
  show mux tunnel-route [OPTIONS] <port_name>
  show muxcable tunnel-route <port_name>
- Options:
  --json    display the output in json format
- Example
$ show mux tunnel-route Ethernet44
PORT DEST_TYPE DEST_ADDRESS kernel asic
---------- ----------- ----------------- -------- ------
Ethernet44 server_ipv4 192.168.0.22/32 added added
Ethernet44 server_ipv6 fc02:1000::16/128 added added
Ethernet44 soc_ipv4 192.168.0.23/32 - added
3.9.4 Config mux mode
config mux mode configures the operational mux mode for the specified port.
# config mux mode <operation_status> <port_name>
The argument <operation_status> is chosen from: active, auto, manual, standby, detach.
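For example, to pin a single port to standby and later return it to automatic control (the port name is illustrative):
# config mux mode standby Ethernet4
# config mux mode auto Ethernet4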
4 Warm Reboot Support
TBD