Active-Active Dual ToR
Active-active dual ToR link manager is an evolution of the active-standby dual ToR link manager. Both ToRs are expected to handle traffic in normal scenarios. For consistency, we will keep using the term "standby" to refer to inactive links or ToRs.
Revision
Rev | Date | Author | Change Description |
---|---|---|---|
0.1 | 05/23/22 | Jing Zhang | Initial version |
0.2 | 12/02/22 | Longxiang Lyu | Add Traffic Forwarding section |
0.3 | 12/08/22 | Longxiang Lyu | Add BGP update delay section |
0.4 | 12/13/22 | Longxiang Lyu | Add skip ACL section |
0.5 | 04/10/23 | Longxiang Lyu | Add command line section |
Scope
This document provides the high-level design of the SONiC dual ToR solution supporting the active-active setup.
1 Cluster Topology
A row contains a number of racks; each rack has 2 ToRs, and each row has 8 Tier 1 (T1) network devices. Each server has a NIC connected to both ToRs with 100 Gbps DAC cables.
In this design:
- Both the upper ToR (labeled as UT0) and the lower ToR (labeled as LT0) will advertise the same IP to upstream T1s; each T1 will see 2 available next hops for the VLAN.
- Both UT0 and LT0 are expected to carry traffic in normal scenarios.
- The software stack on the server host will see a 200 Gbps NIC.
2 Requirement Overview
2.1 Server Requirements
In our cluster setup, since the smart Y-cable is replaced, some complexity is transferred to the server NIC.
Note that this complexity can be handled by active-active smart cables, or any other deployment, as long as it meets the requirements below.
- Server NIC is responsible for delivering southbound (tier 0 device to server) traffic from either uplink to applications running on the server host.
  - ToRs present the same IP and the same MAC to the server on both links.
- Server NIC is responsible for dispensing northbound (server to tier 0) traffic between the two active links at the IO stream (5-tuple) level. Each stream will be dispatched to one of the 2 uplinks until a link state changes.
- Server should provide support for the ToR to control traffic forwarding, and follow this control when dispensing traffic.
  - gRPC is introduced for this requirement.
  - Each ToR will have a well-known IP. The server NIC should dispatch gRPC replies towards these IPs to the corresponding uplinks.
- Server NIC should avoid sending traffic through unhealthy links when detecting a link state down.
- Server should replicate the following northbound traffic to both ToRs:
  - Specified ICMP replies (for probing link health status)
  - ARP propagation
  - IPv6 router solicitation, neighbor solicitation and neighbor advertisements

See the sketch below for an illustration of the IO scheduling contract.
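As a rough illustration of the IO scheduling contract implied by the requirements above, here is a minimal Python sketch; all names (Uplink, pick_uplink, etc.) are hypothetical and do not correspond to an actual NIC implementation.

```python
# Hypothetical sketch of the NIC-side IO scheduling contract described above.
# Names are illustrative only; the real NIC/SoC logic is implementation specific.
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class Uplink:
    name: str
    admin_state: str   # "active" / "standby", set by the ToR over gRPC
    oper_up: bool      # link state detected locally by the NIC

    def usable(self) -> bool:
        # Never pick a standby link; never pick a link that is operationally down.
        return self.admin_state == "active" and self.oper_up

def pick_uplink(five_tuple: tuple, uplinks: list) -> Optional[Uplink]:
    """Dispatch one IO stream (5-tuple) to a single uplink until link state changes."""
    usable = [u for u in uplinks if u.usable()]
    if not usable:
        return None  # no healthy active uplink: northbound traffic cannot be sent
    # A stable hash keeps a stream pinned to one uplink while the usable set is unchanged.
    digest = hashlib.md5(repr(five_tuple).encode()).digest()
    return usable[digest[0] % len(usable)]

def should_replicate_to_both(pkt_kind: str) -> bool:
    # Specified ICMP replies, ARP, and IPv6 RS/NS/NA are replicated to both ToRs.
    return pkt_kind in {"icmp_heartbeat_reply", "arp", "ipv6_nd"}
```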
2.2 SONiC Requirements
- Introduce active-active mode into the MUX state machine.
- Probe to determine whether a link is healthy or not.
- Signal the NIC when the ToR is switching to active or standby.
- Rescue when a peer ToR failure occurs.
- Unblock traffic when the cable control channel is unreachable.
3 SONiC ToR Controlled Solution
3.1 IP Routing
3.1.1 Normal Scenario
Both T0s are up and functioning, and both of the server NIC connections are up and functioning.
- Control Plane: UT0 and LT0 will advertise the same VLAN (IPv4 and IPv6) to upstream T1s. Each T1 will see 2 available next hops for the VLAN. T1s advertise to T2 as normal.
- Data Plane:
  - Traffic to the server
  - Traffic from the server to outside the cluster
  - Traffic from the server to within the cluster
3.1.2 Server Uplink Issue
Both T0s are up and functioning, and some server NICs are only connected to 1 ToR (due to a cable issue, or the cable is taken out for maintenance).
- Control Plane: No change from the normal case.
- Data Plane
3.1.3 ToR Failure
Only 1 T0 is up and functioning.
- Control Plane: Only 1 T0 will advertise the VLAN (IPv4 and IPv6) to upstream T1s.
- Data Plane
3.1.4 Comparison to Active-Standby
Highlights of the commonalities and differences with active-standby:
Aspect | Active-Standby | Active-Active | Implication |
---|---|---|---|
Server uplink view | Single IP, single MAC | Single IP, single MAC | |
Standby side receives traffic | Forwards it to the active ToR through an IPinIP tunnel via T1 | Forwards it to the active ToR through an IPinIP tunnel via T1 | |
T0 to T1 control plane | Advertise the same set of routes | Advertise the same set of routes | |
T1 to T0 traffic | ECMP | ECMP | |
Southbound traffic | From either side | From either side | |
Northbound traffic | All is duplicated to both ToRs | NIC determines which side to forward the traffic | Orchagent doesn't need to drop packets on the standby side |
Bandwidth | Up to 1 link | Up to 2 links | T1 and above devices see more throughput from the server |
Cable control | I2C | gRPC over DAC cables | Control plane and data plane now share the same link |
3.2 DB Schema Changes
3.2.1 Config DB
- New field in MUX_CABLE table to determine cable type

  MUX_CABLE|PORTNAME:
      cable_type: active-standby|active-active
3.2.2 App DB
- New table to invoke the transceiver daemon to query the server side forwarding state

  FORWARDING_STATE_COMMAND|PORTNAME:
      command: probe|set_active_self|set_standby_self|set_standby_peer

  FORWARDING_STATE_RESPONSE|PORTNAME:
      response: active|standby|unknown|error
      response_peer: active|standby|unknown|error

- New table for the transceiver daemon to write the peer link state to linkmgrd

  PORT_TABLE_PEER|PORTNAME:
      oper_status: up|down

- New table to invoke the transceiver daemon to set the peer's server side forwarding state

  HW_FORWARDING_STATE_PEER|PORTNAME:
      state: active|standby|unknown
3.2.3 State DB
- New table for the transceiver daemon to write the peer's server side forwarding state to linkmgrd

  HW_MUX_CABLE_TABLE_PEER|PORTNAME:
      state: active|standby|unknown
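For illustration only, these tables can be inspected or exercised with sonic-db-cli using the schema written above; the port name and values below are hypothetical, and in practice these entries are written by minigraph-generated configuration, linkmgrd, and the transceiver daemon rather than by hand.

$ sonic-db-cli CONFIG_DB HSET "MUX_CABLE|Ethernet4" cable_type active-active
$ sonic-db-cli APPL_DB HSET "FORWARDING_STATE_COMMAND|Ethernet4" command probe
$ sonic-db-cli APPL_DB HGETALL "FORWARDING_STATE_RESPONSE|Ethernet4"
$ sonic-db-cli STATE_DB HGETALL "HW_MUX_CABLE_TABLE_PEER|Ethernet4"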
3.3 Linkmgrd
Linkmgrd determines a ToR's / link's readiness for use.
3.3.1 Link Prober
Linkmgrd will keep the link prober design from active-standby mode for monitoring link health status. The link prober will send ICMP packets and listen for ICMP response packets. The ICMP packets will contain payload information about the ToR. ICMP replies will be duplicated to both ToRs by the server, hence a ToR can monitor the health status of its peer ToR as well.
The link prober will report 4 possible states:
- LinkProberUnknown: Serves as the initial state. This state is also reached when no ICMP reply is received.
- LinkProberActive: Indicates that LinkMgr receives ICMP replies containing the ID of the current ToR.
- LinkProberPeerUnknown: Indicates that LinkMgr did not receive ICMP replies containing the ID of the peer ToR. Hence, there is a chance that the peer ToR's link is currently down.
- LinkProberPeerActive: Indicates that LinkMgr receives ICMP replies containing the ID of the peer ToR, or in other words, the peer ToR's links appear to be active.
By default, the heartbeat probing interval is 100 ms. It takes 3 lost link prober packets to determine that a link is unhealthy. A server issue can also cause link prober packet loss, but the ToR won't distinguish it from a link issue.
ICMP Probing Format
The source MAC will be the ToR's SVI MAC address, and the Ethernet destination will be the well-known MAC address. The source IP will be the ToR's Loopback IP, and the destination IP will be the SoC's IP address, which will be introduced as a field in the minigraph.
Linkmgrd also adopts TLV (Type-Length-Value) as the encoding scheme in the payload for additional information elements, including cookie, version, ToR GUID, etc.
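As an illustration of such a TLV layout, here is a small Python sketch; the type codes and field widths are hypothetical and do not reflect the actual linkmgrd wire format.

```python
# Hypothetical TLV encoding of the ICMP heartbeat payload: each information
# element is packed as a 1-byte type, a 2-byte length, and the value bytes.
import struct

TLV_TYPE_COOKIE = 1     # illustrative type codes only
TLV_TYPE_VERSION = 2
TLV_TYPE_TOR_GUID = 3

def pack_tlv(tlv_type: int, value: bytes) -> bytes:
    return struct.pack("!BH", tlv_type, len(value)) + value

def build_heartbeat_payload(cookie: int, version: int, tor_guid: bytes) -> bytes:
    return (pack_tlv(TLV_TYPE_COOKIE, struct.pack("!I", cookie))
            + pack_tlv(TLV_TYPE_VERSION, struct.pack("!H", version))
            + pack_tlv(TLV_TYPE_TOR_GUID, tor_guid))

def parse_tlvs(payload: bytes) -> dict:
    """Walk the TLVs so that unknown types can be skipped for forward compatibility."""
    tlvs, offset = {}, 0
    while offset + 3 <= len(payload):
        tlv_type, length = struct.unpack_from("!BH", payload, offset)
        offset += 3
        tlvs[tlv_type] = payload[offset:offset + length]
        offset += length
    return tlvs
```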
3.3.2 Link State
When a link goes down, linkmgrd will receive a notification from SWSS based on the kernel netlink message. This notification will be used to determine whether the ToR is healthy.
3.3.3 Forwarding State
Admin Forwarding State: ToRs will signal the NIC whether the link is active / standby; we call this active / standby state the admin forwarding state. It's up to the NIC to determine which link to use if both are active, but it should never choose to use a standby link. This logic gives the ToR more control over traffic forwarding.
Operational Forwarding State: The server side should maintain an operational forwarding state as well. When a link is down, the admin forwarding state will eventually be updated to standby. But before that, if the server side detects the link down, it should stop sending traffic through this link even though the admin state is active. In this way, we ensure the ToRs have control over traffic forwarding, and also guarantee an immediate reaction when the link state is down.
3.3.4 Active-Active State Machine
Active-active state transition logic is simplified compared to active-standby. In active-standby, linkmgrd makes mux toggle decisions based on the y-cable direction, while in active-active the two links are more independent. Linkmgrd will only make state transition decisions based on health indicators.
To be more specific, if the link prober indicates active AND the link state appears to be up, linkmgrd should determine the link's forwarding state as active; otherwise, it should be standby.
Linkmgrd also provides a rescue mechanism for when the peer can't switch to standby for some reason, e.g. link failures. If the link prober doesn't receive the peer's heartbeat response AND the self ToR is in a healthy active state, linkmgrd should determine the peer link to be standby.
When the control channel is unreachable, the ToR won't block traffic forwarding, but it will periodically check the gRPC server's health. It will make sure the server side's admin forwarding state aligns with linkmgrd's decision.
3.3.5 Default route to T1
If the default route to T1 is missing, the dual ToR system can suffer from northbound packet loss, hence linkmgrd also monitors the default route state. If the default route is missing, linkmgrd will stop sending ICMP probing requests and fake an unhealthy status. This functionality can be disabled as well; the details are included in default_route.
The state transition decisions described above, and the corresponding gRPC actions to take, can be summarized as follows:
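The sketch below illustrates the decision logic described in 3.3.4 and 3.3.5; function and state names are hypothetical, and the actual linkmgrd decision table covers more cases.

```python
# Hypothetical sketch of the linkmgrd active-active decision logic described in
# 3.3.4 and 3.3.5. Names are illustrative only, not the actual linkmgrd code.

def decide_self_state(link_prober: str, link_state_up: bool, default_route_ok: bool) -> str:
    """Decide this ToR's forwarding state for a port."""
    if not default_route_ok:
        # A missing default route to T1 is treated as unhealthy (probing is paused).
        return "standby"
    if link_prober == "LinkProberActive" and link_state_up:
        return "active"
    return "standby"

def decide_peer_action(peer_prober: str, self_state: str, self_healthy: bool):
    """Rescue: force the peer to standby when its heartbeats stop and we are healthy active."""
    if peer_prober == "LinkProberPeerUnknown" and self_state == "active" and self_healthy:
        return "set_standby_peer"   # issued to the SoC gRPC server via the transceiver daemon
    return None
```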
3.3.6 Incremental Features
- Link Prober Packet Loss Statistics: The link prober will by default send a heartbeat packet every 100 ms, so the packet loss statistics can be a good measurement of system health. An incremental feature is to collect the packet loss counts, start time, and end time. The collected data is stored and updated in the state DB. Users can check and reset it through the CLI.
- Support for Detachment: Users can configure linkmgrd to a certain mode so it won't switch to active / standby based on health indicators. Users can also configure linkmgrd to a mode so it won't modify the peer's forwarding state. This support will be useful for maintenance, upgrade, and testing scenarios.
3.4 Orchagent
3.4.1 IPinIP tunnel
Orchagent will create the tunnel at initialization and add / remove routes to forward traffic to the peer ToR via this tunnel when linkmgrd switches the state to standby / active.
Check below for an example of the config DB entry and the tunnel utilization when LT0's link is having an issue.
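As a hypothetical sketch only, the config DB entries for such a tunnel could look like the following (the field names follow the common SONiC dual-ToR tunnel schema; the addresses and hostname are made up, and the exact fields may differ by platform):

  TUNNEL|MuxTunnel0:
      tunnel_type: IPINIP
      dst_ip: 10.1.0.32
      dscp_mode: uniform
      ecn_mode: copy_from_outer
      ttl_mode: pipe

  PEER_SWITCH|peer-tor-hostname:
      address_ipv4: 10.1.0.33

When linkmgrd moves a port to standby, MuxOrch programs the server's routes with this tunnel as the nexthop so that traffic reaches the peer ToR via the T1s.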
3.4.2 Flow Diagram and Orch Components
Major components of Orchagent for this IPinIP tunnel are MuxCfgOrch, TunnelOrch, and MuxOrch.
- MuxCfgOrch: Listens to config DB entries to populate the port to server IP mapping to MuxOrch.
- TunnelOrch: Subscribes to the MUX_TUNNEL table and creates the tunnel, tunnel termination, and decap entry. This tunnel object is created at initialization and is used as the nexthop object by MuxOrch for programming routes via SAI_NEXT_HOP_TYPE_TUNNEL_ENCAP.
- MuxOrch: Listens to state changes from linkmgrd and does the following at a high level:
  - Enable / disable the neighbor entry.
  - Add / remove tunnel routes.
3.5 Transceiver Daemon
3.5.1 Cable Control through gRPC
In the active-active design, we will use gRPC to do the cable control and signal the NIC whether the ToR is active. The SoC will run a gRPC server. Linkmgrd will determine the server side forwarding state based on the link prober status and the link state. Then linkmgrd can invoke the transceiver daemon to update the NIC on whether the ToRs are active or not through gRPC calls.
The currently defined gRPC services between the SoC and the ToRs related to linkmgrd cable control are listed below (see the sketch after this list):
- DualToRActive
- GracefulRestart
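A hypothetical protobuf-style sketch of these services is shown below; the RPC and message names are illustrative only and are not the actual SONiC proto definitions.

```protobuf
// Illustrative sketch only -- not the actual SONiC proto file.
syntax = "proto3";

message AdminRequest {
    repeated int32 portid = 1;  // which uplink(s) the request refers to
    repeated bool state = 2;    // desired admin forwarding state (true = active)
}

message AdminReply {
    repeated int32 portid = 1;
    repeated bool state = 2;    // admin forwarding state as seen by the SoC
}

service DualToRActive {
    // Linkmgrd (via the transceiver daemon) queries or sets the server-side
    // admin forwarding state of its own link or of the peer link.
    rpc QueryAdminForwardingPortState(AdminRequest) returns (AdminReply) {}
    rpc SetAdminForwardingPortState(AdminRequest) returns (AdminReply) {}
}

service GracefulRestart {
    // The SoC notifies the ToR about ongoing server servicing (see 3.8.2).
    rpc NotifyGracefulRestart(AdminRequest) returns (AdminReply) {}
}
```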
3.6 State Transition Flow
The following UML sequence illustrates the state transition when the linkmgrd state moves to active. The flow will be similar for moving to standby.
3.7 Traffic Forwarding
The following shows the traffic forwarding behaviors when:
- Both ToRs are active.
- One ToR is active while the other ToR is standby.
3.7.1 Special Cases of Traffic Forwarding
3.7.1.1 gRPC Traffic to the NIC IP
Consider the scenario where the upper ToR enters standby when its peer (the lower ToR) is already in the standby state: all downstream I/O from the upper ToR will be forwarded through the tunnel to the peer ToR (the lower ToR), and so will the control plane gRPC traffic from the transceiver daemon. As the lower ToR is in standby, the tunneled I/O will be blackholed, and the NIC will never know that the upper ToR has entered standby in this case.
To solve this issue, we want the control plane gRPC traffic from the transceiver daemon to be forwarded directly via the local interfaces. This differentiates the control plane traffic to the NIC IPs from dataplane traffic, whose forwarding behavior honors the mux state and is forwarded to the active peer ToR via the tunnel when the port goes to standby.
The following shows the traffic forwarding behavior when the lower ToR is active while the upper ToR is standby. Now, gRPC traffic from the standby ToR (the upper ToR) is forwarded to the NIC directly. The downstream dataplane traffic to the upper ToR is directed to the tunnel towards the active lower ToR.
When orchagent is notified to change to standby, it will re-program both the ASIC and the kernel to let both control plane and data plane traffic be forwarded via the tunnel. To achieve the design proposed above, MuxOrch will now be changed to skip notifying Tunnelmgrd if the neighbor address is the NIC IP address, so Tunnelmgrd will not re-program the kernel route in this case and the gRPC traffic to the NIC IP address from the transceiver daemon will be forwarded directly.
The following UML diagram shows this change when the linkmgrd state moves to standby:
3.8 Enhancements
3.8.1 Advertise updated routes to T1
The current failover strategy can smoothly handle link failure cases, but if one of the ToRs crashes and T1 still sends traffic to the crashed ToR, we will see packet loss.
A further improvement to the rescue scenario is that, when detecting the peer's unhealthy status, the local ToR advertises more specific routes (i.e. longer prefixes), so that traffic from the T1s doesn't go to the crashed ToR at all.
3.8.2 Server Servicing & ToR Upgrade
For server graceful restart, we already have a gRPC service defined in 3.5.1. An indicator of ongoing server servicing should be defined based on that notification, so the ToR can avoid upgrades in the meantime. Vice versa, we can also define gRPC APIs to notify the server when a ToR upgrade is ongoing.
3.8.3 BGP update delay
When the BGP neighbors are started on an active-active T0 switch, the T0 will try to establish BGP sessions with its connected T1 switches. After the BGP sessions are established, the T0 will exchange routes with those T1s. T1 switches usually have more routes than the T0, so the T1 switches take more time to process outbound routes before sending updates. The consequence is that, after BGP session establishment, the T1 switches could receive BGP updates from the T0 before the T0 receives any BGP updates from the T1s. There will be a period in which those T1s have routes learnt from the T0 but the T0 has no routes learnt from the T1s (the T0 has no default routes). In this period, those T1s could send downstream traffic to this T0; as stated in 3.3.5, the T0 is still in the standby state, so it will try to forward the traffic via the tunnel. As the T0 has no default route in this period, that traffic will be blackholed.
So for the active-active T0s, a BGP update delay of 10 seconds is introduced into the BGP configuration to postpone sending BGP updates after BGP session establishment. In this case, the T0 can learn routes from the T1s before the T1s learn any routes from the T0, so by the time the T1s send any downstream traffic to the T0, the T0 will have default routes ready.
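As an illustration of how such a delay could be expressed in FRR (the exact keyword and placement depend on the FRR version and on SONiC's bgpd configuration template, so treat this as a sketch rather than the actual change):

```
router bgp 65100
  ! Hold route advertisements for up to 10 seconds after startup so that
  ! routes from the T1s can be learnt first.
  bgp update-delay 10
```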
3.8.4 Skip adding ingress drop ACL
Previously, at a high level, when the mux port goes to standby, MuxOrch adds an ingress ACL to drop packets on the mux port, and when the mux port goes to active, MuxOrch removes the ingress ACL. As described in 3.6, MuxOrch acts as an intermediate agent between linkmgrd and the transceiver daemon. Before the NIC receives the gRPC request to toggle to standby, the ingress drop ACL has already been programmed by MuxOrch. In this period, the server NIC still regards this ToR as active and could send upstream traffic to this ToR, but that upstream traffic will be dropped by the installed ingress drop ACL rule.
A change to skip the installation of the ingress drop ACL rule when toggling to standby is introduced to forward the upstream traffic on a best-effort basis. Though the mux port is already in the standby state in this period, skipping the ingress drop ACL allows the upstream traffic to reach the ToR and possibly be forwarded by the ToR.
3.9 Command Line
This part only covers the command lines and options for active-active dualtor.
3.9.1 Show mux status
show mux status returns the mux status for mux ports:
- PORT: mux port name
- STATUS: current mux status, could be either active or standby
- SERVER_STATUS: the mux status read from the mux server as the result of the last toggle
- HEALTH: mux port health
- HWSTATUS: check if the current mux status matches the server status
- LAST_SWITCHOVER_TIME: last switchover timestamp
$ show mux status
PORT STATUS SERVER_STATUS HEALTH HWSTATUS LAST_SWITCHOVER_TIME
---------- -------- --------------- -------- ---------- ---------------------------
Ethernet4 active active healthy consistent 2023-Mar-27 07:57:43.314674
Ethernet8 active active healthy consistent 2023-Mar-27 07:59:33.227819
3.9.2 Show mux config
show mux config returns the mux configurations:
- SWITCH_NAME: peer switch hostname
- PEER_TOR: peer switch loopback address
- PORT: mux port name
- state: mux mode configuration
- ipv4: mux server IPv4 address
- ipv6: mux server IPv6 address
- cable_type: mux cable type, active-active for active-active dualtor
- soc_ipv4: SoC IPv4 address
$ show mux config
SWITCH_NAME PEER_TOR
----------------- ----------
lab-switch-2 10.1.0.33
port state ipv4 ipv6 cable_type soc_ipv4
---------- ------- --------------- ----------------- ------------- ---------------
Ethernet4 auto 192.168.0.2/32 fc02:1000::2/128 active-active 192.168.0.3/32
Ethernet8 auto 192.168.0.4/32 fc02:1000::4/128 active-active 192.168.0.5/32
3.9.3 Show mux tunnel-route
show mux tunnel-route returns tunnel routes that have been created for mux ports.
For each mux port, there can be 3 entries: server_ipv4, server_ipv6, and soc_ipv4. For each entry, if the tunnel route is created in kernel or asic, you will see added in the command output; if not, you will see -. If no tunnel route is created for any of the 3 entries, the mux port won't show in the command output.
- Usage:
  show mux tunnel-route [OPTIONS] <port_name>
  show muxcable tunnel-route <port_name>
- Options:
  --json    display the output in json format
- Example
$ show mux tunnel-route Ethernet44
PORT DEST_TYPE DEST_ADDRESS kernel asic
---------- ----------- ----------------- -------- ------
Ethernet44 server_ipv4 192.168.0.22/32 added added
Ethernet44 server_ipv6 fc02:1000::16/128 added added
Ethernet44 soc_ipv4 192.168.0.23/32 - added
3.9.4 Config mux mode
config mux mode configures the operational mux mode for the specified port.
# config mux mode <operation_status> <port_name>
The argument <operation_status> is chosen from: active, auto, manual, standby, detach.
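For example, to pin a single port to standby and later return it to automatic control (the port name is illustrative):
# config mux mode standby Ethernet4
# config mux mode auto Ethernet4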
4 Warm Reboot Support
TBD