SONiC Weighted ECMP

High Level Design document

Table of contents

Revision

RevDateAuthorDescription
0.124/01/2024Nazarii HnydynInitial version

About this manual

This document provides general information about WCMP implementation in SONiC

Scope

This document describes the high level design of WCMP feature in SONiC

In scope:​

  1. WCMP L3 configuration

Out of scope:​

  1. WCMP EVPN Type-5 configuration

Abbreviations

TermMeaning
SONiCSoftware for Open Networking in the Cloud
WCMPWeighted-Cost Multi-Path
ECMPEqual-Cost Multi-Path
UCMPUnequal-Cost Multi-Path
EVPNEthernet Virtual Private Network
FRRFree Range Routing
BGPBorder Gateway Protocol
IPInternet Protocol
L3Layer 3
OAOrchestration Agent
DBDatabase
NHNext Hop
NHGNext Hop Group
ToRTop-Of-Rack
APIApplication Programming Interface
SAISwitch Abstraction Interface
ASICApplication-Specific Integrated Circuit
SWSSSwitch State Service
CLIСommand-line Interface
JSONJavaScript Object Notation
YANGYet Another Next Generation
PYTESTPython Testing Framework
VSVirtual Switch
PTFPacket Test Framework

1 Introduction

1.1 Feature overview

In normal ECMP, the route to a destination has multiple next hops and traffic is expected to be equally distributedacross these next hops. In practice, flow-based hashing is used so that all traffic associated with a particular flowuses the same next hop, and by extension, the same path across the network.

Weighted ECMP using BGP link bandwidth introduces support for network-wide UCMP to an IP destination.The unequal cost load balancing is implemented by the forwarding plane based on the weights associatedwith the next hops of the IP prefix. These weights are computed based on the bandwidths of the correspondingmultipaths which are encoded in the BGP link bandwidth extended community. Exchange of an appropriateBGP link bandwidth value for a prefix across the network results in network-wide unequal cost multipathing.

WCMP is applicable in a pure L3 network as well as in a EVPN network.

1.2 Requirements

1.2.1 Functionality

This feature will support the following functionality:​

  1. WCMP BGP device global configuration

  2. Warm/Fast reboot

1.2.2 Command interface

This feature will support the following commands:​

  1. config: set switch WCMP global configuration

  2. show: display switch WCMP global configuration

1.2.3 Error handling

1.2.3.1 Frontend

This feature will provide error handling for the next situations:​

  1. Missing parameter value

  2. Invalid parameter value

1.2.3.2 Backend

This feature will provide error handling for the next situations:​

  1. Invalid parameter value

1.2.4 Event logging

1.2.4.1 Frontend

This feature will provide event logging for the next situations:​

  1. Configuration update
Table 1: Frontend event logging
EventSeverity
Configuration update: successNOTICE
Configuration update: errorERROR

1.2.4.2 Backend

This feature will provide event logging for the next situations:​

  1. Invalid parameter value

  2. Parameter removal

  3. Configuration removal

  4. Configuration update

Table 2: Backend event logging
EventSeverity
Invalid parameter valueERROR
Parameter removalNOTICE
Configuration removalNOTICE
Configuration update: successNOTICE
Configuration update: errorERROR

2 Design

2.1 Overview

Figure 1: WCMP design

A network fabric is a type of network topology where all nodes, in this case switches and endpoints,are interconnected to all other nodes. In such a topology, full bisection bandwidth is often expected.The full bisection bandwidth allows one half of the network nodes to communicate simultaneouslywith the other half of the nodes.

When a link fails, it is important not only for routing topology to converge quickly, but also convergein a way that can maintain the best performance possible. For example, with multiple links connectingeach ToR and Spine, one link failing does not change reachability, but has big implicationsin terms of performance due to reduced capacity.

Weighted-Equal Cost-Multipath (W-ECMP) offers a solution to this problem in which the relative link weightreflecting the available capacity is distributed using the BGP bandwidth community attribute.

Dataplane configuration flow:

  • The BGP link bandwidth extended community carries information about the available link capacity for the routeprefixes through the network, which gets mapped to the weight of the corresponding next hop. The mappingfactors the bandwidth value of a particular path against the total bandwidth values of all possible paths,normalized to the range 1 to 255

  • FRR converts the incoming BGP link bandwidth extended community values into proportionated weightamong the ecmp members in such a way that the cumulative value of individual weights is normalized to 255

  • BGP gets notified about the weight change for the prefixes towards the switch that experienced the failureof the link connecting the Spine and the remote ToR

  • FRR/Zebra recalculates the weight and makes a new NHG

  • NH weights on routes are reported to fpmsyncd and further via Redis DB to Route OA

  • Route OA programs weight values to syncd via SAI Redis

  • ASIC is provisioned with weight values by syncd via SAI API

The NHG member weights can be used by a dataplane to do better forwarding decisions.

2.2 FRR WCMP configuration

The configuration for weighted ECMP using BGP link bandwidth requires using a route-map to injectthe link bandwidth extended community.

There is no configuration necessary to process received link bandwidth and translate it into the weightassociated with the corresponding next hop; that happens by default. If some of the multipaths do not havethe link bandwidth extended community, the default behavior is to revert to normal ECMP.

At the entry point router that is injecting the prefix to which weighted load balancing must be performed,a route-map must be configured to attach the link bandwidth extended community.

For the use case of providing weighted load balancing for an anycast service, this configuration will typicallyneed to be applied at the ToR or Leaf router that is connected to servers which provide the anycast serviceand the bandwidth would be based on the number of multipaths for the destination.

Note:​ FRR documentation at FRR BGP manual

2.2.1 L3 network

Skeleton code:​

!
route-map wcmp-map permit 100
  set extcommunity bandwidth num-multipaths
exit
!
router bgp 65100
  neighbor SPINE peer-group
  neighbor 1.1.1.1 peer-group SPINE
  neighbor 1.1.1.1 remote-as 65200
  neighbor 2.2.2.2 peer-group SPINE
  neighbor 2.2.2.2 remote-as 65200
  !
  address-family ipv4 unicast
    neighbor SPINE route-map wcmp-map out
    neighbor SPINE activate
  exit-address-family
  !
end

2.2.2 EVPN network

Skeleton code:​

!
route-map wcmp-map permit 100
  set extcommunity bandwidth num-multipaths
exit
!
router bgp 65100 vrf vrf1
  neighbor SPINE peer-group
  neighbor 1.1.1.1 peer-group SPINE
  neighbor 1.1.1.1 remote-as 65200
  neighbor 2.2.2.2 peer-group SPINE
  neighbor 2.2.2.2 remote-as 65200
  !
  address-family l2vpn evpn
    advertise ipv4 unicast route-map wcmp-map
    neighbor SPINE activate
  exit-address-family
  !
end

Note:​ EVPN configuration is only relevant when docker_routing_config_mode is either split or split-unified

2.3 Configuration daemon

2.3.1 Overview

Figure 2: WCMP daemon design

The existing DeviceGlobalCfgMgr class will be extended with a new APIs to implement WCMP feature.Configuration daemon will be extended with a new WCMP Config DB schema and set/unset template support.WCMP updates will be processed by configuration daemon based on Config DB changes.Some updates will be handled and some will be considered as invalid.

2.3.2 SWSS OA

NHG member weight support is already part of SWSS.No further modifications are required.

Note:​ HLD link is WCMP SWSS HLD

2.3.3 BGP daemon

Class DeviceGlobalCfgMgr holds a set of methods matching generic Manager class pattern to handleConfig DB updates. For that purpose a SubscriberStateTable mechanism (implemented in sonic-swss-common)is used. Method DeviceGlobalCfgMgr::handler will be called on WCMP update. It will distribute handlingof DB updates between other handlers based on the table key updated (Redis Keyspace Notifications).

This class is responsible for:

  1. Processing updates of WCMP

  2. Partial input data validation

  3. Replicating data from Config DB to FRR configuration

  4. Caching objects in order to handle updates

WCMP object is stored under BGP_DEVICE_GLOBAL|STATE key in Config DB. On BGP_DEVICE_GLOBAL update,method DeviceGlobalCfgMgr::set_wcmp will be called to process the change.Regular WCMP update will refresh the internal class structures and appropriate FRR objects.

Skeleton code:​

class DeviceGlobalCfgMgr(Manager):
    """ This class responds to change in device-specific state """

    def __init__(self, common_objs, db, table):
        """
        Initialize the object
        :param common_objs: common object dictionary
        :param db: name of the db
        :param table: name of the table in the db
        """
        ...

        self.wcmp_template = common_objs['tf'].from_file("bgpd/wcmp/bgpd.wcmp.conf.j2")

    def set_handler(self, key, data):
        """ Handle device TSA/WCMP state change """
        ...

        self.configure_tsa(data)
        self.configure_wcmp(data)

        return True

    def del_handler(self, key):
        """ Handle device TSA/WCMP state remove """
        ...

        self.configure_tsa()
        self.configure_wcmp()

        return True

    def is_update_required(self, key, value):
        """ API to check if configuration update required """
        ...

    def configure_tsa(self, data=None):
        """ Configure TSA feature"""
        ...

    def configure_wcmp(self, data=None):
        """ Configure WCMP feature"""
        ...

Note:​ WCMP BGP configuration via FRR management framework is not considered

2.3.4 FRR configuration templates

BGP daemon uses policies.conf.j2 at sonic-buildimage/dockers/docker-fpm-frr/frr/bgpd/templates/generalin order to generate relevant route-map infrastructure.

WCMP BGP requires using an outbound route-map to inject the link bandwidth extended community.For this purpose, the next definitions will be used:

  • TO_BGP_PEER_V4

  • TO_BGP_PEER_V6

BGP daemon will use bgpd.wcmp.conf.j2 at sonic-buildimage/dockers/docker-fpm-frr/frr/bgpd/wcmp to enable/disable WCMP.

Skeleton code:​

!
! template: bgpd/wcmp/bgpd.wcmp.conf.j2
!
route-map TO_BGP_PEER_V4 permit 100
{%- if wcmp_enabled == 'true' %}
  set extcommunity bandwidth num-multipaths
{%- else %}
  no set extcommunity bandwidth
{%- endif %}
exit
!
route-map TO_BGP_PEER_V6 permit 100
{%- if wcmp_enabled == 'true' %}
  set extcommunity bandwidth num-multipaths
{%- else %}
  no set extcommunity bandwidth
{%- endif %}
exit
!
! end of template: bgpd/wcmp/bgpd.wcmp.conf.j2
!

2.4 DB schema

2.4.1 Config DB

; defines schema for WCMP configuration attributes
key = BGP_DEVICE_GLOBAL|STATE ; device global state. Must be unique

; field      = value
wcmp_enabled = wcmp-state ; enable/disable WCMP using BGP link bandwidth

; value annotations
wcmp-state = "true" / "false"

2.4.2 Data sample

Config DB:​

redis-cli -n 4 HGETALL 'BGP_DEVICE_GLOBAL|STATE'
1) "wcmp_enabled"
2) "true"

2.4.3 Configuration sample

WCMP configuration:​

{
    "BGP_DEVICE_GLOBAL": {
        "STATE": {
            "wcmp_enabled": "true"
        }
    }
}

2.4.4 Initial configuration

WCMP initial configuration will be updated at sonic-buildimage/files/build_templates/init_cfg.json.j2in order to expose the default state.

Skeleton code:​

{
    ...

    "BGP_DEVICE_GLOBAL": {
        "STATE": {
            "wcmp_enabled": "false"
        }
    },

    ...
}

2.5 Flows

2.5.1 Config section

2.5.1.1 WCMP update

Figure 3: WCMP update flow

2.5.2 Show section

2.5.2.1 WCMP show

Figure 4: WCMP show flow

2.6 CLI

2.6.1 Command structure

User interface:

config
|--- bgp
     |--- device-global
          |--- wcmp <ARG>

show
|--- bgp
     |--- device-global OPTIONS

Arguments:​

config bgp device-global wcmp

  1. enabled|disabled - enable/disable WCMP using BGP link bandwidth

Options:​

show bgp device-global

  1. -j|--json - display in JSON format

2.6.2 Usage examples

2.6.2.1 Config command group

The following command updates WCMP configuration:​

config bgp device-global wcmp enabled

2.6.2.2 Show command group

The following command shows WCMP configuration:​

show bgp device-global
TSA        WCMP
---------  --------
Disabled   Enabled

show bgp device-global --json
{
    "tsa": "disabled",
    "wcmp": "enabled"
}

2.7 YANG model

Existing YANG model sonic-bgp-device-global.yang at sonic-buildimage/src/sonic-yang-models/yang-modelswill be extended with a new schema in order to provide support for WCMP.

Skeleton code:​

module sonic-bgp-device-global {
    yang-version 1.1;

    namespace "http://github.com/sonic-net/sonic-bgp-device-global";
    prefix bgp_device_global;

    organization "SONiC";
    contact "SONiC";
    description "SONIC Device-specific BGP global data";

    revision 2024-01-28 {
        description "Add Weighted ECMP using BGP link bandwidth";
    }

    revision 2022-06-26 {
        description "Initial revision";
    }

    container sonic-bgp-device-global {

        container BGP_DEVICE_GLOBAL {

            container STATE {

                ...

                leaf wcmp_enabled {
                    description "Enables/Disables Weighted-Equal Cost-Multipath using BGP link bandwidth";
                    type boolean;
                    default "false";
                }

            }
            /* end of STATE container */
        }
        /* end of BGP_DEVICE_GLOBAL container */
    }
    /* end of sonic-bgp-device-global container */
}
/* end of module sonic-bgp-device-global */

2.8 Warm/Fast boot

No special handling is required

3 Test plan

3.1 Unit tests via PYTEST

WCMP basic configuration test:

  1. Verify FRR route-map is generated properly after WCMP config update

3.2 Unit tests via VS

TBD

3.3 Data plane tests via PTF

TBD