Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HLD] bmp for monitoring SONiC BGP info #1621

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

FengPan-Frank
Copy link

@FengPan-Frank FengPan-Frank commented Feb 22, 2024

[HLD] bmp for monitoring SONiC BGP info

Description Repo PR link State
Yang model definition sonic-buildimage sonic-net/sonic-buildimage#18530 GitHub issue/pull request detail
Add BMP_STATE_DB dedicated instance sonic-buildimage sonic-net/sonic-buildimage#19016 GitHub issue/pull request detail
Add BMP_STATE_DB support in swss-common sonic-swss-common sonic-net/sonic-swss-common#879 GitHub issue/pull request detail
bmpcfgd service implementation sonic-buildimage sonic-net/sonic-buildimage#18940 GitHub issue/pull request detail
config cli implementation sonic-utility sonic-net/sonic-utilities#3286 GitHub issue/pull request detail
show cli implementation sonic-utility sonic-net/sonic-utilities#3289 GitHub issue/pull request detail
openbmpd service implementation sonic-bmp sonic-net/sonic-bmp#7 GitHub issue/pull request detail
Add sonic-bmp submodule sonic-buildimage sonic-net/sonic-buildimage#20780 GitHub issue/pull request detail
Build bmp container into sonic-buildimage sonic-buildimage sonic-net/sonic-buildimage#18946 GitHub issue/pull request detail
Add bmp dynamic feature switch in frr side config. sonic-buildimage sonic-net/sonic-buildimage#20895 GitHub issue/pull request detail

doc/bmp/bmp.md Outdated
[OpenBMP](#https://www.openbmp.org/) is open source BGP monitor framework, which supports collecting, aggregating, data persisting, as well as dashboard, etc. For this project, we will fork this project with some internal code changes which populates the required data into Redis for our usage.

## 2.3 BGP/FRR update
Current [BGP](#https://github.com/sonic-net/sonic-frr) container has been forked from [FRR](#https://github.com/FRRouting/frr), which supports BMP protocol from version frr/7.2
Copy link
Contributor

@qiluo-msft qiluo-msft Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current [BGP] container has been forked from [FRR] -> Current BGP container uses a implementation (sonic-frr) forked from FRR

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@FengPan-Frank FengPan-Frank force-pushed the fenpan_bmp branch 2 times, most recently from 1673a55 to d075afa Compare February 29, 2024 12:33
@FengPan-Frank FengPan-Frank force-pushed the fenpan_bmp branch 5 times, most recently from 6f2ae6c to 68234b1 Compare March 12, 2024 14:41
@zhangyanzhao
Copy link
Collaborator

This table will capture BGP Neighbor attributes supported on BGP neighbor, including capability it support.

```
admin@str-7050qx-32s-acs-04:~$ redis-cli -n 0 HGETALL BGP_NEIGHBOR_TABLE:10.0.0.61
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/sonic-net/SONiC/pull/1621/files#diff-0649e404e5d4496d9d6af8f6b9411cb483e3a6d376a927d8660baf476465d7b3R138
"SONiC to stream events occurring in SONiC switch (e.g BGP flap, process not running, ...) in real time via gNMI subscription model.
The events are defined using versioned YANG schema."

We have the above YANG model for streaming the BGP states today and looks like BGP_NEIGHBOR_TABLE supports extensive state information, hope we'll migrate to use one streaming method i.e BMP or events infra.

As above context mentioned, we need to bring up new container as bmp on SONiC, after change the relevant component architecture is as below:
<img src="images/architecture_diagram.png" alt="Architecture Diagram" width="800">

- Add new bmp container, which has limited resource control.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we new docker for running the "openBMPd" daemon?

I understand that the plan is to have,

  1. config "feature" command to enable/disable this.
  2. Limit CPU resource usage of the new docker .

Even if we leverage the existing frr container to run openBMPd, there are alternate ways to achieve the above, for instance #1 we could have a "config bmp <enable|disable>" which could internally start/stop the openBMPd process. And for #2 we could use command such as "nice/cpulimit" to set process usage limits.

From what i have seen adding a new container (base docker, with no overhead app) increases RAM usage by 20 to 30 MB, so just want to ensure if we really need a new container here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it's indeed feasible as either way, and some tradeoff just exists as consideration.
The advantage of using dedicated container is that we can avoid packaging bmp feature onto some slim images if DUT resource is small and expensive relatively.

Create below config items list for enabling and disabling different table.

```
127.0.0.1:6379[4]> HGETALL FEATURE|bmp
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since FRR does support bmp enable/disable as a cmd-line arg (/usr/lib/frr/bgpd -A 127.0.0.1 -M bmp), to support per table level enable/disable as shown below, it's better to discuss this with FRR-community and introduce table level enable/disable support via FRR CLI knob, so that zebra/bgpd doesn't have to spend any cycle in sending change notifications for those tables to BMP ; Otherwise zebra/bgpd will always send notifications to BMP for the disabled tables and bmp ignores it later based on this config-db.

Please discuss with FRR community on this, as discussed in the HLD review community meeting (March-12-2024)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, currently FRR just post data to BMP following standard BGP protocol, thus neighbor table will be sent in BGP_MSG_OPEN message (#https://datatracker.ietf.org/doc/html/rfc4271#page-13) and all other tables including rib/ls_state... will be sent in BGP_MSG_UPDATE. A better option is to update in FRR side and controlled by the config/cli together, so that no extra data is transmitted between FRR and BMP. Will follow up with FRR community.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Let's discuss with FRR community and enhance FRR capability to achieve this, as part of bringing in this feature to community-sonic.


## 1.1 Functional requirements

Now SONiC supports standard BGP session established with BGP neighbors, but there's only limited BGP info readable in database, which makes some cases difficult to debug, we need to dump more BGP data, meanwhile we're also expecting some live debug option which could dump BGP related info on demand and improve the debuggability.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VTYSH show commands provide extensive dumps today for the debuggability, trying to understand the new information that we support part of openBMP.

On-change subscription for BGP data is clearly an use-case, I would like to see more filtering options (e.g RIB-in or our per neighbor) in the HLD so that we can limit the number entries going into the DB.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you asking smaller granularity control like per neighbor? instead of per table?.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm asking per neighbor level filtering at least and if possible, per table level filtering is good as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we will support table level filtering first, later could append neighbor level filtering if specific requirement asks.

bmp stats interval 1000
bmp monitor ipv4 unicast pre-policy
bmp monitor ipv6 unicast pre-policy
bmp connect 127.0.0.1 port 5000 min-retry 1000 max-retry 2000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One option is to connect to the local collector and other option is to connect to the remote collector, if we have an use-case not to store into the redis DB but to only stream into the remote collector, are we planning to give such an option?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, that will require dedicated BMP service being deployed and maintained, for now we're not planning of this.

doc/bmp/bmp.md Outdated

## 2.4 Database Design

BMP will continually populate existing redis table APPL_DB, just with different keys defined to cover the data so that functional requirement could be supported.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider pushing this state information into STATE_DB or new DB instance if that makes our life easy.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

# 4 Resource usage and Test plan

### Resource usage
Since BMP is a sidecar daemon for improve the SONiC debuggability and monitor efficiency, we don't expect it take too much CPU usage, and prefer it could be able to work under low priority but without break. Now FRR to openbmp is working under push model, if limit cpu of bmp into 5%, it just operate slow, but without missing update, when we flap the bgp several times, all the flap updates will pushed to bmp finally. Thus we should be able to restrist CPU of BMP to lower priority without
Copy link
Collaborator

@venkatmahalingam venkatmahalingam Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was an exercise done to optimize the DB memory usage for storing the entries in DASH WG. Since we may store lot of RIB information into the DB, may be it's ideal to use protobuf format rather than text format for DB entries (refer the below link for more information).
https://github.com/sonic-net/DASH/blob/main/documentation/general/data/AppDBMemoryEstimation.xlsx

Copy link
Author

@FengPan-Frank FengPan-Frank Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will make some comparison for this, using pb should be feasible just the data will not be readable if we use redis-cli directly, seems now most of sonic database are text? I'm not sure @qiluo-msft may you comment?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, most database contents are in text format, we may need to check the possibility of memory optimization as we may push lot of RIB information into the DB.

@FengPan-Frank FengPan-Frank force-pushed the fenpan_bmp branch 2 times, most recently from edbe16c to 16f287f Compare March 14, 2024 14:57
lguohan pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Apr 23, 2024
Why I did it
Add yang model support for feature sonic-net/SONiC#1621

Microsoft ADO (number only):27332639

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some concerns about this architecture design:
1)Why not BMP send route to external OpenBMP servers, which may keep more detailed BGP history based on Kafka.
2) Can BMP module in FRR write route to Redis directly? It is more straightforward.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1> That's will require us to maintain dedicated OpenBMP service, which is not in current scope, maybe later we can do that, if so it will be configurable simply from bmp perspective as well.
2> I think it will introduce unnecessary complexities into FRR, if like that FRR can persist Rib info directly instead of BMP, right? We choose BMP path now because BMP was natively supported in FRR in ASYNC mode, and not much extra cost added into FRR.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, Got the concept behind your design. Thanks~

qiluo-msft pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Sep 19, 2024
#### Why I did it
sonic-net/SONiC#1621, need dedicated daemon to monitor config_db and manage openbmpd state.
#### How I did it
Use swss-common lib to monitor config_db change.
aidan-gallagher pushed a commit to aidan-gallagher/sonic-buildimage that referenced this pull request Nov 16, 2024
#### Why I did it
sonic-net/SONiC#1621, need dedicated daemon to monitor config_db and manage openbmpd state.
#### How I did it
Use swss-common lib to monitor config_db change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🏗 In Progress
Development

Successfully merging this pull request may close these issues.

7 participants