Need help setting up a bond connection in active-backup mode between 2 Ethernet links (different ISPs)

Hi everyone. I have 2 internet connections (with 2 different ISPs). One is supposed to be my main connection (faster, supposedly better service overall) and the other one should act as a backup. I deliver online training over Zoom and I can’t afford to be offline during a session. I decided to get the second connection after a couple of brief connection issues with my main provider.

I understand I need to set up a bond connection in active-backup mode. I have followed several online tutorials, but things are not working as expected. I need help either correcting what I did or, at least, recalibrating my expectations. This is the last tutorial I tried.

What I expected

  • Open a Zoom meeting
  • Unplug my main connection
  • Meeting remains open
  • Normal internet browsing

What happens today

  • Open a Zoom meeting
  • Unplug my main connection
  • Meeting shuts down (sometimes it reconnects after ~40 seconds)
  • Can’t browse any longer (sometimes it works after ~40 seconds)

My setup
ip addr show bond0


5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3a:1d:fa:76:ce:2d brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.81/24 brd 192.168.0.255 scope global dynamic noprefixroute bond0
       valid_lft 2387sec preferred_lft 2387sec
    inet6 2800:2200:3000:247::259/128 scope global dynamic noprefixroute 
       valid_lft 2384sec preferred_lft 2384sec
    inet6 2800:2200:3000:247:9308:d65d:7771:b7c/64 scope global dynamic noprefixroute 
       valid_lft 299sec preferred_lft 299sec
    inet6 fe80::d897:da0f:1d83:d15/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

cat /proc/net/bonding/bond0

Ethernet Channel Bonding Driver: v6.1.31-2-MANJARO

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: enp7s0 (primary_reselect always)
Currently Active Slave: enp7s0
MII Status: up
MII Polling Interval (ms): 1
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: enp6s0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 24:4b:fe:e0:b9:31
Slave queue ID: 0

Slave Interface: enp7s0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 24:4b:fe:e0:b9:30
Slave queue ID: 0

After running
sudo ip link set enp7s0 down
I run
cat /proc/net/bonding/bond0

Ethernet Channel Bonding Driver: v6.1.31-2-MANJARO

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: enp7s0 (primary_reselect always)
Currently Active Slave: enp6s0
MII Status: up
MII Polling Interval (ms): 1
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: enp6s0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 24:4b:fe:e0:b9:31
Slave queue ID: 0

Slave Interface: enp7s0
MII Status: down
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 5
Permanent HW addr: 24:4b:fe:e0:b9:30
Slave queue ID: 0
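
To watch the failover as it happens, the status file above can be reduced to the few fields that matter. This is only a minimal sketch: the function names `active_slave` and `slave_summary` are made up for illustration, and it assumes the file keeps the layout shown in the output above.

```shell
# Print the currently active slave from a bonding status file.
# Defaults to /proc/net/bonding/bond0; pass another path to inspect a saved copy.
active_slave() {
    local status_file="${1:-/proc/net/bonding/bond0}"
    awk -F': ' '/^Currently Active Slave:/ { print $2 }' "$status_file"
}

# Print "<interface> <MII status> <link failure count>" for every slave.
# The bond-level "MII Status" line is skipped because no slave has been seen yet.
slave_summary() {
    local status_file="${1:-/proc/net/bonding/bond0}"
    awk -F': ' '
        /^Slave Interface:/    { iface = $2 }
        /^MII Status:/         { if (iface != "") status = $2 }
        /^Link Failure Count:/ { print iface, status, $2 }
    ' "$status_file"
}
```

Running `while sleep 1; do active_slave; done` in one terminal while unplugging the cable in another should show when the bond itself switches over, which helps separate a slow bond from a slow recovery further upstream.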

inxi

System:
  Kernel: 6.1.31-2-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 13.1.1
    parameters: BOOT_IMAGE=/boot/vmlinuz-6.1-x86_64
    root=UUID=20705b87-eec6-4fb3-8353-031c51df1cac ro quiet splash apparmor=1
    security=apparmor udev.log_priority=3 nvidia_drm.modeset=1
  Desktop: GNOME v: 44.1 tk: GTK v: 3.24.37 wm: gnome-shell dm: GDM v: 44.1
    Distro: Manjaro Linux base: Arch Linux
Machine:
  Type: Desktop Mobo: ASUSTeK model: ROG STRIX X570-E GAMING v: Rev X.0x
    serial: <superuser required> UEFI: American Megatrends v: 4602
    date: 02/23/2023
Battery:
  Device-1: hid-CC20412029RJ2XQAQ-battery model: Apple Inc. Magic Trackpad 2
    serial: N/A charge: N/A status: discharging
  Device-2: hidpp_battery_0 model: Logitech Wireless Touch Keyboard K400 Plus
    serial: <filter> charge: 55% (should be ignored) rechargeable: yes
    status: discharging
CPU:
  Info: model: AMD Ryzen 9 5900X bits: 64 type: MT MCP arch: Zen 3+ gen: 4
    level: v3 note: check built: 2022 process: TSMC n6 (7nm) family: 0x19 (25)
    model-id: 0x21 (33) stepping: 0 microcode: 0xA201025
  Topology: cpus: 1x cores: 12 tpc: 2 threads: 24 smt: enabled cache:
    L1: 768 KiB desc: d-12x32 KiB; i-12x32 KiB L2: 6 MiB desc: 12x512 KiB
    L3: 64 MiB desc: 2x32 MiB
  Speed (MHz): avg: 2501 high: 3700 min/max: 2200/4950 boost: enabled
    scaling: driver: acpi-cpufreq governor: schedutil cores: 1: 3592 2: 2200
    3: 2200 4: 2200 5: 2200 6: 2200 7: 2200 8: 2200 9: 2200 10: 2200 11: 2200
    12: 2200 13: 2920 14: 2301 15: 2200 16: 2200 17: 2874 18: 3700 19: 2873
    20: 3700 21: 2200 22: 2200 23: 2200 24: 2875 bogomips: 177329
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
  Vulnerabilities:
  Type: itlb_multihit status: Not affected
  Type: l1tf status: Not affected
  Type: mds status: Not affected
  Type: meltdown status: Not affected
  Type: mmio_stale_data status: Not affected
  Type: retbleed status: Not affected
  Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via
    prctl
  Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer
    sanitization
  Type: spectre_v2 mitigation: Retpolines, IBPB: conditional, IBRS_FW,
    STIBP: always-on, RSB filling, PBRSB-eIBRS: Not affected
  Type: srbds status: Not affected
  Type: tsx_async_abort status: Not affected
Graphics:
  Device-1: Nanjing Magewell Eco Capture Dual HDMI M.2 vendor: SafeNet
    driver: Eco Capture v: N/A alternate: MWEcoCapture pcie: gen: 2 speed: 5 GT/s
    lanes: 4 bus-ID: 04:00.0 chip-ID: 1cd7:0051 class-ID: 0400
  Device-2: NVIDIA TU116 [GeForce GTX 1660 Ti] vendor: Micro-Star MSI
    driver: nvidia v: 530.41.03 alternate: nouveau,nvidia_drm non-free: 530.xx+
    status: current (as of 2023-05) arch: Turing code: TUxxx
    process: TSMC 12nm FF built: 2018-22 pcie: gen: 1 speed: 2.5 GT/s lanes: 16
    link-max: gen: 3 speed: 8 GT/s ports: active: none off: DP-1,DP-2,DP-3
    empty: HDMI-A-1 bus-ID: 0b:00.0 chip-ID: 10de:2182 class-ID: 0300
  Display: x11 server: X.Org v: 21.1.8 with: Xwayland v: 23.1.1
    compositor: gnome-shell driver: X: loaded: nvidia gpu: nvidia,nvidia-nvswitch
    display-ID: :1 screens: 1
  Screen-1: 0 s-res: 5440x1080 s-dpi: 96 s-size: 1439x286mm (56.65x11.26")
    s-diag: 1467mm (57.76")
  Monitor-1: DP-1 mapped: DP-0 note: disabled pos: right model: Dell P2418HT
    serial: <filter> built: 2019 res: 1920x1080 hz: 60 dpi: 93 gamma: 1.2
    size: 527x296mm (20.75x11.65") diag: 604mm (23.8") ratio: 16:9 modes:
    max: 1920x1080 min: 640x480
  Monitor-2: DP-2 note: disabled pos: center model: HS133PC built: 2020
    res: 1920x1080 hz: 60 dpi: 166 gamma: 1.2 size: 294x166mm (11.57x6.54")
    diag: 338mm (13.3") ratio: 3:2, 4:3 modes: max: 1920x1080 min: 640x480
  Monitor-3: DP-3 mapped: DP-4 note: disabled pos: primary,left
    model: HP E202 serial: <filter> built: 2020 res: 1600x900 hz: 60 dpi: 92
    gamma: 1.2 size: 443x249mm (17.44x9.8") diag: 508mm (20") ratio: 16:9
    modes: max: 1600x900 min: 640x480
  API: OpenGL Message: Unable to show GL data. Required tool glxinfo missing.
Audio:
  Device-1: Nanjing Magewell Eco Capture Dual HDMI M.2 vendor: SafeNet
    driver: Eco Capture alternate: MWEcoCapture pcie: gen: 2 speed: 5 GT/s
    lanes: 4 bus-ID: 04:00.0 chip-ID: 1cd7:0051 class-ID: 0400
  Device-2: NVIDIA TU116 High Definition Audio vendor: Micro-Star MSI
    driver: snd_hda_intel v: kernel pcie: gen: 1 speed: 2.5 GT/s lanes: 16
    link-max: gen: 3 speed: 8 GT/s bus-ID: 0b:00.1 chip-ID: 10de:1aeb
    class-ID: 0403
  Device-3: AMD Starship/Matisse HD Audio vendor: ASUSTeK
    driver: snd_hda_intel v: kernel pcie: gen: 4 speed: 16 GT/s lanes: 16
    bus-ID: 0d:00.4 chip-ID: 1022:1487 class-ID: 0403
  Device-4: Kingston HyperX Cloud Alpha S
    driver: hid-generic,snd-usb-audio,usbhid type: USB rev: 2.0 speed: 12 Mb/s
    lanes: 1 mode: 1.1 bus-ID: 7-3:3 chip-ID: 0951:16ed class-ID: 0300
    serial: <filter>
  API: ALSA v: k6.1.31-2-MANJARO status: kernel-api with: aoss
    type: oss-emulator tools: alsactl,alsamixer,amixer
  Server-1: JACK v: 1.9.22 status: off tools: N/A
  Server-2: PipeWire v: 0.3.70 status: off with: pipewire-media-session
    status: off tools: pw-cli
  Server-3: PulseAudio v: 16.1 status: active with: 1: pulseaudio-alsa
    type: plugin 2: pulseaudio-jack type: module tools: pacat,pactl,pavucontrol
Network:
  Device-1: Intel Wi-Fi 6 AX200 driver: iwlwifi v: kernel pcie: gen: 2
    speed: 5 GT/s lanes: 1 bus-ID: 05:00.0 chip-ID: 8086:2723 class-ID: 0280
  IF: wlp5s0 state: up mac: <filter>
  Device-2: Realtek RTL8125 2.5GbE vendor: ASUSTeK driver: r8169 v: kernel
    pcie: gen: 2 speed: 5 GT/s lanes: 1 port: e000 bus-ID: 06:00.0
    chip-ID: 10ec:8125 class-ID: 0200
  IF: enp6s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
  Device-3: Intel I211 Gigabit Network vendor: ASUSTeK driver: igb v: kernel
    pcie: gen: 1 speed: 2.5 GT/s lanes: 1 port: d000 bus-ID: 07:00.0
    chip-ID: 8086:1539 class-ID: 0200
  IF: enp7s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
  IF-ID-1: bond0 state: up speed: 1000 Mbps duplex: full mac: <filter>
  IF-ID-2: bonding_masters state: N/A speed: N/A duplex: N/A mac: N/A
  IF-ID-3: br-ac907f97584b state: down mac: <filter>
  IF-ID-4: docker0 state: up speed: 10000 Mbps duplex: unknown mac: <filter>
  IF-ID-5: veth4d2e1eb state: up speed: 10000 Mbps duplex: full mac: <filter>
Bluetooth:
  Device-1: Intel AX200 Bluetooth driver: btusb v: 0.8 type: USB rev: 2.0
    speed: 12 Mb/s lanes: 1 mode: 1.1 bus-ID: 1-6:3 chip-ID: 8087:0029
    class-ID: e001
  Report: rfkill ID: hci0 rfk-id: 1 state: down bt-service: enabled,running
    rfk-block: hardware: no software: yes address: see --recommends
Drives:
  Local Storage: total: 465.76 GiB used: 340.82 GiB (73.2%)
  SMART Message: Required tool smartctl not installed. Check --recommends
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Corsair model: Force MP600
    size: 465.76 GiB block-size: physical: 512 B logical: 512 B speed: 63.2 Gb/s
    lanes: 4 tech: SSD serial: <filter> fw-rev: EGFM13.0 temp: 38.9 C
    scheme: GPT
Partition:
  ID-1: / raw-size: 465.46 GiB size: 457.09 GiB (98.20%)
    used: 340.82 GiB (74.6%) fs: ext4 dev: /dev/nvme0n1p2 maj-min: 259:2
  ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%)
    used: 312 KiB (0.1%) fs: vfat dev: /dev/nvme0n1p1 maj-min: 259:1
Swap:
  Alert: No swap data was found.
Sensors:
  System Temperatures: cpu: 33.0 C mobo: 34.0 C
  Fan Speeds (RPM): N/A
Info:
  Processes: 448 Uptime: 19m wakeups: 4 Memory: available: 15.53 GiB
  used: 2.98 GiB (19.2%) Init: systemd v: 253 default: graphical
  tool: systemctl Compilers: gcc: 13.1.1 clang: 15.0.7 Packages: 1730 pm: dpkg
  pkgs: 0 pm: pacman pkgs: 1718 libs: 440 tools: gnome-software,pacaur,pamac
  pm: flatpak pkgs: 5 pm: snap pkgs: 7 Shell: Zsh v: 5.9
  running-in: gnome-terminal inxi: 3.3.27

Thanks a lot for your help!!!

In that order? Perhaps make sure the connection you want is active before you start a meeting…

Can you reproduce the issue with anything else besides Zoom?

Disclaimer: I am not a networking guru in any shape or form and have never used WAN bonding.

This is not how it works.

This is completely normal. After the switch to the backup, all existing network connections are terminated, and every application has to re-establish its connections. Browsers are not that good at detecting this. To check the backup, use something like ping and check whether your DNS is working.
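
A rough sketch of such a check, testing plain IP reachability and DNS resolution separately so it is clear which of the two broke after the switch. The function name `check_backup` and the default targets (`1.1.1.1`, `zoom.us`) are illustrative assumptions, not something from this thread; substitute whatever you actually depend on.

```shell
# Check raw IP reachability and DNS resolution separately.
# Arguments (both optional): an IP address to ping and a host name to resolve.
check_backup() {
    local ip="${1:-1.1.1.1}"
    local host="${2:-zoom.us}"
    local rc=0

    # Three probes, waiting at most 2 seconds for each reply.
    if ping -c 3 -W 2 "$ip" >/dev/null 2>&1; then
        echo "ip ok: $ip"
    else
        echo "ip FAILED: $ip"
        rc=1
    fi

    # getent goes through the system resolver configuration, as most applications do.
    if getent hosts "$host" >/dev/null 2>&1; then
        echo "dns ok: $host"
    else
        echo "dns FAILED: $host"
        rc=1
    fi

    return "$rc"
}
```

Run `check_backup` once on the main link, take the main interface down, then run it again. If the IP check passes but DNS fails, the backup link is probably up but still using the first ISP's resolver.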

Your ip command output shows a private IPv4 and a public IPv6. If you're unlucky, your private IPv4 stays the same but the IPv6 will change. Hopefully your IPv4 will also change.
Sometimes it is easier to switch at the router level, but most ordinary home routers can’t do this.
Either way, there will be downtime. 40 seconds is not bad; it often takes minutes until all applications notice the change.

But if you do this for a living, invest in a better internet connection: get in touch with a B2B provider that can connect you directly.

If you have two different ISPs, it doesn’t matter if your two local devices are bonded. To the outside world, you appear as two different IPs on different connections.

Usually, this is recoverable for browsing because it happens so fast you won’t notice. But for a stateful connection such as video streaming, the switch might not be fast enough, or the application might not support roaming between connections.

You could ask Zoom for support, maybe they know some settings which you could check?
Maybe the encryption doesn’t support this rapid switching?

In this case, I believe that bonding is not a good idea because you’re bonding across two different IPs (how should that even work?)
Usually, you should give the faster device a higher routing priority and the alternative a lower priority, so that the alternative is used only if the first one breaks.
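
One way this could look, assuming the two NICs are set up as two separate NetworkManager connections instead of a bond. The connection names `isp-main` and `isp-backup`, the helper names and the metric values are all hypothetical; `ipv4.route-metric` and `ipv6.route-metric` are the NetworkManager properties involved, and a lower metric means a more preferred route. The sketch only prints the commands unless `APPLY=1` is set, so they can be reviewed first.

```shell
# Print (or, with APPLY=1, run) the nmcli command for one connection and metric,
# once for IPv4 and once for IPv6.
_set_metric() {
    local con="$1" metric="$2" family
    for family in ipv4 ipv6; do
        if [ "${APPLY:-0}" = "1" ]; then
            nmcli connection modify "$con" "$family.route-metric" "$metric"
        else
            echo "nmcli connection modify '$con' $family.route-metric $metric"
        fi
    done
}

# Usage: set_route_priority <main-connection> <backup-connection> [main-metric] [backup-metric]
# The main connection gets the lower metric, so its default route wins while it exists.
set_route_priority() {
    _set_metric "$1" "${3:-100}"
    _set_metric "$2" "${4:-200}"
}
```

For example, `set_route_priority isp-main isp-backup` prints four commands. The connections need to be re-activated (for instance with `nmcli connection up`) before new metrics take effect. Note that this only helps when the link itself goes down; existing connections such as a running Zoom call would still have to reconnect through the other ISP.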

Thanks for helping me set my expectations right. My main connection is afaik the best one money can buy where I live (Buenos Aires, Argentina - in an area of town with no fiber-optic coverage). I will look for a solution at the router level then. I guess it makes sense to look for advice elsewhere. Thanks a lot for the pointer though!!

Another, probably unrelated, thing: I’d try changing these values to something more sane, e.g. a polling interval of 300 ms and up/down delays of 1500 ms.
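
A small sketch of how those values could be assembled into a bonding option string. The function name `bond_options` is made up; the defaults are the values suggested above plus the primary interface from the original post. The multiple-of-miimon check is there because the kernel bonding driver rounds `updelay` and `downdelay` down to a multiple of `miimon`.

```shell
# Build a bonding option string for active-backup mode.
# Usage: bond_options [miimon] [updelay] [downdelay] [primary]   (times in ms)
bond_options() {
    local miimon="${1:-300}" updelay="${2:-1500}" downdelay="${3:-1500}"
    local primary="${4:-enp7s0}"

    if [ "$miimon" -le 0 ]; then
        echo "miimon must be greater than 0" >&2
        return 1
    fi
    # Refuse delays the driver would silently round down.
    if [ $((updelay % miimon)) -ne 0 ] || [ $((downdelay % miimon)) -ne 0 ]; then
        echo "updelay and downdelay should be multiples of miimon" >&2
        return 1
    fi
    echo "mode=active-backup,miimon=$miimon,updelay=$updelay,downdelay=$downdelay,primary=$primary"
}
```

If the bond is managed by NetworkManager under a connection named `bond0` (an assumption; check with `nmcli connection show`), the string could be applied with `sudo nmcli connection modify bond0 bond.options "$(bond_options)"` and then re-activating the connection.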

Based on your description, if reliability is the biggest concern, I would invest in a load-balancing router. I have set these up for three of my clients and they work well. I’m not sure whether the Zoom call would survive the switchover, but it may be worth a look.

At any rate, what you want to research is “load balancing router”

Example: TP-Link ER605 V2

Thanks! I managed to get a zoom-resistant setup by getting a Speedify account.