automatic internet failover to LTE or another interface
So you have your own Linux router, and two separate internet connections, and you’d like to have your router switch to the failover one when the main one is acting up.
Good news, you’re at the right place :)
In this guide we’ll go through what needs to be done to have your box automatically switch to the failover interface and back. We’ll also talk about how you could send updates / notifications to your system of choice.
Let’s take a look at a summary of what we need to get done:
- decide which interface is going to be our main one, and which one is going to be the failover
- adjust the route metric of our secondary interface
- add iptables rules for our secondary interface
- create and customize our failover script
- add a systemd service that makes sure our script runs all the time
- test!
Grab a drink and let’s get going!
Figure out our interfaces
In my case, enp1s0 is my main interface, and enp3s0 is my LTE failover interface.
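If you're not sure which interfaces currently carry a default route, a small helper along these lines can list them. This is only a sketch: default_route_devs is a name made up for this example (it is not part of the failover script), and it simply pulls the dev field out of ip route style output.

```shell
# default_route_devs is a hypothetical helper, not part of the failover script.
# It reads `ip route` style lines on stdin and prints the device of each
# default route, one per line.
default_route_devs() {
    awk '$1 == "default" {
        for (i = 1; i < NF; i++) if ($i == "dev") print $(i + 1)
    }'
}

# On a live router you would feed it real data:
#   ip route show default | default_route_devs
```

With both connections up, you should see both of your interface names in the output.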
Adjust the failover route metric
When more than one default route exists, Linux uses the one with the lowest metric.
We can leverage that to dynamically configure which route to use at any given time.
To make sure the main interface is used by default, we give the main route a low metric and the failover route a really high one.
I use dhcpcd to set up my interfaces. To change the metrics, I added this to dhcpcd.conf:
interface enp1s0
metric 100
interface enp3s0
metric 99999
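To double check which default route wins after changing the metrics, something like the sketch below can help. pick_default_route is a made-up helper name, and the logic is a simplification: it only compares the metric field of each default route and treats a route without an explicit metric as metric 0.

```shell
# pick_default_route is a hypothetical helper, not part of the failover script.
# It reads `ip route` style lines on stdin and prints the default route
# with the lowest metric (a missing metric counts as 0).
pick_default_route() {
    awk '$1 == "default" {
        m = 0
        for (i = 1; i < NF; i++) if ($i == "metric") m = $(i + 1)
        if (best == "" || m + 0 < bestm) { best = $0; bestm = m + 0 }
    } END { if (best != "") print best }'
}

# On a live router:
#   ip route show default | pick_default_route
```

With the dhcpcd settings above, the line printed should be the enp1s0 route while everything is healthy.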
iptables
Hopefully, if you’re adding failover support to an existing router, you already have some rules in place. To make sure traffic leaving through the failover interface is NAT-ed correctly, I also added a masquerade rule for it:
iptables -t nat -A POSTROUTING -o enp3s0 -j MASQUERADE
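Running the command above twice adds the rule twice. If you'd rather keep it idempotent, a wrapper like this sketch could work. ensure_masquerade and the IPTABLES override variable are assumptions made for this example (the override just makes it possible to dry-run the function); the only iptables features used are -C (check) and -A (append).

```shell
# ensure_masquerade is a hypothetical helper, not part of the original setup.
# It appends the MASQUERADE rule for an interface only if it is not there yet.
# IPTABLES can be set to another command name, e.g. for a dry run.
ensure_masquerade() {
    em_ipt="${IPTABLES:-iptables}"
    em_iface="$1"
    # -C returns non-zero when the rule does not exist yet
    if ! "${em_ipt}" -t nat -C POSTROUTING -o "${em_iface}" -j MASQUERADE 2>/dev/null; then
        "${em_ipt}" -t nat -A POSTROUTING -o "${em_iface}" -j MASQUERADE
    fi
}

# Usage (as root):
#   ensure_masquerade enp3s0
```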
Create and customize our failover script
This is what you’ve been waiting for!
Copy this script to /usr/local/bin/failover.sh and customize the variables!
Read the comments for more details:
#!/bin/bash
# failover.sh
# v2.1 2022-01-26
# * fixed hardcoded interface and incorrect 0/1 values
# v2.0 2022-01-16
# Alex Alexander <alex.alexander@gmail.com>
# Your main internet interface
IF_MAIN="enp1s0"
# The interface you want to enable if IF_MAIN is not working
IF_FAILOVER="enp3s0"
# the metric to set the FAILOVER to when disabled
METRIC_FAILOVER_OFF="99999"
# the metric to set the FAILOVER to when ENABLED
METRIC_FAILOVER_ACTIVE="10"
# this many pings in a row have to fail for us to change state
FAILOVER_PING_THRESHOLD=2
# the hosts we ping to figure out if internet is alive.
# order matters, so we check two separate providers to make sure it's not the other end
HOSTS_TO_PING=(
"1.1.1.1"
"8.8.8.8"
"1.0.0.1"
"8.8.4.4"
)
# how long to wait for a ping reply when testing the main interface (seconds)
PING_WAIT_MAIN=2
# how long to wait for a ping reply when testing the failover interface (seconds)
PING_WAIT_FAILOVER=5
# how many pings to send to each host per check
PING_LOOPS=1
# check whether IF_MAIN is working every X seconds
CHECK_MAIN_INTERVAL=10
# check whether IF_FAILOVER is working every X seconds
CHECK_FAILOVER_INTERVAL=600
# also check on start
CHECK_FAILOVER_COUNTER=${CHECK_FAILOVER_INTERVAL}
# my failover interface is a little unstable, so we allow one retry before reporting it as unavailable
CHECK_FAILOVER_THRESHOLD=1
CHECK_FAILOVER_ROUTE=0
CHECK_FAILOVER_PING=0
DEBUG=false
if [[ "$1" == "-d" ]]; then
DEBUG=true
fi
# did we fail because the route was missing?
FAILOVER_DUE_TO_MISSING_ROUTE=false
LAST_STATE=
FAILOVER=false
PINGS_FAILED=0
PINGS_PASSED=0
# We use this method to update some external service.
function update_ha() {
echo "New State: ${@}"
#
# echo "Sending state to Home Assistant: ${@}"
# curl --header "Content-Type: application/json" \
# --request POST -o /dev/null -s \
# --data "{\"state\": \"${@^}\"}" \
# http://<some-host>/api/webhook/failover-status >/dev/null
LAST_STATE="${@}"
}
# This function knows how to check if pings work over an interface.
# It exports results to PINGS_PASSED and PINGS_FAILED
function check_pings() {
IF_TYPE=${1} # MAIN, FAILOVER
IF_NAME="IF_${IF_TYPE}"
IF=${!IF_NAME}
if [[ -z ${IF} ]]; then
echo "[EEE] Could not deduce IF from ${IF_TYPE}"
exit 1
fi
PING_WAIT_NAME="PING_WAIT_${IF_TYPE}"
PING_WAIT=${!PING_WAIT_NAME}
PINGS_FAILED=0
PINGS_PASSED=0
for ip in "${HOSTS_TO_PING[@]}"; do
# send both stdout and stderr to /dev/null (the stdout redirect has to come first)
ping -c "${PING_LOOPS}" -W "${PING_WAIT}" -I "${IF}" "${ip}" >/dev/null 2>&1
PING_RESULT=$?
if [[ ${PING_RESULT} -eq 0 ]]; then
PINGS_PASSED=$(( PINGS_PASSED + 1 ))
PINGS_FAILED=0
if [[ "${FAILOVER}" == true ]] || [[ "${DEBUG}" == true ]]; then
echo "[I] (failover: ${FAILOVER}) CHECKING ${IF_TYPE} IF: Ping to ${ip}/${IF} succeeded!"
fi
else
PINGS_PASSED=0
PINGS_FAILED=$(( PINGS_FAILED + 1 ))
echo "[E] (failover: ${FAILOVER}) CHECKING ${IF_TYPE} IF: Ping to ${ip}/${IF} FAILED"
fi
[[ ${PINGS_PASSED} -ge ${FAILOVER_PING_THRESHOLD} ]] && break
[[ ${PINGS_FAILED} -ge ${FAILOVER_PING_THRESHOLD} ]] && break
done
}
# Our main check function
function check() {
# first, check if our main interface route even exists
# if not, we can't really do anything, but we can update our state
if ! ip route list | grep default | grep -q ${IF_MAIN}; then
if [[ "${FAILOVER_DUE_TO_MISSING_ROUTE}" == false ]]; then
echo "[E] Could not find route for main interface (${IF_MAIN})"
FAILOVER_DUE_TO_MISSING_ROUTE=true
update_ha "Active (no route)"
fi
return
fi
# then, check if our failover interface route even exists
# we can't failover if there's no failover route ;)
# this is cheap, so we do it every time
if ! ip route list | grep default | grep -q ${IF_FAILOVER}; then
if [[ ${CHECK_FAILOVER_ROUTE} -lt ${CHECK_FAILOVER_THRESHOLD} ]]; then
echo "[W] Could not find route for failover interface, will retry (${IF_FAILOVER})"
CHECK_FAILOVER_ROUTE=$(( CHECK_FAILOVER_ROUTE + 1 ))
return
fi
echo "[E] Could not find route for failover interface (${IF_FAILOVER})"
update_ha "Unavailable (no route)"
return
fi
CHECK_FAILOVER_ROUTE=0
CHECK_FAILOVER_COUNTER=$(( CHECK_FAILOVER_COUNTER + CHECK_MAIN_INTERVAL ))
CHECK_FAILOVER_WAS_DONE=false
# every ~10m, send some pings over the failover interface to make sure it's
# actually working. If it's not, we can't do much to fix it automatically,
# but at least we can send out a notification to investigate, so we are not
# surprised later!
if [[ ${CHECK_FAILOVER_COUNTER} -ge ${CHECK_FAILOVER_INTERVAL} ]]; then
echo "Verifying Failover Internet is reachable"
check_pings FAILOVER
if [[ ${PINGS_FAILED} -ge ${FAILOVER_PING_THRESHOLD} ]]; then
if [[ ${CHECK_FAILOVER_PING} -lt ${CHECK_FAILOVER_THRESHOLD} ]]; then
echo "[W] Failover interface check pings failed, will retry (${IF_FAILOVER})"
CHECK_FAILOVER_PING=$(( CHECK_FAILOVER_PING + 1 ))
return
fi
update_ha "Unavailable (no ping)"
return
else
CHECK_FAILOVER_COUNTER=0
fi
CHECK_FAILOVER_WAS_DONE=true
fi
CHECK_FAILOVER_PING=0
STATE=
METRIC=$(ip route list | grep "^default" | grep "${IF_FAILOVER}" | sed "s:.*metric \([0-9]*\).*:\1:")
[[ ${METRIC} -eq ${METRIC_FAILOVER_OFF} ]] &&
FAILOVER=false || FAILOVER=true
if [[ "${FAILOVER}" == true ]]; then
DEFAULT_GW=$(ip route list | grep "^default" | grep "${IF_MAIN}" | sed "s:.*via \([.0-9]*\).*:\1:")
VIA="via ${DEFAULT_GW}"
else
VIA=""
fi
# we made it here, all routes seem to be present, let's check our main interface
check_pings MAIN
if [[ ${PINGS_FAILED} -lt ${FAILOVER_PING_THRESHOLD} ]]; then
STATE="Ready"
if [[ "${FAILOVER}" == true ]]; then
echo "[CHANGE] Ping through main IF ${IF_MAIN} worked, RESTORING"
# we need to re-write the failover route so its metric goes back to the high (disabled) value
FAILOVER_GW=$(ip route list | grep "^default" | grep "${IF_FAILOVER}" | sed "s:.*via \([.0-9]*\).*:\1:")
ip route del default via ${FAILOVER_GW}
ip route add default via ${FAILOVER_GW} dev ${IF_FAILOVER} metric ${METRIC_FAILOVER_OFF}
FAILOVER_DUE_TO_MISSING_ROUTE=false
fi
if [[ "${FAILOVER_DUE_TO_MISSING_ROUTE}" == true ]]; then
echo "[CHANGE] Main IF ${IF_MAIN} route came back, RESTORING"
FAILOVER_DUE_TO_MISSING_ROUTE=false
fi
else
STATE="Active (no ping)"
if [[ "${FAILOVER}" == true ]]; then
[[ "${DEBUG}" == true ]] &&
echo "(failover: true) Pings failed, but we've already failed over."
else
echo "[CHANGE] At least ${FAILOVER_PING_THRESHOLD} pings failed in a row, FAILING OVER"
# we need to re-write the route so it lowers the metric
FAILOVER_GW=$(ip route list | grep "^default" | grep "${IF_FAILOVER}" | sed "s:.*via \([.0-9]*\).*:\1:")
ip route del default via ${FAILOVER_GW}
ip route add default via ${FAILOVER_GW} dev ${IF_FAILOVER} metric ${METRIC_FAILOVER_ACTIVE}
fi
fi
if [[ "${STATE}" != "${LAST_STATE}" ]] || [[ "${CHECK_FAILOVER_WAS_DONE}" == true ]]; then
update_ha "${STATE}"
fi
}
echo "Internet Failover Script"
echo "---"
echo "Main Interface: ${IF_MAIN}"
echo "- Main Check: ${CHECK_MAIN_INTERVAL}s"
echo "Failover Interface: ${IF_FAILOVER}"
echo "- Failover Check: ${CHECK_FAILOVER_INTERVAL}s"
echo "==="
while true; do
check
sleep ${CHECK_MAIN_INTERVAL}
done
Whew :)
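A note on notifications: the update_ha function only echoes the new state unless you uncomment the curl call. If you wire it up to your own system, the JSON body has to stay valid even when a state contains quotes. Here's a minimal sketch of a payload builder; json_state_payload is a made-up name, and the webhook URL in the usage comment is the same placeholder used in the script, not a real endpoint.

```shell
# json_state_payload is a hypothetical helper, not part of failover.sh.
# It prints a JSON body like {"state": "Ready"} for a webhook call.
json_state_payload() {
    # escape backslashes first, then double quotes, so the JSON stays valid
    jsp_escaped=$(printf '%s' "$*" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"state": "%s"}\n' "${jsp_escaped}"
}

# Possible usage inside update_ha (the URL is a placeholder):
#   curl -s -o /dev/null --header "Content-Type: application/json" \
#        --request POST --data "$(json_state_payload "${LAST_STATE}")" \
#        "http://<some-host>/api/webhook/failover-status"
```

Any service that accepts an HTTP POST should work the same way; only the URL and the shape of the body would change.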
systemd service
We’re getting there! Next up we need to set up a systemd service, which makes running our script easier.
Create /etc/systemd/system/failover.service:
[Unit]
Description=Failover
[Service]
User=root
WorkingDirectory=/usr/local/bin
ExecStart=/usr/local/bin/failover.sh
Restart=always
[Install]
WantedBy=multi-user.target
Make sure the script is executable (chmod +x /usr/local/bin/failover.sh), edit the working directory and script path if yours differ, then enable and start the service:
systemctl daemon-reload
systemctl enable failover
systemctl start failover
Check that things worked:
# systemctl status failover
● failover.service
Loaded: loaded (/etc/systemd/system/failover.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2022-01-16 18:29:19 PST; 9s ago
Main PID: 275142 (failover.sh)
Tasks: 2 (limit: 9467)
Memory: 852.0K
CPU: 168ms
CGroup: /system.slice/failover.service
├─275142 /bin/bash /usr/local/bin/failover.sh
└─275162 sleep 10
Jan 16 18:29:19 systemd[1]: Started Failover
Jan 16 18:29:19 failover.sh[275142]: Internet Failover Script
Jan 16 18:29:19 failover.sh[275142]: ---
Jan 16 18:29:19 failover.sh[275142]: Main Interface: enp1s0
Jan 16 18:29:19 failover.sh[275142]: - Main Check: 10s
Jan 16 18:29:19 failover.sh[275142]: Failover Interface: enp3s0
Jan 16 18:29:19 failover.sh[275142]: - Failover Check: 600s
Jan 16 18:29:19 failover.sh[275142]: ===
Jan 16 18:29:19 failover.sh[275142]: Verifying Failover Internet is reachable
Jan 16 18:29:19 failover.sh[275142]: Sending state to Home Assistant: Ready
Tests!
At this point you should be ready to test! The script should be testing your internet already, so hopefully you’re not seeing errors already :P
To make sure things are actually working, let’s simulate some failures.
Hint: run journalctl -t "failover.sh" -f to keep track of the failover logs.
- The easiest tests are unplugging cables. If you can do this, I recommend it.
- First, unplug your FAILOVER cable. This check runs every 10 seconds, so hopefully after ~20 seconds you should see logs mentioning the FAILOVER route is gone.
[E] Could not find route for failover interface (enp3s0)
Sending state to Home Assistant: Unavailable (no route)
- Plug the cable back in, and a few seconds later you should see another log saying failover is ready.
Sending state to Home Assistant: Ready
- Then, unplug your MAIN internet cable. Internet should switch to your failover and a log should show up.
- Finally, plug the MAIN cable back in.
- Internet might be failing even if the route is there. Let’s test that too.
- To simulate the MAIN internet not working, run this on your router:
iptables -A OUTPUT -o enp1s0 -j DROP
(replace enp1s0 with your main interface)
- After a few seconds, you should see ping errors in the logs and the script should switch to FAILOVER:
[E] (failover: false) CHECKING MAIN IF: Ping to 1.1.1.1/enp1s0 FAILED
[E] (failover: false) CHECKING MAIN IF: Ping to 8.8.8.8/enp1s0 FAILED
[CHANGE] At least 2 pings failed in a row, FAILING OVER
[I] Sending state to Home Assistant: Active (no ping)
- To undo this, run:
iptables -D OUTPUT -o enp1s0 -j DROP
[I] (failover: true) CHECKING MAIN IF: Ping to 1.1.1.1/enp1s0 succeeded!
[I] (failover: true) CHECKING MAIN IF: Ping to 8.8.8.8/enp1s0 succeeded!
[CHANGE] Ping through main IF {enp1s0} worked, RESTORING
[I] Sending state to Home Assistant: Ready
- To test the FAILOVER check, run:
iptables -A OUTPUT -o enp3s0 -j DROP
(replace enp3s0 with your failover interface)
- FAILOVER checks run every 10 minutes, so either wait or restart the service to trigger one right away:
systemctl restart failover
[E] (failover: false) CHECKING FAILOVER IF: Ping to 1.1.1.1/enp3s0 FAILED
[E] (failover: false) CHECKING FAILOVER IF: Ping to 8.8.8.8/enp3s0 FAILED
Sending state to Home Assistant: Unavailable (no ping)
- To restore:
iptables -D OUTPUT -o enp3s0 -j DROP
Verifying Failover Internet is reachable
Sending state to Home Assistant: Ready
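It's easy to forget the iptables -D step and leave an interface blocked. If you plan to repeat these tests, a wrapper like the sketch below adds the DROP rule, waits, and removes it again. simulate_outage and the IPTABLES override variable are assumptions made for this example; the rules themselves are the same ones used in the tests above.

```shell
# simulate_outage is a hypothetical helper, not part of failover.sh.
# It blocks outgoing traffic on an interface for a number of seconds
# (default 60), then removes the block again.
# IPTABLES can be set to another command name, e.g. for a dry run.
simulate_outage() {
    so_iface="$1"
    so_seconds="${2:-60}"
    so_ipt="${IPTABLES:-iptables}"
    "${so_ipt}" -A OUTPUT -o "${so_iface}" -j DROP || return 1
    sleep "${so_seconds}"
    "${so_ipt}" -D OUTPUT -o "${so_iface}" -j DROP
}

# Usage (as root), while watching the journal in another terminal:
#   simulate_outage enp1s0 60
```

Keep in mind that if you interrupt it during the sleep (for example with Ctrl-C), the DROP rule stays in place and you still need to run the iptables -D command by hand.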
That’s all!
Hopefully everything’s working great and this post was helpful :) Feel free to comment if you have any questions. Cheers!