A single firewall is a single point of failure. For any environment where uptime matters — which is every production environment — you need high-availability firewalls. pfSense with CARP (Common Address Redundancy Protocol) provides enterprise-grade failover without enterprise-grade licensing costs. But the documentation rarely covers the failure modes that bite you at 2 AM. This guide covers real-world HA design, the asymmetric routing trap, automated failover testing, and VPN kill switch patterns.
Network Topology
The HA pair requires at least three network segments:
                  ┌─────────────────┐
                  │    ISP / WAN    │
                  └────────┬────────┘
                           │
             ┌─────────────┼─────────────┐
             │             │             │
      ┌──────▼──────┐ (CARP VIP)  ┌──────▼──────┐
      │  fw-pri01   │   WAN VIP   │  fw-sec01   │
      │  (primary)  │◄───────────►│ (secondary) │
      │             │   pfsync    │             │
      │  WAN:  .2   │◄───────────►│  WAN:  .3   │
      │  LAN:  .2   │  SYNC net   │  LAN:  .3   │
      │  SYNC: .1   │             │  SYNC: .2   │
      └──────┬──────┘             └──────┬──────┘
             │                           │
             └─────────────┬─────────────┘
                           │
                      (CARP VIP)
                        LAN VIP
                           │
                  ┌────────▼────────┐
                  │  Internal LAN   │
                  │ (servers, etc)  │
                  └─────────────────┘
IP Addressing Scheme
| Interface | fw-pri01 | fw-sec01 | CARP VIP |
|---|---|---|---|
| WAN | 203.0.113.2/24 | 203.0.113.3/24 | 203.0.113.1/24 |
| LAN | 10.0.1.2/24 | 10.0.1.3/24 | 10.0.1.1/24 |
| SYNC | 172.16.255.1/30 | 172.16.255.2/30 | N/A |
| OPT1 (DMZ) | 10.0.2.2/24 | 10.0.2.3/24 | 10.0.2.1/24 |
Clients and servers use the CARP VIP as their default gateway — never the physical address of either node.
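A quick way to catch hosts that point at a physical node instead of the VIP is to compare each host's default gateway with the addresses in the table. The sketch below is a minimal illustration; the function name `check_gateway` is hypothetical, and the addresses are the LAN values from the table above.

```shell
#!/bin/sh
# check_gateway GATEWAY VIP NODE_IP...
# Prints "ok" if GATEWAY is the CARP VIP, "physical" if it is one of the
# node addresses, and "unknown" for anything else.
check_gateway() {
    gw=$1; vip=$2; shift 2
    if [ "$gw" = "$vip" ]; then
        echo "ok"; return 0
    fi
    for node in "$@"; do
        if [ "$gw" = "$node" ]; then
            echo "physical"; return 1
        fi
    done
    echo "unknown"; return 2
}

# Example: a LAN host that was pointed at fw-pri01 directly.
check_gateway 10.0.1.2 10.0.1.1 10.0.1.2 10.0.1.3 || true
```

On a Linux host the first argument could come from `ip route show default | awk '{print $3; exit}'`. A host that reports "physical" keeps working until that node fails, which is exactly when the mistake surfaces.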
pfsync Configuration
pfsync replicates the state table between nodes. Without it, established connections drop on failover because the secondary has no knowledge of existing sessions.
<!-- /cf/conf/config.xml — pfsync relevant section -->
<hasync>
<pfsyncenabled>on</pfsyncenabled>
<pfsyncinterface>opt2</pfsyncinterface> <!-- SYNC interface -->
<pfsyncpeerip>172.16.255.2</pfsyncpeerip>
<synchronizetoip>172.16.255.2</synchronizetoip>
<username>admin</username>
<password>ENCRYPTED_PASSWORD</password>
<!-- What to sync -->
<synchronizerules>on</synchronizerules>
<synchronizenat>on</synchronizenat>
<synchronizevirtualip>on</synchronizevirtualip>
<synchronizedhcpd>on</synchronizedhcpd>
<synchronizestaticroutes>on</synchronizestaticroutes>
<synchronizealiases>on</synchronizealiases>
</hasync>
Critical: The pfsync interface must be a dedicated, directly connected link between the two firewalls. Never route pfsync traffic through the production network. Use a crossover cable or a dedicated VLAN that carries only pfsync traffic. This traffic contains state table entries including connection metadata — treat it as sensitive.
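One rough health check for pfsync is to compare the state table size on both nodes: if replication works, the counts should stay close. The sketch below assumes SSH access as `admin` and reads the "current entries" line of `pfctl -si`; the helper names `state_drift` and `count_states` are hypothetical, and any alert threshold is yours to choose.

```shell
#!/bin/sh
# state_drift PRIMARY_COUNT SECONDARY_COUNT
# Prints the absolute difference as a percentage of the primary's count.
state_drift() {
    awk -v p="$1" -v s="$2" 'BEGIN {
        if (p == 0) { print (s == 0 ? "0.0" : "100.0"); exit }
        d = p - s; if (d < 0) d = -d
        printf "%.1f\n", d * 100 / p
    }'
}

# count_states HOST
# Reads the number of state entries from a node (assumes SSH access as admin).
count_states() {
    ssh "admin@$1" "pfctl -si" | awk '/current entries/ { print $3 }'
}

# Example with fixed numbers: 1000 states on the primary, 950 on the secondary.
state_drift 1000 950
```

In practice you would call `state_drift "$(count_states 10.0.1.2)" "$(count_states 10.0.1.3)"` and investigate the sync link when the drift stays high for more than a few seconds.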
The Asymmetric Routing Trap
This is the number one cause of HA firewall failures that survive initial testing but break under real traffic.
The problem: When the primary fails, the secondary takes over the VIP. Return traffic for existing connections may arrive at the secondary, but the SYN that established the connection went through the primary. Without proper state sync, the secondary sees a mid-stream packet with no matching state entry and drops it.
Even with pfsync, asymmetric routing can occur with multi-WAN, or briefly after a failover while upstream switches still forward the VIP's MAC address to the old master's port.
The fix:
# On BOTH pfSense nodes, under System > Advanced > Firewall & NAT:
# Set "Firewall Optimization Options" to "Conservative" (keeps idle states longer)
# Confirm pfsync demotes CARP while a node is still bulk-syncing states,
# so a freshly booted node does not take MASTER with an empty state table
sysctl net.pfsync.carp_demotion_factor
# Optionally hold the first packet of each new state until the peer has acknowledged it
# (pfsync "defer" mode; it adds latency to new connections and is a runtime setting)
ifconfig pfsync0 defer
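CARP announces the takeover itself: the new MASTER starts sending advertisements and gratuitous ARP from the VIP's virtual MAC address, which is what moves the VIP to the new switch port. There is nothing to enable for this in pfSense:
# System > Advanced > Networking
# "Suppress ARP messages" only hides "arp: ... moved from ..." log entries
# It does not control whether gratuitous ARPs are sent when a CARP VIP transitions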
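CARP VIPs answer from a virtual MAC address of the form 00:00:5e:00:01:XX, where XX is the VHID in hex. Checking that a LAN host has learned this MAC for the VIP, rather than a node's physical MAC, is a cheap sanity test. The helper name `carp_mac` is hypothetical.

```shell
#!/bin/sh
# carp_mac VHID
# Prints the virtual MAC address a CARP VIP with this VHID answers from.
carp_mac() {
    printf '00:00:5e:00:01:%02x\n' "$1"
}

carp_mac 1    # prints 00:00:5e:00:01:01
```

On a Linux host, `ip neigh show 203.0.113.1` should list the same address after `lladdr` for the WAN VIP (VHID 1 in this guide). Because the MAC does not change on failover, neighbours keep a valid ARP entry and only the switch port for that MAC has to move.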
CARP VIP Configuration
Configure CARP VIPs with explicit skew values. The primary should have a lower skew (higher priority):
# WAN CARP VIP
Virtual IP: 203.0.113.1/24
Type: CARP
Interface: WAN
VHID Group: 1
Advertising Frequency - Base: 1, Skew: 0 (PRIMARY)
Password: [shared CARP password]
# On secondary:
Advertising Frequency - Base: 1, Skew: 100 (SECONDARY)
Repeat for each interface (LAN, DMZ, etc.), using unique VHID groups per subnet.
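To confirm the skew values actually took effect, you can summarise the `carp:` lines that FreeBSD's `ifconfig` prints for each VIP. The parser below is a sketch that assumes the usual line shape `carp: MASTER vhid 1 advbase 1 advskew 0`; the function name `carp_summary` and the sample input are illustrative.

```shell
#!/bin/sh
# carp_summary
# Reads ifconfig output on stdin and prints one line per CARP VHID:
#   <interface> vhid=<n> <STATE> advskew=<n>
carp_summary() {
    awk '
        /^[^ \t]/     { iface = $1; sub(/:$/, "", iface) }
        $1 == "carp:" { print iface, "vhid=" $4, $2, "advskew=" $8 }
    '
}

# Example with a captured sample instead of a live node.
printf 'em0: flags=8943<UP> metric 0 mtu 1500\n\tcarp: MASTER vhid 1 advbase 1 advskew 0\n' |
    carp_summary
```

Against a live node this would be `ssh admin@10.0.1.2 ifconfig | carp_summary`. The primary should show the lower `advskew` on every VHID.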
Automated Failover Testing
Never trust HA that you have not tested. Build automated failover tests that run monthly:
#!/bin/bash
# failover_test.sh — Automated CARP failover validation
# Run from a monitoring host on the LAN
CARP_VIP="10.0.1.1"
PRIMARY_MGMT="10.0.1.2"
SECONDARY_MGMT="10.0.1.3"
TEST_TARGET="198.51.100.1" # External host for connectivity test
LOG="/var/log/failover-test.log"
log() { echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') $*" >> "$LOG"; }
# Phase 1: Verify baseline
log "=== Failover Test Starting ==="
log "Phase 1: Baseline verification"
if ! ping -c 3 -W 2 "$CARP_VIP" > /dev/null 2>&1; then
    log "FAIL: CARP VIP unreachable at baseline"
    exit 1
fi
# Verify primary is MASTER
# (FreeBSD ifconfig prints e.g. "carp: MASTER vhid 1 advbase 1 advskew 0")
PRIMARY_STATUS=$(ssh admin@"$PRIMARY_MGMT" \
    "ifconfig | awk '/carp:/ {print \$2; exit}'" 2>/dev/null)
if [[ "$PRIMARY_STATUS" != "MASTER" ]]; then
    log "WARN: Primary is not MASTER at baseline: $PRIMARY_STATUS"
fi
# Phase 2: Start continuous connectivity monitor
log "Phase 2: Starting connectivity monitor"
PING_LOG=$(mktemp)
ping -i 0.2 "$CARP_VIP" > "$PING_LOG" 2>&1 &
PING_PID=$!
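# Also start a TCP monitor. Each iteration opens a NEW connection, so this measures
# how long new connections fail, not whether established states survive failover.
TCP_LOG=$(mktemp)
(while true; do
    if nc -z -w 1 "$TEST_TARGET" 443 > /dev/null 2>&1; then
        echo "$(date -u '+%H:%M:%S.%N') OK"
    else
        echo "$(date -u '+%H:%M:%S.%N') FAIL"
    fi
    sleep 0.5
done) > "$TCP_LOG" 2>&1 &
TCP_PID=$!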
# Phase 3: Trigger failover on primary
log "Phase 3: Triggering failover (disabling CARP on primary)"
# Assumption: net.inet.carp.allow=0 puts every VHID on the node into INIT, the same
# knob Status > CARP > "Temporarily Disable CARP" uses; verify on your release.
ssh admin@"$PRIMARY_MGMT" \
    "sysctl net.inet.carp.allow=0" 2>/dev/null
# Wait for failover to complete
sleep 5
# Phase 4: Verify secondary took over
log "Phase 4: Verifying secondary promotion"
SECONDARY_STATUS=$(ssh admin@"$SECONDARY_MGMT" \
    "ifconfig | awk '/carp:/ {print \$2; exit}'" 2>/dev/null)
if [[ "$SECONDARY_STATUS" == "MASTER" ]]; then
    log "PASS: Secondary promoted to MASTER"
else
    log "FAIL: Secondary did not promote: $SECONDARY_STATUS"
fi
# Test external connectivity through secondary
if ping -c 3 -W 2 "$TEST_TARGET" > /dev/null 2>&1; then
    log "PASS: External connectivity maintained"
else
    log "FAIL: External connectivity lost after failover"
fi
# Phase 5: Restore primary
log "Phase 5: Restoring primary"
ssh admin@"$PRIMARY_MGMT" \
    "sysctl net.inet.carp.allow=1" 2>/dev/null
sleep 10
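# Phase 6: Analyze results
kill $PING_PID $TCP_PID 2>/dev/null
# Linux ping prints nothing for a lost reply, so derive loss from gaps in icmp_seq
RECEIVED=$(grep -c 'bytes from' "$PING_LOG")
TOTAL_PINGS=$(grep 'bytes from' "$PING_LOG" | grep -o 'icmp_seq=[0-9]*' | tail -1 | cut -d= -f2)
TOTAL_PINGS=${TOTAL_PINGS:-0}
LOST_PINGS=$(( TOTAL_PINGS - RECEIVED ))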
log "=== Results ==="
log "Total pings: $TOTAL_PINGS, Lost: $LOST_PINGS"
log "Failover packet loss: approximately ${LOST_PINGS} packets"
rm -f "$PING_LOG" "$TCP_LOG"
log "=== Failover Test Complete ==="
Expected results: CARP failover should complete in 2-4 seconds, which at 200 ms intervals is roughly 10-20 lost pings. If loss continues past about 5 seconds (25 pings), investigate gratuitous ARP propagation and upstream switch MAC learning times.
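The failover time can be estimated from the CARP settings. In FreeBSD's CARP implementation a backup declares the master down after hearing no advertisement for 3 × base + skew/256 seconds, using the backup's own values. The sketch below turns that into an expected ping-loss figure; the function names are hypothetical, and the result is an upper bound for a silent failure, not a measurement.

```shell
#!/bin/sh
# carp_master_down ADVBASE ADVSKEW
# Seconds a backup waits without advertisements before it takes over.
carp_master_down() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.2f\n", 3 * b + s / 256 }'
}

# expected_lost_pings SECONDS INTERVAL
# Number of probes sent during an outage of SECONDS at the given interval.
expected_lost_pings() {
    awk -v t="$1" -v i="$2" 'BEGIN { printf "%d\n", t / i + 0.5 }'
}

timeout=$(carp_master_down 1 100)     # secondary: base 1, skew 100 -> 3.39
expected_lost_pings "$timeout" 0.2    # about 17 probes at 200 ms
```

Raising the base lengthens the timeout in steps of three seconds, so keep it at 1 unless the links between the nodes are lossy.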
VPN Kill Switch Implementation
A VPN kill switch ensures that if the VPN tunnel goes down, traffic that should be encrypted is blocked rather than sent cleartext over the WAN. Implement this with pfSense firewall rules and gateway monitoring:
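# Step 1: Do NOT put the WAN behind the VPN in a gateway group
# System > Routing > Gateway Groups
# A group like this falls back to cleartext WAN, which defeats the kill switch:
#   Name: VPN_Failsafe
#   Gateway Priority:
#   - VPN_GW (Tier 1)
#   - WAN_GW (Tier 2)
#   Trigger Level: Member Down
# Step 2: Instead, create rules that BLOCK when VPN is down
# System > Advanced > Miscellaneous: check "Skip rules when gateway is down",
# otherwise pfSense recreates Rule 1 without its gateway and traffic follows the default route
# Firewall > Rules > LAN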
# Rule 1: Allow traffic through VPN gateway (when up)
Action: Pass
Interface: LAN
Source: LAN_SUBNETS
Destination: any
Gateway: VPN_GW
Description: "Route through VPN when available"
# Rule 2: Block everything else (kill switch)
Action: Block
Interface: LAN
Source: LAN_SUBNETS
Destination: any
Log: yes
Description: "KILL SWITCH — block if VPN is down"
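The rules are only proven once you have watched them block. One way to test from a LAN host is to ask an external service for your public IP while the tunnel is up, then again after stopping the VPN, and classify the answer. This is a sketch: `leak_check` is a hypothetical helper, 192.0.2.10 stands in for your VPN exit address, and the IP-echo URL in the usage note is a placeholder.

```shell
#!/bin/sh
# leak_check VPN_EXIT_IP OBSERVED_IP
# Classifies what a LAN host sees as its public address:
#   VPN      traffic leaves through the tunnel
#   BLOCKED  no answer at all (kill switch is holding)
#   LEAK     traffic left through some other path, in cleartext
leak_check() {
    if [ -z "$2" ]; then
        echo "BLOCKED"
    elif [ "$2" = "$1" ]; then
        echo "VPN"
    else
        echo "LEAK"
    fi
}

# Example with fixed values: the observed address matches the VPN exit.
leak_check 192.0.2.10 192.0.2.10
```

A live check might look like `leak_check 192.0.2.10 "$(curl -s --max-time 5 https://ip.example.net)"`, where the URL is whatever IP-echo endpoint you trust. With the tunnel stopped, anything other than BLOCKED means the kill switch is not working.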
For OpenVPN specifically, add the gateway monitoring:
# System > Routing > Gateways
# Edit the VPN gateway:
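Monitor IP: 10.8.0.1 # VPN tunnel endpoint
Probe Interval: 1000 # milliseconds between probes
Loss Interval: 2000 # ms to wait before a probe counts as lost
Time Period: 5000 # ms over which results are averaged
Alert Interval: 1000 # ms between checks for an alert condition
Packet Loss Thresholds: 40 / 60 # percent; 60% of a 5-probe window is 3 lost probes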
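Aggressive monitor timings are easy to get wrong. As far as I know, pfSense's gateway form rejects a time period that is not greater than twice the probe interval plus the loss interval, and an alert interval shorter than the probe interval. Treat that rule as an assumption to verify against your release; the sketch below just applies it, with all four values in the same unit, and `check_monitor` is a hypothetical name.

```shell
#!/bin/sh
# check_monitor PROBE LOSS PERIOD ALERT   (all values in the same unit)
# Applies the assumed constraints:
#   PERIOD > 2 * PROBE + LOSS   and   ALERT >= PROBE
check_monitor() {
    probe=$1; loss=$2; period=$3; alert=$4
    if [ "$period" -le $(( 2 * probe + loss )) ]; then
        echo "time period too short"; return 1
    fi
    if [ "$alert" -lt "$probe" ]; then
        echo "alert interval shorter than probe interval"; return 1
    fi
    echo "ok"
}

# Example: probe 1000, loss 2000, period 5000, alert 1000.
check_monitor 1000 2000 5000 1000
```

Shorter windows mark a dead tunnel down sooner but also flap more on a lossy link, so test the values under real load before relying on them.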
WAN Failover with Cellular Backup
For sites with a 4G/5G cellular backup WAN:
# Gateway Group for WAN failover:
Name: WAN_FAILOVER
Trigger Level: Packet Loss or High Latency
- WAN_PRIMARY (Tier 1)
- WAN_LTE (Tier 2)
# Monitor both gateways
WAN_PRIMARY Monitor IP: 198.51.100.1 # Upstream ISP router
WAN_LTE Monitor IP: 198.51.100.2 # Carrier gateway
# Critical: Set different monitor IPs for each gateway
# If both monitor the same IP and that IP goes down,
# both gateways appear down simultaneously
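If you keep gateway settings in a list of "name monitor-IP" pairs, a one-line check catches a shared monitor address before it causes a double outage. The function name `duplicate_monitors` is hypothetical; the input is the pair of gateways above.

```shell
#!/bin/sh
# duplicate_monitors
# Reads "GATEWAY_NAME MONITOR_IP" lines on stdin and prints each monitor IP
# that is used by more than one gateway (once per IP).
duplicate_monitors() {
    awk 'seen[$2]++ == 1 { print $2 }'
}

# Example: the two gateways above use distinct monitors, so nothing is printed.
printf '%s\n' "WAN_PRIMARY 198.51.100.1" "WAN_LTE 198.51.100.2" | duplicate_monitors
```

Empty output means every gateway has its own monitor address.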
SIEM Integration for Firewall State Monitoring
Push CARP state changes to your SIEM for alerting:
<!-- Wazuh custom decoder for pfSense CARP events -->
<decoder name="pfsense-carp">
<parent>syslog</parent>
<prematch>carp:</prematch>
<regex>carp: (\S+) (\S+): state transition: (\S+) -> (\S+)</regex>
<order>interface, vhid, old_state, new_state</order>
</decoder>
<!-- Alert on state transitions -->
<rule id="100850" level="10">
<decoded_as>pfsense-carp</decoded_as>
<field name="new_state">MASTER</field>
<description>pfSense CARP failover: $(interface) VHID $(vhid) became MASTER</description>
<group>firewall,ha,failover,</group>
</rule>
<rule id="100851" level="12">
<decoded_as>pfsense-carp</decoded_as>
<field name="new_state">INIT</field>
<description>pfSense CARP VHID entered INIT state: CARP disabled or interface link down</description>
<group>firewall,ha,critical,</group>
</rule>
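Before loading the decoder it helps to check the pattern against a real log line from your firewall, since the message format can differ between releases. The sketch below mirrors the four capture groups of the decoder regex using `sed`; the function name `parse_carp_event` and the sample line are hypothetical, written to match the format the decoder expects.

```shell
#!/bin/sh
# parse_carp_event
# Reads syslog lines on stdin and prints the decoder's four fields for every
# line that matches; non-matching lines produce no output.
parse_carp_event() {
    sed -n 's/.*carp: \([^ ]*\) \([^ ]*\): state transition: \([^ ]*\) -> \([^ ]*\).*/interface=\1 vhid=\2 old_state=\3 new_state=\4/p'
}

# Example with a made-up line in the expected format.
echo 'kernel: carp: em0 1: state transition: BACKUP -> MASTER' | parse_carp_event
```

If lines copied from your own system log produce no output here, adjust the decoder regex to the format you actually see before relying on the alerts.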
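Split-brain detection: If both nodes report MASTER simultaneously, you have a split-brain condition. This usually means CARP advertisements are not reaching the peer on that segment: a switch dropping multicast, a VHID or password mismatch, or a broken path between the two nodes' ports. CARP runs on each VIP's own interface, not over the pfsync link. Create a correlation rule: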
<rule id="100852" level="15" frequency="2" timeframe="10">
<if_matched_sid>100850</if_matched_sid>
<same_field>vhid</same_field>
<description>CRITICAL: Possible CARP split-brain — multiple MASTER transitions for same VHID</description>
<group>firewall,ha,split_brain,</group>
</rule>
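The SIEM rule infers split-brain from log timing. A direct poll of both nodes is a useful cross-check. The sketch below classifies a pair of CARP states for one VHID; `pair_status` is a hypothetical helper, and the states would come from each node's `ifconfig` output.

```shell
#!/bin/sh
# pair_status PRIMARY_STATE SECONDARY_STATE
# Classifies the CARP states two nodes report for the same VHID.
pair_status() {
    case "$1/$2" in
        MASTER/MASTER)               echo "split-brain" ;;
        MASTER/BACKUP|BACKUP/MASTER) echo "ok" ;;
        *)                           echo "degraded" ;;
    esac
}

# Example: both nodes claim MASTER for the same VHID.
pair_status MASTER MASTER
```

Anything other than "ok" deserves a look: "degraded" covers the cases where no node is MASTER or one side is in INIT.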
Operational Checklist
Before declaring HA “done,” verify each item:
- [ ] Both nodes can independently route traffic (test by manually failing over)
- [ ] pfsync interface is on a dedicated link with no other traffic
- [ ] Gratuitous ARPs propagate to all upstream switches within 2 seconds
- [ ] VPN tunnels re-establish automatically after failover (check DPD settings)
- [ ] DHCP leases are synced (verify a client can renew after failover)
- [ ] Config sync works from primary to secondary (make a change on primary, verify on secondary; never configure sync back from the secondary)
- [ ] Monitoring alerts fire on state transitions
- [ ] Automated failover test runs monthly and reports results
- [ ] Backup cellular WAN activates within acceptable timeframe (typically 10-30 seconds)
- [ ] Kill switch blocks cleartext traffic when VPN gateway is marked down
HA firewalls are not a set-and-forget deployment. Test failover regularly, monitor state transitions, and treat the pfsync link as critical infrastructure. The 2 AM outage that exercises your HA for the first time in production is not when you want to discover that asymmetric routing drops half your connections.
