TL;DR
All Wazuh agents were ending up on a single manager because the load balancer was using IP-hash stickiness. Because every agent connection passed through a Cloudflare Tunnel, the hash consistently mapped them all to the same backend. Changing the 1514 upstream to `least_conn` balanced new connections across managers and fixed the skew. 1515 (registration) remained pointed at the master, per Wazuh recommendations.
Background — how traffic flows
- 1515 — agent registration (`agent-auth`). This must reach the master so keys and enrollments are handled consistently.
- 1514 — agent data/logs (continuous monitoring). This is the port you normally load-balance across managers.
When Nginx proxies raw TCP with the `stream` module, the upstream selection algorithm determines which manager receives each connection.
What “stickiness” means (short)
Stickiness means the load balancer routes the same client to the same backend consistently. `hash $remote_addr consistent;` pins a client IP to a backend so that repeated connections from that IP go to the same manager.
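The pinning behaviour can be sketched in a few lines. This is a toy model, not nginx's actual implementation (nginx's `hash ... consistent` uses ketama consistent hashing), and the backend names and IPs are hypothetical, but the key property is the same: one source IP always maps to one backend.

```python
import hashlib

# Illustrative backend pool; names are placeholders.
BACKENDS = ["manager-1", "manager-2", "manager-3"]

def pick_backend(client_ip: str) -> str:
    """Toy IP-hash: a stable hash of the source IP selects the backend."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

# Distinct client IPs spread across the pool...
print({ip: pick_backend(ip) for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]})

# ...but when every connection arrives via a tunnel with a single source
# address, every agent lands on the same manager.
print({f"agent-{n}": pick_backend("172.16.0.9") for n in range(3)})
```

The second print illustrates the failure mode below: the hash is doing its job, but the input it hashes is no longer distinct per agent.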
What happened in my environment
I set up an NGINX load balancer following Wazuh's default instructions, but noticed that all agents were connecting to the same manager. I knew failover itself worked, because when I turned off Server 1 all agents seamlessly moved to Server 2.
Because all agent traffic was routed through a Cloudflare Tunnel, the load balancer saw every connection originating from the tunnel connector rather than from the agents' own distinct source IPs, so the IP-hash algorithm assigned all of those connections to a single manager. The result: the secondary manager received few or no connections even though it was healthy and reachable.
The fix
I changed the 1514 upstream algorithm from `hash $remote_addr consistent;` to `least_conn;`. `least_conn` assigns new TCP connections to the manager with the fewest active connections, which evens load when many connections are long-lived or when upstream hashing concentrates traffic.
Before (sticky by IP):
```nginx
stream {
    upstream master {
        server <MASTER_NODE_IP_ADDRESS>:1515;
    }

    upstream mycluster {
        hash $remote_addr consistent;
        server <MASTER_NODE_IP_ADDRESS>:1514;
        server <WORKER_NODE_IP_ADDRESS>:1514;
        server <WORKER_NODE_IP_ADDRESS>:1514;
    }

    server {
        listen 1515;
        proxy_pass master;
    }

    server {
        listen 1514;
        proxy_pass mycluster;
    }
}
```
After (balanced by connection count):
```nginx
stream {
    upstream master {
        server <MASTER_NODE_IP_ADDRESS>:1515;
    }

    upstream mycluster {
        least_conn;
        server <MASTER_NODE_IP_ADDRESS>:1514 max_fails=3 fail_timeout=5s;
        server <WORKER_NODE_IP_ADDRESS>:1514 max_fails=3 fail_timeout=5s;
        server <WORKER_NODE_IP_ADDRESS>:1514 max_fails=3 fail_timeout=5s;
    }

    server {
        listen 1515;
        proxy_pass master;
    }

    server {
        listen 1514;
        proxy_pass mycluster;
    }
}
```
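To see where each session actually lands, a stream access log that records the chosen upstream helps. A minimal sketch (the format name and log path are my own choices; `$upstream_addr` is the manager nginx proxied the session to):

```nginx
stream {
    log_format basic '$remote_addr -> $upstream_addr '
                     '[$time_local] $protocol $status '
                     '$bytes_sent/$bytes_received $session_time';
    access_log /var/log/nginx/stream-1514.log basic;

    # ... upstream and server blocks as above ...
}
```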
Validation steps I ran
1. Validate and reload nginx: `nginx -t && systemctl reload nginx`
2. Capture proxied traffic (quick check): `sudo tcpdump -n -i any 'tcp port 1514' -c 50`
3. Restart a few test agents (or wait a while for natural network reconnections) and check the Dashboard.
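Another quick check, run on the load balancer itself, is to count established outbound connections to port 1514 grouped by manager IP. A sketch assuming `ss` from iproute2 and IPv4 backends (the address-splitting `awk` would need adjusting for IPv6):

```shell
# Count established LB -> manager connections on 1514, grouped by peer IP.
# In default `ss -tn` output, $5 is the peer address:port column.
ss -tn 'dport = :1514' | awk 'NR>1 {split($5, a, ":"); print a[1]}' | sort | uniq -c
```

After the switch to `least_conn`, the counts should converge toward roughly equal as agents reconnect.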
Note: `least_conn` affects new connections only. Existing persistent agent sessions remain where they are until they reconnect; expect the distribution to improve as agents renew connections over time.
Takeaways
- The Wazuh docs' `hash $remote_addr consistent` is a good default when agents present distinct source IPs and you want predictable agent → manager affinity.
- When traffic is funneled through a proxy or tunnel (e.g., Cloudflare Tunnel) or you have many long-lived connections, `least_conn` gives a more even distribution for new connections.
- Always keep registration (1515) pointed at the master.
- Add `max_fails`/`fail_timeout` for basic resilience, and use tcpdump or a stream access log to confirm backend distribution.