Issue: Load Balancer Errors Occur When Idle and Then Receiving a New Request
Today's unexpected load balancer behavior!
It hasn't been resolved yet, but here is what the Wireshark packet analysis shows:
Phenomenon:
When the VIP receives no requests for a certain period and then a new request arrives, a communication error occurs.
IP addresses are replaced with placeholders:
- WEB: Webserver IP
- API: Web API Server IP
No | Source | Destination | Source MAC | Destination MAC | Flag | Sequence | Window | Length | MSS | ACK | WS | SACK Perm | Description |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | WEB | API | MS_b7:1c:10 | Radware64 | SYN | 0 | 8192 | 0 | 1460 | | 256 | 1 | Initiating communication with API server, are you ready? |
2 | API | WEB | MS_b7:1c:39 | MS_b7:1c:10 | SYN+ACK | 0 | 8192 | 0 | 1460 | 1 | 256 | 1 | Ready! Send it over! |
3 | WEB | API | MS_b7:1c:10 | Radware64 | ACK | 1 | 131328 | 0 | | 1 | | | Please execute this API request. |
4 | API | WEB | MS_b7:1c:38 | MS_b7:1c:10 | RST | 1 | 0 | 0 | | | | | What nonsense is this? |
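
The telltale detail in the capture is that the SYN+ACK and the RST both claim to come from API but carry different source MAC addresses. As a rough sketch of how that check could be automated, the snippet below groups packets by source and reports any source seen with more than one MAC. The `PACKETS` list is transcribed by hand from the table above, and the function names are made up for illustration; this is not a Wireshark feature.

```python
from collections import defaultdict

# Rows transcribed from the capture above: (no, source, source_mac, flag).
PACKETS = [
    (1, "WEB", "MS_b7:1c:10", "SYN"),
    (2, "API", "MS_b7:1c:39", "SYN+ACK"),
    (3, "WEB", "MS_b7:1c:10", "ACK"),
    (4, "API", "MS_b7:1c:38", "RST"),
]


def macs_per_source(packets):
    """Map each source address to the set of source MACs seen for it."""
    seen = defaultdict(set)
    for _no, source, source_mac, _flag in packets:
        seen[source].add(source_mac)
    return dict(seen)


def ambiguous_sources(packets):
    """Return only the sources that appear with more than one MAC address."""
    return {
        source: sorted(macs)
        for source, macs in macs_per_source(packets).items()
        if len(macs) > 1
    }


if __name__ == "__main__":
    # Prints {'API': ['MS_b7:1c:38', 'MS_b7:1c:39']}
    print(ambiguous_sources(PACKETS))
```

WEB always shows up with a single MAC, while API shows up with two, which suggests that more than one machine is answering for the same connection.
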
Based on the packet analysis results shared with the infrastructure team:
- The root cause was an incorrect MAC address registration on the firewall acting as the gateway for VIP 150.
- Only the MAC address of the active L4 equipment should have been registered, but the backup L4 equipment's MAC address was also registered.
- (This is an occasional issue during initial setup when the ARP table isn’t cleared thoroughly.)
- Requests were sent to both the active and backup L4 equipment, so packets reached both servers simultaneously and the TCP sequence was disrupted.
- As a result, the server dutifully answered the out-of-sequence packets with RST.
- After deleting the VIP 150 ARP entries on the firewall and clearing the ARP tables on each server, the issue appeared to be resolved.
- The L4 load balancing was also restored to the round-robin method.
- The issue was resolved after reconfiguring the ARP table properly.
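
To make the failure mode concrete, here is a deliberately simplified toy model of why a server sends RST when the packets of one handshake land on different machines. It is only a sketch under assumptions: the `Backend` class, the `handshake` helper, and the server names `api-1` and `api-2` are hypothetical, and a real TCP stack tracks far more state (sequence numbers, ports, timers) than this.

```python
class Backend:
    """Toy TCP endpoint that only remembers which clients started a handshake."""

    def __init__(self, name):
        self.name = name
        self.half_open = set()    # clients that sent SYN and were sent SYN+ACK
        self.established = set()  # clients that completed the handshake

    def receive(self, client, flag):
        if flag == "SYN":
            self.half_open.add(client)
            return "SYN+ACK"
        if flag == "ACK" and client in self.half_open:
            self.half_open.discard(client)
            self.established.add(client)
            return None  # handshake complete, nothing to send back
        # No matching connection state on this machine: reply with RST.
        return "RST"


def handshake(path):
    """Send SYN then ACK from WEB; `path` names the backend that gets each packet."""
    replies = []
    for flag, backend in zip(("SYN", "ACK"), path):
        replies.append((backend.name, backend.receive("WEB", flag)))
    return replies


if __name__ == "__main__":
    a, b = Backend("api-1"), Backend("api-2")
    print(handshake([a, a]))  # both packets reach the same server
    print(handshake([a, b]))  # the ACK reaches a server that never saw the SYN
```

In the second call the ACK arrives at a server that never saw the SYN, so it replies with RST. That matches packets 2 through 4 in the capture, where the SYN+ACK and the RST come from different MAC addresses.
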