Saturday, August 11, 2012

Issue: Load Balancer Errors Occur When Idle and Then Receiving a New Request

Today's unexpected load balancer behavior!

It has not been resolved yet, but here is what packet analysis in Wireshark shows so far:

Phenomenon:

When the VIP has received no calls for a certain period and a new call then arrives, that call fails with a communication error.

IP addresses are replaced with placeholders:

  • WEB: Webserver IP
  • API: Web API Server IP

| No | Source | Destination | Source MAC | Destination MAC | Flag | Sequence | Window | Length | MSS | ACK | WS | SACK Perm | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | WEB | API | MS_b7:1c:10 | Radware64 | SYN | 0 | 8192 | 0 | 1460 |  | 256 | 1 | Initiating communication with API server, are you ready? |
| 2 | API | WEB | MS_b7:1c:39 | MS_b7:1c:10 | SYN+ACK | 0 | 8192 | 0 | 1460 | 1 | 256 | 1 | Ready! Send it over! |
| 3 | WEB | API | MS_b7:1c:10 | Radware64 | ACK | 1 | 131328 | 0 |  | 1 |  |  | Please execute this API request. |
| 4 | API | WEB | MS_b7:1c:38 | MS_b7:1c:10 | RST | 1 | 0 | 0 |  |  |  |  | What nonsense is this? |
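
One detail in the capture is easy to miss by eye: the SYN+ACK (No 2) and the RST (No 4) both come from the API address but carry different source MACs. The following is a minimal sketch of checking for that, assuming the relevant columns are transcribed by hand into tuples. The names `PACKETS`, `macs_per_source`, and `sources_with_multiple_macs` are made up for this illustration and do not come from Wireshark or any other tool.

```python
from collections import defaultdict

# Rows transcribed from the capture above:
# (no, source, destination, source_mac, flag)
PACKETS = [
    (1, "WEB", "API", "MS_b7:1c:10", "SYN"),
    (2, "API", "WEB", "MS_b7:1c:39", "SYN+ACK"),
    (3, "WEB", "API", "MS_b7:1c:10", "ACK"),
    (4, "API", "WEB", "MS_b7:1c:38", "RST"),
]


def macs_per_source(packets):
    """Map each source placeholder to the set of source MACs seen for it."""
    seen = defaultdict(set)
    for _no, source, _dest, source_mac, _flag in packets:
        seen[source].add(source_mac)
    return dict(seen)


def sources_with_multiple_macs(packets):
    """Keep only the sources that sent packets from more than one MAC."""
    return {
        source: sorted(macs)
        for source, macs in macs_per_source(packets).items()
        if len(macs) > 1
    }


if __name__ == "__main__":
    # Prints {'API': ['MS_b7:1c:38', 'MS_b7:1c:39']}
    print(sources_with_multiple_macs(PACKETS))
```

A single logical address answering from two MACs inside one handshake suggests that two different machines took part in the same connection attempt.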

After the packet analysis results were shared with the infrastructure team, the following was found:
  • The root cause was an incorrect MAC address registration on the firewall that acts as the gateway for VIP 150.
  • Only the MAC address of the active L4 equipment should have been registered, but the backup L4 equipment's MAC address was also registered.
  • (This is an occasional issue during initial setup when the ARP table isn’t cleared thoroughly.)
  • Requests therefore went to both the active and the backup L4 equipment, so the packets of a single connection reached both servers at once and the sequence was broken.
  • As a result, the server that received the out-of-sequence packet dutifully answered with a RST.
  • After deleting the VIP 150 ARP table on the firewall and clearing ARP tables on each server, the issue seemed resolved.
  • The L4 load balancing was also restored to the round-robin method.
  • The issue was resolved after reconfiguring the ARP table properly.
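
The reset described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: each server tracks nothing but which clients are mid-handshake, and `route` stands in for whatever the two L4 devices decided for each packet. The class `Server` and the function `run_handshake` are hypothetical names for this illustration. How the real equipment distributed the packets is not known from the capture.

```python
class Server:
    """Toy TCP endpoint that only tracks handshake state per client."""

    def __init__(self, name):
        self.name = name
        self.half_open = set()    # clients that sent SYN and were sent SYN+ACK
        self.established = set()  # clients that completed the handshake

    def receive(self, client, flag):
        if flag == "SYN":
            self.half_open.add(client)
            return "SYN+ACK"
        if flag == "ACK" and client in self.half_open:
            self.half_open.discard(client)
            self.established.add(client)
            return None  # handshake complete, nothing to send back
        # A segment for a connection this server never saw is answered with RST.
        return "RST"


def run_handshake(route):
    """Send SYN then ACK from WEB; route[i] is the server that gets packet i."""
    replies = []
    for flag, server in zip(("SYN", "ACK"), route):
        replies.append((server.name, server.receive("WEB", flag)))
    return replies


if __name__ == "__main__":
    a, b = Server("api-1"), Server("api-2")
    print(run_handshake([a, a]))  # both packets reach the same server
    c, d = Server("api-1"), Server("api-2")
    print(run_handshake([c, d]))  # the ACK lands on a server with no state
```

When both packets reach the same server the handshake completes. When the ACK lands on the other server, that server has no record of the SYN and replies with RST, which matches the pattern in the capture where the SYN+ACK and the RST carry different source MACs.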