Datadog agents collect valuable performance-tuning data from the machines.
HAProxy load balancing is based on Route 53 DNS round robin (DNS RR), where each HAProxy is defined under the same A record name with a unique DNS RR set ID.
This first set of screenshots shows session data from two different HAProxies, taken in the same second. Looking at Sessions Cur (the current session count), you can see how unevenly connections are distributed between the HAProxies.
Proxy 2: heavily loaded (~377 sessions per web server)
Proxy 3: almost idle (Sessions Cur: 0-3)
If we look at the graph over time in Datadog, this is also clearly visible. Each color is a single HAProxy's session count. AWS Route 53 appears to balance in a strange way: it sends the first proxy's IP to *ALL* clients, then after a few seconds sends the second IP to all clients, and so on. As a result, all clients end up on the same HAProxy for a given time frame. Traffic is balanced over time, but each peak piles up on a single HAProxy, which is then prone to overload.
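The pattern we observed can be sketched with a toy simulation (proxy names and client counts below are made up for illustration, not taken from the actual shows):

```python
from collections import Counter

# Toy model of the observed behavior: in each time window Route 53
# effectively hands the SAME proxy IP to every client, rotating
# between proxies over time rather than spreading clients out.
PROXIES = ["proxy-1", "proxy-2", "proxy-3"]

def window_loads(clients_per_window: int, windows: int) -> list:
    """Session count per proxy, per time window."""
    loads = []
    for w in range(windows):
        # All clients in this window land on a single proxy.
        target = PROXIES[w % len(PROXIES)]
        loads.append(Counter({target: clients_per_window}))
    return loads

for w, load in enumerate(window_loads(1000, 3)):
    print(f"window {w}: {dict(load)}")
# Averaged over time the proxies look balanced, but within any single
# window one proxy carries all 1000 sessions while the others sit idle.
```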
Backend limits are calculated dynamically by the Chef configuration management using the formula:
(websrv_instance_type_coeficient / count_of_haproxies)
In this case, having more HAProxies actually does more harm: the maximum connections per backend is reduced, because the system assumes that if connections are properly balanced, the backend limits across all HAProxies sum up to the web server's real capacity. With uneven AWS DNS RR, however, traffic hits a single HAProxy at a time, which risks hitting that proxy's per-backend limit.
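A minimal sketch of that limit calculation and why it breaks down under uneven DNS RR (the coefficient value and function name here are ours, not the actual Chef code):

```python
def backend_limit(websrv_instance_type_coeficient: int,
                  count_of_haproxies: int) -> int:
    """Per-backend connection limit configured on each HAProxy.

    The division assumes traffic is spread evenly across all proxies,
    so the per-proxy limits sum back up to the web server's capacity.
    """
    return websrv_instance_type_coeficient // count_of_haproxies

# Hypothetical coefficient: a web server sized for 1200 connections.
print(backend_limit(1200, 4))   # 300 per proxy with 4 HAProxies
print(backend_limit(1200, 12))  # 100 per proxy with 12 HAProxies
# Under uneven DNS RR one proxy receives nearly ALL traffic for a
# while, so the effective ceiling drops to that single proxy's share
# (100 here), even though the web server could handle 1200.
```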
We were not able to reproduce this in our load tests, because they used a different record type: a single A record holding all the destination IPs, from which the client picks one and connects. According to AWS documentation, there is a limit of up to 8 IPs.
Q: “Does Amazon Route 53 support multiple values in response to DNS queries?”
A: “Route 53 now supports multivalue answers in response to DNS queries. While not a substitute for a load balancer, the ability to return multiple health-checkable IP addresses in response to DNS queries is a way to use DNS to improve availability and load balancing. If you want to route traffic randomly to multiple resources, such as web servers, you can create one multivalue answer record for each resource and, optionally, associate an Amazon Route 53 health check with each record. Amazon Route 53 supports up to eight healthy records in response to each DNS query.”
As a first step, we suggest:
→ Reduce the number of running HAProxies during a show to at most 8; we suggest four. This way, even if customers end up on the same proxy, the per-backend, per-proxy limits will be higher. According to the data, the proxy machines themselves are not under heavy load and can handle more requests. The only limitation is the active TCP connection ceiling of around 65,000 TCP sessions per HAProxy, which means 4 should easily cover 200K open connections.
→ Switch to a single DNS entry with multivalue answers. This has been verified in load tests and shows a better distribution of requests.
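The improvement we saw in the load tests matches what you would expect when each client independently picks one IP from a multivalue answer. A toy simulation (client behavior varies in practice; this assumes a uniform random pick, with a fixed seed for reproducibility):

```python
import random
from collections import Counter

def multivalue_distribution(proxies, clients, seed=42):
    """Each client receives the full list of proxy IPs (up to 8 with
    Route 53 multivalue answers) and picks one at random."""
    rng = random.Random(seed)
    return Counter(rng.choice(proxies) for _ in range(clients))

dist = multivalue_distribution(
    ["proxy-1", "proxy-2", "proxy-3", "proxy-4"], 200_000)
print(dist)
# Each proxy ends up with roughly 50,000 sessions, comfortably under
# the ~65,000 TCP session ceiling per HAProxy mentioned above.
```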
As a long term solution:
→ Migrate to ELB (Elastic Load Balancer), which does real load balancing.
We implemented the switch to multivalue DNS, and we also made sure the proxies' limit calculation accounts for the possibility of uneven request distribution caused by the DNS round robin mechanism. After our fixes, the next few shows ran really smoothly, and producers no longer received complaints from viewers who couldn't vote. That felt great after all the sweat we'd put into this performance tuning.
What do you think? Leave a comment below! ;)