Performance tuning of a fully automated AWS environment started only on schedule

We were asked to performance-tune a large AWS environment. It hosts the server side of a mobile application for a TV show that lets viewers vote on various questions during the broadcast. Since the show airs only once per week, it is a perfect use case for an on-demand cloud environment that is brought up only for the show; after it ends, all VMs are shut down or destroyed. This keeps costs to a minimum, with no need to keep expensive physical servers running at all times in a private data center.

You can imagine the load: the show usually has between 200,000 and 300,000 viewers, and they all need to vote in parallel within one minute on a given question. This is a serious challenge for the system.

That’s why the environment is spawned with Chef and AWS OpsWorks within minutes, a couple of hours before the show. Thanks to the sophisticated automation, the environment is highly scalable and the number of machines can be adjusted to the expected load. In total we have 15 HAProxies, 30 web servers running the Django apps, and a dozen cache servers running memcached. Data is stored in AWS Aurora and DynamoDB, along with the Redis-based Amazon ElastiCache service. Some tasks can be carried out asynchronously, and RabbitMQ and Celery are used for those. All pictures and static content are kept in S3.

Datadog agents on the machines collect valuable data for the performance tuning.

Balancing across the HAProxies is based on Route 53 DNS round robin (DNS RR): each HAProxy is defined with the same A record name and a unique DNS RR set ID.

This first set of screenshots shows session data from two different HAProxies taken in the same second. If you look at Sessions Cur (the current session count), you can see how unevenly connections are distributed between the HAProxies.

Proxy 2: heavily loaded (~377 sessions per web server)

Proxy 3: almost idle (Sessions Cur: 0-3)

If we look at the graph over time in Datadog, this is also clearly visible. Every color is the session count of a single HAProxy. It seems that AWS Route 53 balances in a strange way: it sends the first proxy IP to *ALL* the clients, then after a few seconds it sends the second IP to all clients. As a result, all clients end up on the same HAProxy for a certain time frame. The load is balanced over time, but each peak piles up on a single HAProxy, which is then prone to overload.

Backend limits are calculated dynamically by the Chef configuration management using the formula:
 (websrv_instance_type_coeficient / count_of_haproxies)

In this case, having more HAProxies actually does more harm, because it reduces the maximum connections per backend on each proxy. The formula assumes that connections are properly balanced, so that the backend limits of all HAProxies add up to the full capacity. With the uneven AWS DNS RR, however, traffic reaches a single HAProxy at a time, which brings the risk of hitting the backend limit of that one proxy.
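
To make the effect of the formula concrete, here is a minimal Python sketch. The coefficient value of 6000 and the function name are made up for illustration, and the integer division is an assumption; the real values live in the Chef configuration.

```python
def backend_limit_per_proxy(instance_type_coefficient: int, haproxy_count: int) -> int:
    """Per-backend connection limit that each HAProxy is configured with.

    Mirrors the formula from the text; rounding down is an assumption.
    """
    if haproxy_count < 1:
        raise ValueError("haproxy_count must be at least 1")
    return instance_type_coefficient // haproxy_count


# Hypothetical coefficient, NOT taken from the real Chef configuration.
COEFFICIENT = 6000

for proxies in (15, 8, 4):
    limit = backend_limit_per_proxy(COEFFICIENT, proxies)
    # If DNS sends everyone to one proxy, only this single limit is usable.
    print(f"{proxies:>2} proxies -> limit per backend on each proxy: {limit}")
```

With these made-up numbers, a single proxy may open 400 connections per backend when 15 proxies are running, but 1500 when only four are. If all clients land on one proxy, that per-proxy number is all the capacity the backend can actually use.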
In our load tests a different type of record is used, where a single A record holds all the destination IPs and the client picks one and connects to it, so we were not able to reproduce this there. However, according to the AWS documentation, there is a limit of 8 IPs per answer.

Q: “Does Amazon Route 53 support multiple values in response to DNS queries?”
A: “Route 53 now supports multivalue answers in response to DNS queries. While not a substitute for a load balancer, the ability to return multiple health-checkable IP addresses in response to DNS queries is a way to use DNS to improve availability and load balancing. If you want to route traffic randomly to multiple resources, such as web servers, you can create one multivalue answer record for each resource and, optionally, associate an Amazon Route 53 health check with each record. Amazon Route 53 supports up to eight healthy records in response to each DNS query.”
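
The difference between the two behaviours can be illustrated with a small simulation. This is only a toy model under our own assumptions: time is split into "ticks", every client resolves once per tick, the rotating mode hands the same single IP to everyone in a tick (as we observed), and in the multivalue mode each client picks one IP at random from the answer. The IPs are documentation addresses and all names are hypothetical.

```python
import random
from collections import Counter


def peak_load_rotating(proxy_ips, clients_per_tick, ticks):
    """Peak per-proxy load when all clients in a tick receive the same IP."""
    peak = 0
    for tick in range(ticks):
        per_proxy = Counter()
        answer = proxy_ips[tick % len(proxy_ips)]  # one IP for everybody
        per_proxy[answer] += clients_per_tick
        peak = max(peak, max(per_proxy.values()))
    return peak


def peak_load_multivalue(proxy_ips, clients_per_tick, ticks, rng):
    """Peak per-proxy load when each client picks randomly from the answer."""
    answer = proxy_ips[:8]  # at most eight records are returned per query
    peak = 0
    for _ in range(ticks):
        per_proxy = Counter(rng.choice(answer) for _ in range(clients_per_tick))
        peak = max(peak, max(per_proxy.values()))
    return peak


proxies = ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"]
rng = random.Random(42)  # fixed seed so runs are repeatable

print("rotating  :", peak_load_rotating(proxies, 10_000, 5))
print("multivalue:", peak_load_multivalue(proxies, 10_000, 5, rng))
```

In the rotating mode the peak on one proxy equals the full number of clients in a tick, while in the multivalue mode it stays close to a quarter of that with four proxies.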


As a first step, we suggest:

→ Reduce the number of HAProxies running during a show to at most 8; we suggest four. This way, even if clients end up on the same proxy, the per-backend, per-proxy limits will be higher. According to the data, the proxy machines themselves are not under heavy load and can handle more requests. The only limitation is the active TCP connection limit of around 65,000 sessions per HAProxy, so four proxies provide roughly 260,000 sessions, which covers 200K open connections.
→ Switch to a single DNS entry with multivalue answers. This has been tested in load tests and shows a better distribution of requests.
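
For the second suggestion, here is a rough sketch of how the multivalue records could be described for the Route 53 API. It follows the AWS FAQ quoted above (one multivalue answer record per resource, all under the same name). The record name, the IPs, the TTL and the `haproxy-N` set identifiers are placeholders, not our real configuration, and health checks are left out for brevity.

```python
def multivalue_change_batch(record_name, proxy_ips, ttl=60):
    """Build a Route 53 ChangeBatch with one multivalue A record per proxy.

    Only builds the request body; nothing is sent to AWS here.
    """
    changes = []
    for index, ip in enumerate(proxy_ips, start=1):
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                # Each record needs its own identifier (naming is hypothetical).
                "SetIdentifier": f"haproxy-{index}",
                "MultiValueAnswer": True,
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip}],
            },
        })
    return {"Comment": "multivalue answers for HAProxies", "Changes": changes}


batch = multivalue_change_batch(
    "vote.example.com.",  # placeholder name
    ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"],
)
print(len(batch["Changes"]), "records prepared")
```

A dictionary like this could then be passed as `ChangeBatch` to boto3's `change_resource_record_sets` call together with the hosted zone ID. In an automated setup it would more likely be generated from the same Chef data that defines the proxies.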

As a long-term solution:
→ Migrate to ELB (Elastic Load Balancing), which does real load balancing.

We implemented the switch to multivalue DNS, and we also made sure that the proxies' limit calculation takes into account the possibility of unevenly distributed requests caused by the DNS round robin mechanism. After our fixes, the next few shows went really smoothly and the producers no longer received complaints from viewers about being unable to vote. That felt great after all the sweat we had put into this performance tuning.

What do you think? Leave a comment below! ;)