The Dream11 Tech Behind Creating a Seamless ‘Dream11 IPL’ Experience

Blog/ October 15, 2020

In our country, cricket has always been more than a sport. And right from its inception, the Indian Premier League has held the “premier” spot in our hearts. The fans’ love for the league, their craze for the players and their endless cheering for teams have grown manifold over the years. This connection between the league and its fans cannot be summarised in a few words.

The same goes for everyone here at Dream11. Each year, the IPL serves as a defining event for us. From setting new benchmarks to opening new horizons, the league is the most awaited sporting event for our tech teams. With the onset of the COVID pandemic, sporting events were cancelled throughout the world, and the news of a postponed IPL was as disappointing for us at Dream11 as it was for the fans. Then came the much-awaited announcement of IPL 2020 – the longest edition of the IPL to date! And as if the return of sports wasn’t good enough, we won the title sponsor rights for the year, making this IPL the Dream11 IPL!

During this time, Dream11 IPL 2020 was the only major live cricket entertainment available to fans – and a big opportunity for all the tech teams at Dream11 to provide a seamless and secure app experience to our users. Being named the title sponsor meant more responsibility for everyone, and every single team was revved up to go. We mobilised to relook at our systems, identify gaps and gear up for action, because the scale of inbound traffic had just reached a new level – unprecedented and unpredictable!

A real-time report on inbound traffic

Another challenge was to improve processes for every system while working remotely. With dedicated war rooms in place for every team, we were armed and ready to handle all situations with confidence. Despite working remotely from different parts of the country, our tenacity to spend hours collaborating, fixing processes and streamlining systems, while keeping each other’s morale high, kept us surging ahead. This is also what defines our culture at Dream11. But as always, there is more to it than meets the eye. This blog describes how we, the teams of Dream11, prepared to handle this spectacular event.

Creating the Infrastructure Backbone of Dream11

Dream11’s fantasy sports platform is built by smaller, self-sustaining teams called Dream Teams working in synchrony. With almost 100 microservices deployed among them, provisioning resources and sustaining the scale of inbound traffic for these Dream Teams was a humongous challenge. Along with the constant flow of users, which is itself a huge number, handling sudden spikes in traffic was also a big task.

An example to demonstrate the spike in traffic during special events

We needed infrastructure capable of handling these spikes, and that could only be achieved by creating multiple parallel load-testing environments that gave us a baseline estimate of the traffic.

An overview of the baseline traffic estimation process

The team faced another challenge right off the bat – creating a platform that allowed all the different Dream Teams to sustain traffic in the most elegant fashion. We were responsible for a unified infrastructure that let the Dream Teams scale within a few minutes. By packaging code artifacts into Amazon Machine Images, we fine-tuned parameters to make provisioning faster at auto-scaling time. This let the auto-scaling group run on 100% spot instances at optimised cost.
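The shape of such an auto-scaling setup can be sketched as follows. This is a minimal illustration only – the group and launch-template names are hypothetical, not Dream11's actual configuration – showing how a pre-baked AMI plus a 0% on-demand share yields a 100%-spot fleet:

```python
def build_asg_request(service: str, min_size: int, max_size: int) -> dict:
    """Build the parameter set for a hypothetical auto-scaling group backed
    by a pre-baked AMI, with the instance distribution set to 100% spot."""
    return {
        "AutoScalingGroupName": f"{service}-asg",
        "MinSize": min_size,
        "MaxSize": max_size,
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                # The launch template points at an AMI that already contains
                # the code artifact, so instances serve traffic as soon as
                # they boot.
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": f"{service}-lt"
                },
            },
            "InstancesDistribution": {
                # 0% on-demand above base capacity => the rest is spot.
                "OnDemandPercentageAboveBaseCapacity": 0,
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    }
```

A dict like this could then be passed to the cloud provider's create-auto-scaling-group API; baking the artifact into the image is what removes the slow bootstrap step from the scaling path.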

Handling Enormous Inbound Requests Proficiently

User concurrency surged unpredictably, but the team made predicting the workload easy. This allowed us to revamp our old model, which operated at only 40% compute capacity, leaving the remainder as a backup for traffic spikes. We created a new method that uses the system’s compute capacity efficiently. This automation and quick scaling reduced overall resource allocation and increased the team’s confidence to handle any kind of traffic spike.
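The efficiency gain can be illustrated with the standard target-tracking formula: instead of permanently reserving 60% headroom, the fleet is sized so observed utilisation converges to a target. This is a generic sketch with an assumed 70% target, not Dream11's actual scaling policy:

```python
import math

def desired_instances(current: int, utilisation: float, target: float = 0.7) -> int:
    """Target-tracking sketch: scale the fleet so it runs near `target`
    utilisation, rather than holding a fixed idle buffer.

    current      -- instances currently running
    utilisation  -- observed average utilisation of the fleet (0..1)
    target       -- desired utilisation after scaling (assumed 0.7 here)
    """
    # Classic target-tracking step: new_size = current * observed / target,
    # rounded up so we never scale below what the load requires.
    return max(1, math.ceil(current * utilisation / target))
```

At 90% utilisation a 10-instance fleet grows to 13; at 35% it shrinks to 5, releasing the capacity a fixed buffer would have wasted.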

The surge in traffic volume experienced by our platform this year

With such a quantum of traffic, we also had to make sure that an unforeseen failure in one service did not affect other parts of the system. Imagine keeping tabs on almost 100 microservices distributed across multiple teams with varied timelines – an impossible task indeed!

We believed otherwise! Going back to the drawing board, we understood the pain points and sprang into action, updating the system to be more reactive. This happened with two main additions: backpressure and circuit breakers. We configured our critical services to create backpressure when they receive more requests than they can handle, dropping requests at the edge. This kept the system predictable and reactive under any burst of traffic. With circuit breakers, on the other hand, services became self-securing: they stop sending requests to a failing upstream service for some time, protecting themselves from thread starvation until the upstream service recovers. This allowed the system to recover without affecting other services, ensuring streamlined operations while still handling the extra traffic and improving performance.
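The circuit-breaker half of this can be sketched in a few lines. This is a minimal illustration of the pattern, not Dream11's implementation; the threshold and cooldown values are arbitrary:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: after `threshold` consecutive
    failures the circuit opens and calls fail fast for `cooldown` seconds,
    so caller threads never pile up waiting on a failing upstream."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None => circuit closed

    def allow(self) -> bool:
        """Should the next call to the upstream be attempted?"""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: let one trial request through after the cooldown.
            self.opened_at = None
            self.failures = 0
            return True
        return False  # Open: reject immediately, no thread blocked.

    def record(self, success: bool) -> None:
        """Report the outcome of an attempted call."""
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

The key property is that while the circuit is open, `allow()` returns instantly instead of tying up a thread on a request that is likely to fail, which is exactly the thread-starvation protection described above.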

Another big challenge was calculating the ranks of user teams in real time as their points kept changing. Even though the number of people playing in a single contest grew to an unprecedented scale, we were able to sort, calculate ranks and distribute prizes even faster than last year.
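The core ranking rule – users with equal points share a rank – can be shown with a small sketch. This illustrates standard competition ranking over a points snapshot; the real leaderboard operates incrementally at far larger scale:

```python
import bisect

def ranks(points: list[float]) -> list[int]:
    """Standard competition ranking sketch: a user's rank is one plus the
    number of strictly higher scores, so equal points share a rank.
    One sort plus a binary search per user gives O(n log n) overall."""
    asc = sorted(points)  # ascending copy of the scores
    n = len(asc)
    # Scores strictly greater than p are the n - bisect_right(asc, p)
    # entries to the right of p's last occurrence.
    return [n - bisect.bisect_right(asc, p) + 1 for p in points]
```

For points `[100, 90, 100, 80]` this yields ranks `[1, 3, 1, 4]`: the two 100-point teams tie for first, so the next team ranks third.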

Real-time comparison of the improvement done to the leaderboard

Optimising for Those Extreme Cases

Though infrastructure and performance were in check, the team still had to ensure a solid user experience in the unlikely scenario of a system failure. This required us to fine-tune the clients. The mandate was to minimise the drop in a user’s happiness quotient if the system failed. For this, we introduced the principle of graceful degradation.

The idea had two aspects. First, instead of making one huge API call for a page, we now make multiple smaller requests, one for each sub-section of the page. Though this increased the total number of requests to the backend, it was well within what our systems could handle, and making the requests more pointed let each sub-section of the page handle failures locally. Second, instead of surfacing server-side error messages such as “502 Bad Gateway” in a toast message, we customised the user interface for each error state with a less disappointing infographic.
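Both aspects can be sketched together: fetch each sub-section independently and swap in a friendly fallback where a fetch fails. The section names and fallback copy below are purely illustrative:

```python
def render_page(fetchers: dict, fallbacks: dict) -> dict:
    """Graceful-degradation sketch: each sub-section of a page is fetched
    with its own small request; a failure in one section falls back to a
    friendly placeholder instead of failing the whole page."""
    page = {}
    for section, fetch in fetchers.items():
        try:
            page[section] = fetch()
        except Exception:
            # e.g. show a custom infographic rather than "502 Bad Gateway"
            page[section] = fallbacks.get(section, "Something went wrong")
    return page
```

Because the failure is contained at the section boundary, a broken contests service degrades one card on the home screen while the rest of the page renders normally.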

Using the principle of Graceful degradation in our application

By taking the happiness quotient into account, the team was able to handle errors more efficiently without letting the user experience take a hit. For an app like Dream11, user experience is a big factor, so by keeping track of all requests and failures, we delivered a streamlined app to our users.

The next critical task was to think about our 95th- and 99th-percentile users. These users are typically on a 2G connection with low-end devices, trying to make a team as the round clock counts down to the match start time. Adhering to our fair play policy, we did not want these users to lose any edge in a contest just because they were on a poor network.

To do this, we decided to build a system that would find the most optimised way of using the network – an Intelligent Network Manager, if you may! We leveraged HTTP/2’s persistent connections and its compression and multiplexed data transmission capabilities to prioritise certain calls over others. Reusing the same connection reduced the number of SSL handshakes. We also had use cases where too many calls were choking the connection; the new system let us throttle such requests by keeping only the latest data, or batch them based on internal heuristics.
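One of those heuristics – "keep only the latest data per endpoint, then flush in one batch" – can be sketched as below. The class and endpoint names are hypothetical, shown only to illustrate the throttling idea:

```python
class LatestOnlyThrottle:
    """Sketch of a keep-latest throttle: when calls arrive faster than a
    slow connection should carry them, retain only the most recent payload
    per endpoint and send everything as one batch on the next flush."""

    def __init__(self):
        self.pending = {}  # endpoint -> latest payload

    def submit(self, endpoint: str, payload: object) -> None:
        # A newer payload for the same endpoint supersedes the older one,
        # so a stale score update never wastes bandwidth on a 2G link.
        self.pending[endpoint] = payload

    def flush(self) -> list:
        """Drain pending requests as one batch for the multiplexed connection."""
        batch = list(self.pending.items())
        self.pending.clear()
        return batch
```

On a poor network, three rapid score updates collapse into a single request carrying only the freshest data, which is exactly the choke-prevention behaviour described above.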

Recorded improvement in the performance of Dream11 app

In an app environment where a delay of even a single second creates a different impact among users, this rate of improvement served as a massive contribution to our fair play policy!

Testing All Services With Perfection

When it comes to the overall performance of an app with over 90 million users, application stability is crucial for the entire team. Achieving it is a herculean task in itself, and the added responsibility of about 100 backend services, a thousand APIs and app releases across four clients – Android, iOS, Desktop and PWA – didn’t make it any easier. Imagine doing all this while also keeping up with frequent releases, knowledge of all the systems and quality checks. Phew! Calling it a challenge is an understatement. But this is what our team had to tackle.

Every user has a different device with a different configuration, so testing in itself can be quite challenging. Even a downtime of a few seconds during a crucial event could result in a bad user experience, and this was not something we could compromise on. The team therefore needed to be 100% sure about compatibility, reliability and availability before releasing anything for deployment. Regression time during a key event like the IPL also factors into the mix. But the team had an ace up its sleeve: by building versatile automation at every layer, we made catching bugs easier, with the automation at each layer acting as the gatekeeper. We had been working for a year on Cucumber for BDD tests and mobile app automation using Appium to reduce manual testing. With this in place, its integration with Jenkins made validating releases and integration builds of any unfamiliar service very easy. The result was nothing less than an achievement: the team churned out better-quality dev builds, allowing an app release every week!
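The "automation as gatekeeper" idea reduces to a simple rule: a build is promoted only when every layer's suite passes. A tiny sketch, with layer names assumed for illustration:

```python
def release_gate(suite_results: dict) -> tuple:
    """Gatekeeper sketch: promote a build only when every layer's automated
    suite (e.g. unit, API, UI) passes; otherwise report the blocking layers
    so the CI pipeline can fail the build with a clear reason."""
    failing = [layer for layer, passed in suite_results.items() if not passed]
    return (len(failing) == 0, failing)
```

A CI job (Jenkins, in the team's setup) would run each suite, feed the results into a check like this, and block the weekly release on any failure.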

Results of optimising the testing process

Managing a Quantum of Data With Ease

As part of the sports tech company Dream Sports, Dream11 has been witnessing data growth on a massive scale since inception. However, our data setup, a combination of in-house and third-party data platforms, could not keep up with the magnitude of data we came to collect. We understood from the beginning that, as a brand, we wanted to ensure complete security and transparency of user data, and with a third-party data source this would not be possible.

The reasons behind the inception of Data Highway

That was not how Dream11 wanted to operate! So we embarked on a journey to create an in-house, end-to-end data platform that tackled all these problems while ensuring complete data security and transparency. Data Highway came into existence for this very reason. The team had been preparing to handle an enormous surge of data, and the IPL posed a challenge for all of us. We therefore began building a new system that follows the Lambda architecture, allowing horizontal scaling and the flexibility to offer 360-degree real-time analytics on user concurrency, transaction rates and more.
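At its core, a Lambda architecture serves a query by merging a precomputed batch view with a fast, incremental speed-layer view covering events since the last batch run. A minimal sketch of that serving-layer merge, with metric names assumed for illustration:

```python
def serve_metric(batch_view: dict, speed_view: dict, key: str) -> int:
    """Lambda-architecture serving sketch: the answer to a query is the
    precomputed batch-layer value plus the speed-layer increment that has
    accumulated since the last batch recomputation."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)
```

The batch layer gives accuracy over the full history while the speed layer keeps the number current to the last few seconds; recomputing the batch view periodically resets the speed layer, which is what makes the design horizontally scalable.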

A complete overview of the progress made by our team

An Iron-Clad Protection For The Entire System

Being a lean team, when our security team heard that we were to serve more than 8 million concurrent users this season, their first thought was to regroup and identify the areas of our security that needed improvement. The team dived deep to identify possible roadblocks, checkpoints and tasks as a way to strategise before the start of this season of the Dream11 IPL. With not much time, it was all hands on deck from the start. For application and cloud infrastructure, we took an automation-first approach and focused on three key areas: detect, protect with self-correction, and monitor. Once we identified the security loopholes in our system, the next steps were relatively easy: we tackled each issue head-on and tweaked the system to reach the maximum level of protection.

An increased efficiency in handling WAF logs for large volume of user concurrency

However, we didn’t want to stop there! Our goal was to make the system more resilient, self-aware and intelligent. To do this, we pooled our resources to create an automated anomaly detection, correction and alerting system covering our entire platform.
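The detection half of such a system can be as simple as flagging readings that sit far outside the recent baseline. A generic z-score sketch – the threshold is an assumed value, and a production system would layer correction and alerting on top:

```python
import statistics

def is_anomalous(history: list, value: float, z: float = 3.0) -> bool:
    """Anomaly-detection sketch: flag a reading more than `z` standard
    deviations from the mean of recent history (e.g. WAF request rates).
    `z=3.0` is an assumed threshold, not a tuned production value."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # A perfectly flat baseline: any deviation at all is anomalous.
        return value != mean
    return abs(value - mean) / stdev > z
```

Feeding a rolling window of log-derived metrics through a check like this is what lets alerting fire on a traffic anomaly before a human notices it.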

The application our users enjoy today is the result of ardent effort, meticulous preparation and teamwork by all the Dream Teams. The culmination of their hard work and deep expertise is reflected in the overall quality of Dream11. While it may look simple from the outside, the work behind every single button tap, every screen transition and every data request cannot be easily quantified. At Dream11, we work hard every day to make fantasy sports an enjoyable experience for all our users.
