Conf42 Platform Engineering 2023 - Online

Building a scalable ecosystem for high-loaded multiplayer game

Abstract

Discover the inner workings of successfully managing a large-scale multiplayer game. This presentation will delve into the tools and techniques required to ensure the smooth operation of such a massive system. We will cover effective methods for organizing continuous integration and delivery to improve response times to any potential issues.

Additionally, we will explore the importance of breaking down backend infrastructure into independent services to enhance performance and resilience. This approach allows for seamless handling of multiple service failures.

Furthermore, we will discuss the crucial role of a cohesive team in this process and how to efficiently plan releases. Implementing the techniques we cover can make a significant impact on the product lifecycle, promoting higher productivity and efficiency.

Finally, we will illuminate the benefits of setting up pipelines in GitLab, a valuable resource for code validation and error handling. By taking preventative measures instead of reacting to potential issues, these methods can revolutionize the way we handle problems.

Upon completion of this presentation, you will have gained valuable insights into the intricacies of managing a high-volume multiplayer game and the strategies that can simplify the process. Whether you work with multiplayer games or have an interest in scalable system design, this talk will provide you with actionable information to help you succeed.

Summary

  • Dmitrii Ivashchenko is a software engineer at MY.GAMES. He talks about the pillars of his company's development and deployment processes. He says automation is at the heart of their operations. TeamCity serves as a linchpin in their CI/CD architecture.
  • The next big part of our CI is Git hooks. This client hooks system consists of several components; each serves a specific function in the version control workflow. These tools serve as proactive measures to maintain a high standard of coding practices.
  • Our current architecture combines various systems and components coordinated through Photon Cloud for real-time multiplayer gaming. The gaming platform incorporates a range of specialized services to enhance user experience. The architecture consists of several key components, each designed to handle specific functionality.
  • Our game update system achieves zero downtime, ensuring an uninterrupted game experience for our players. Our setup involves three main elements: the client and two servers named Alpha and Beta. The objective is to transition game traffic seamlessly from Alpha to Beta, all while ensuring that players experience zero interruptions.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Dmitrii Ivashchenko and I am a software engineer at MY.GAMES. I have specialized in mobile game and backend system development for over a decade, I am a member of the International Game Developers Association, and I author articles on Medium and Hackernoon. Let's talk about the pillars of our stack that ensure our operations run smoothly and efficiently. First on the list is our CI/CD organization, which is the backbone of getting a large development team up and running harmoniously. This setup ensures that integration and delivery happen seamlessly, allowing for frequent updates without disrupting the user experience. Next, we focus on backend infrastructure, preparing it for scalability. As our user base grows, it's crucial to have a backend robust enough to handle increasing load while maintaining performance. And finally, let's not forget our blue-green deployment strategy. This method allows us to release new versions of our software without having to pause for maintenance or technical work. It means zero downtime and a smoother, more reliable experience for our users. Altogether, these elements form an integrated approach to development and deployment, setting the stage for both agility and reliability.

Let's start with CI/CD organization. In our technology stack, we've adopted some of the industry's best practices to ensure a seamless and efficient deployment process. We operate under the infrastructure as code paradigm, which allows us to manage and provision our technological infrastructure through machine-readable definition files, enabling rapid deployment and version control. For the sake of environment reproducibility, particularly for continuous integration and staging, we utilize containerization techniques. This enhances both portability and consistency across all stages of development. To further bolster our testing efforts, we employ an emulator farm that allows us to simulate various scenarios and environments. Simplification is key in our approach, as we've streamlined the deployment of both test and production replay servers to be as easy as a single click. Finally, we conduct a cohesive suite of checks at the merge request stage to ensure code quality and functionality. These practices collectively contribute to a robust, agile and highly dependable development ecosystem.

First, we use versioning to track changes in our infrastructure as code, giving us the ability to revert to previous states effortlessly. Typically, we use Git. Second, automation is at the heart of our operations. It frees us from manual toil and allows for agile adaptation. Third, our use of code for describing infrastructure guarantees repeatability and consistency, making transitions from testing to production environments seamless. Fourth, our code doubles as our documentation, offering transparency in understanding how our infrastructure is set up and configured. Fifth, scalability and adjustments are straightforward efforts: we simply modify the code and apply those changes, effectively future-proofing our infrastructure. Lastly, our code-based approach enhances collaboration among team members and enables rapid responses to ever-changing business requirements. These guiding principles form the cornerstone of our efficient, reliable and agile infrastructure management strategy. So with TeamCity, you can store all your configurations using Kotlin DSL.
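To make that concrete, here is a minimal sketch of what a TeamCity configuration stored as Kotlin DSL can look like. The project, build step and command names are illustrative placeholders, not the actual MY.GAMES setup.

```kotlin
import jetbrains.buildServer.configs.kotlin.*
import jetbrains.buildServer.configs.kotlin.buildSteps.script
import jetbrains.buildServer.configs.kotlin.triggers.vcs

version = "2023.05"

project {
    buildType(MergeRequestChecks)
}

// Hypothetical build configuration: runs validation checks on every pushed change.
object MergeRequestChecks : BuildType({
    name = "Merge Request Checks"

    vcs {
        root(DslContext.settingsRoot) // the repository that also stores this DSL
    }

    steps {
        script {
            name = "Validate configs and run tests"
            scriptContent = "./gradlew validateConfigs test" // placeholder command
        }
    }

    triggers {
        vcs { } // start a build for every detected change
    }
})
```

Because the configuration lives in the repository, it is versioned, reviewable and reproducible, which is exactly the infrastructure-as-code property described above.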
With GitLab, you can do the same with YAML, and GitLab also offers the convenience of performing checks after pushing. TeamCity serves as a linchpin in our CI/CD architecture. It brings several advantages that streamline our development and deployment processes. One of its key strengths is seamless integration with a wide array of version control systems such as Git, SVN and Mercurial, ensuring that our development workflow remains fluid regardless of our choice of version control. Additionally, TeamCity's settings system is exceptionally flexible; it allows us to tailor our CI/CD pipelines according to specific needs and conditions, which truly sets it apart. Through its capability for virtualization and distributed builds, TeamCity dramatically accelerates the build process by intelligently allocating tasks across multiple agents, thereby reducing time to market and enhancing productivity. These features collectively make TeamCity an invaluable asset in our quest for a more agile, efficient and reliable tech environment.

GitLab runs a series of tasks automatically for your project every time changes are made, ensuring continuous code testing and readiness for deployment. GitLab CI allows you to automate various stages of the deployment lifecycle, from code testing to building to deployment to production. One of its key features is the ability to create complex pipelines with multiple parallel and sequential tasks, making GitLab CI a powerful tool for development teams. Configuration is described using the GitLab YAML file, making the setup process transparent and easily customizable. GitLab CI is closely integrated with GitLab itself, simplifying the setup, monitoring and management of all aspects of CI/CD.

For auto-builds, build configurations are added for which automatic building is triggered based on certain conditions, for example, based on time. At the moment the automatic build is triggered, critical validation and building of server configs are carried out first. After that, the server is built and clients for the three main platforms are built as well, and after successful completion of the builds, the server is launched. After the server is successfully launched, the client versions are uploaded to App Center so they can be downloaded later to devices. Thus, thanks to auto-builds, a fresh client-server pair with the latest uploaded changes appears in the project every one or two hours.

Here are the steps to build a server: first, assemble the server configuration; then build the server based on that configuration; and finally, deploy the server. This functionality is convenient for testing new features, fixing bugs, or conducting experiments in an isolated environment without affecting the main development process. As a result of the build, the server configuration and a cloud virtual machine are created and the server is launched. A unique virtual machine name is assigned to each user in TeamCity, and in the client you can select the login address from the list of shards to connect to the server. Created servers are deleted twice a week, on Thursday night and Saturday night, and these on-demand servers provide flexibility and autonomy in the development and testing process. The development process utilizes a merge request system to ensure that individual changes do not break the build or obstruct others' work.
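Before moving on to the merge request flow, here is a rough sketch of how such a scheduled auto-build chain could be expressed in TeamCity's Kotlin DSL. The build configuration names, scripts and the simple daily trigger are placeholders (the talk describes runs every one to two hours), so treat this as an illustration of chaining builds rather than the actual pipeline.

```kotlin
import jetbrains.buildServer.configs.kotlin.*
import jetbrains.buildServer.configs.kotlin.buildSteps.script
import jetbrains.buildServer.configs.kotlin.triggers.schedule

version = "2023.05"

project {
    buildType(BuildServerConfigs)
    buildType(BuildServer)
    buildType(BuildClients)
}

// Step 1: validate and assemble the server configuration.
object BuildServerConfigs : BuildType({
    name = "Build Server Configs"
    steps { script { scriptContent = "./ci/build_configs.sh" } } // placeholder script
})

// Step 2: build and launch the server from the assembled configuration.
object BuildServer : BuildType({
    name = "Build And Launch Server"
    steps { script { scriptContent = "./ci/build_and_launch_server.sh" } } // placeholder script
    dependencies { snapshot(BuildServerConfigs) {} } // runs only after step 1 succeeds
})

// Step 3: build the clients and upload them to App Center.
object BuildClients : BuildType({
    name = "Build Clients And Upload"
    steps { script { scriptContent = "./ci/build_clients_and_upload.sh" } } // placeholder script
    dependencies { snapshot(BuildServer) {} } // runs only after step 2 succeeds
    triggers {
        schedule {
            schedulingPolicy = daily { hour = 3 } // a simple time-based trigger for the whole chain
        }
    }
})
```

Triggering the last configuration in the chain pulls in its snapshot dependencies, so a single timer produces the fresh client-server pair described above.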
Developers create a new branch from an updated develop head, make some commits related to specific tasks or bugs, and then push these changes to the target branch. A merge request is then created either immediately after the first push or right before merging into the develop branch. The merge request undergoes various automated checks and possibly manual reviews. Once approved, the merge request can be automatically merged when all tests pass, freeing the developer to move on to other tasks. If a merge request becomes outdated or irrelevant, it's essential to close it to prevent clutter and confusion in the list of active merge requests.

The validation system serves as an integrated tool within the Unity project for validating assets and configurations within its interface. Users can execute predefined or custom validators on selected assets or groups of assets. Validators examine specific aspects of an asset to ensure its integrity, such as broken links. A validator can be marked as critical to enforce checks during build and merge request pipelines. Results are displayed at the bottom of the window, and any errors are logged in Unity.

So the next big part of our CI is Git hooks. This client hooks system consists of several components, including the installer and the commit-message, pre-commit, post-commit, and post-rewrite hooks. Each serves a specific function in the version control workflow. The installer is a binary that automatically downloads and updates the hooks from a central repository, placing them in the local Git hooks directory, and it also creates scripts to trigger the hooks and keep them up to date. Commit-message checks the messages attached to commits, pre-commit performs a variety of file checks, and post-commit and post-rewrite mainly handle notifications after a commit or rebase operation. The protected branch hook restricts pushes to certain branches, essentially making them read-only and preserving their integrity. The rebase required hook ensures that before you can push your changes, you must rebase to have the most up-to-date version of the repository, thus minimizing conflicts. The new branch name hook validates the naming convention of new branches. Lastly, the message content hook enforces standards for commit messages through the use of regular expressions. These tools serve as proactive measures to maintain a high standard of coding practices.

Now let's dive into our backend infrastructure. In the architecture of the platform, a variety of tools and frameworks are employed to ensure efficient and robust operation. Eclipse Vert.x serves as the backbone for the messaging system and also provides clustered storage for shared data, offering a scalable solution for high-speed data handling. Complementing Vert.x is Hazelcast, an in-memory data grid that enhances performance and scalability. For data streaming and log management, Apache Kafka plays a crucial role as a data broker, allowing for real-time analysis and monitoring. On the database side, PostgreSQL is used to store persistent data, offering flexibility in data storage approaches. Lastly, Ansible is utilized for automating server application configuration, ensuring that the platform's diverse services are seamlessly integrated and easily manageable. The architecture of the platform consists of several key components, each designed to handle specific functionality. The account server is responsible for user authentication and maintains information regarding all connected game servers.
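Before walking through the individual components, here is a minimal sketch of the Vert.x-plus-Hazelcast pattern just described: one cluster node deploys a game-mechanic verticle that talks over the event bus. The verticle name, event-bus address and payload fields are hypothetical.

```kotlin
import io.vertx.core.AbstractVerticle
import io.vertx.core.Vertx
import io.vertx.core.VertxOptions
import io.vertx.core.json.JsonObject
import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager

// Hypothetical game-mechanic verticle: listens on the event bus and replies to requests.
class MissionVerticle : AbstractVerticle() {
    override fun start() {
        vertx.eventBus().consumer<JsonObject>("game.mission.complete") { msg ->
            val playerId = msg.body().getString("playerId")
            // ... mission reward logic would run here ...
            msg.reply(JsonObject().put("playerId", playerId).put("status", "ok"))
        }
    }
}

fun main() {
    // One node of the backend: Vert.x clustered through Hazelcast, so event-bus messages
    // and shared data can flow between nodes.
    val options = VertxOptions().setClusterManager(HazelcastClusterManager())
    Vertx.clusteredVertx(options)
        .onSuccess { vertx -> vertx.deployVerticle(MissionVerticle()) }
        .onFailure { err -> err.printStackTrace() }
}
```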
Next, the game server is the central hub for all game mechanics, logic and data, ensuring that the gameplay experience is consistent and engaging. On the administrative side, Game Tools Web serves as a comprehensive tool for managing both player accounts and server settings. Lastly, Game Tools ETL works behind the scenes and funnels game logs from Apache Kafka into the Game Tools database, thereby enabling robust data analysis and reporting.

The account server is an HTTP server with its own database and consists of several components. The authentication component is responsible for user authentication and distribution among the game servers and their front-end components. The billing component processes in-game purchases, and the game server configuration component is used to communicate with game servers, announce maintenance and perform other release tasks.

The game server comprises a cluster of either physical or logical nodes, each made up of multiple components and services that interact seamlessly through a common event bus. At the front end, we have the front-end component responsible for managing TCP connections and verifying clients via the account server. It serves as the main gateway for client-server communication. The dispatcher queues and delegates client requests and messages to the appropriate parts of the system. The scheduler plays a crucial role in time-sensitive game mechanics, providing time and subscriptions to various components. Our DB operation executor ensures smooth and asynchronous interaction with databases, while the resource system holds the configuration for game mechanics. Moreover, all server activities are diligently logged by our log system, which sends these logs to an Apache Kafka message broker for analysis. The server also hosts an array of specialized mechanics components, such as those for missions, quests and mail, making it a comprehensive and flexible platform for an immersive gaming experience.

Our game tool is an essential part of our comprehensive game management ecosystem. It consists of two pivotal components. First is Game Tools ETL, which stands for extract, transform and load. This component is responsible for siphoning off game logs from our Apache Kafka message broker, processing these logs and then persisting the transformed data into its own dedicated database. This ensures that we have a streamlined, reliable repository for game analytics and insights. The second part is Game Tools Web, an administrative tool designed to offer real-time access to essential service data.

The gaming platform incorporates a range of specialized services to enhance user experience. Photon is utilized for PvP and cooperative gameplay, while leaderboards offer a universal system for storing and ranking player achievements. The friend service takes care of the list of friends and provides referral information. Each player also has a player profile which gives a detailed account of their in-game activities and statistics. The replay service is a tool for storing gameplay replays. Additional functionality includes mail for in-game messaging, chat for broader social interactions including group settings, matchmaking for effectively pairing up players as opponents or teammates, clans for organized group activities, and push notifications to keep players updated with real-time information sent directly to their devices. Our current architecture combines various systems and components coordinated through Photon Cloud for real-time multiplayer gaming.
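Staying with Game Tools ETL for a moment, the extract-transform-load loop it performs can be pictured roughly like this. The topic name, consumer group and the transform/persist helpers are hypothetical stand-ins, not the real pipeline.

```kotlin
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

// Placeholder transform step: in reality game log events would be parsed and enriched.
fun transform(rawLog: String): String = rawLog.trim()

// Placeholder load step: in reality rows would be inserted into the Game Tools database.
fun persist(row: String) = println("would insert: $row")

fun main() {
    val props = Properties().apply {
        put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")
        put(ConsumerConfig.GROUP_ID_CONFIG, "game-tools-etl")
        put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer::class.java.name)
        put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer::class.java.name)
    }

    KafkaConsumer<String, String>(props).use { consumer ->
        consumer.subscribe(listOf("game-logs")) // hypothetical topic carrying game logs
        while (true) {
            val records = consumer.poll(Duration.ofSeconds(1))
            for (record in records) {
                persist(transform(record.value())) // extract -> transform -> load
            }
        }
    }
}
```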
Photon Cloud offers low-latency data centers worldwide, and its versatility extends to applications beyond gaming, like text and video chat. Our primary data storage is PostgreSQL, because we find that relational databases are generally more reliable and easier to validate than other data storage models. For message brokering, we use Apache Kafka due to its out-of-the-box horizontal scaling and high reliability. We also use Hazelcast as an in-memory database that integrates with Vert.x, our framework for building reactive applications. Our stack includes Vert.x, which supports multiple programming languages and operates on the reactor pattern. Despite its benefits, Vert.x can lead to complicated code, especially if the language you are using isn't fully supported by the framework. In such cases, alternatives like the Quasar project could be considered, although Quasar wasn't being actively maintained when we began our project in 2017. For transactional operations, we've created a custom object that allows linear operations within message processing. This approach covers most of our use cases. During testing, we discovered and reported a log queuing issue in Vert.x, which the developers have since addressed. We monitor performance metrics using Prometheus and visualize them using Grafana. This setup helped us fine-tune our Vert.x configuration and resolve bottlenecks.

Our game cluster is a collection of machines running instances of Vert.x and Hazelcast, with each node running various game mechanics. These mechanics are encapsulated in Vert.x verticles which have different tasks, like game model loading or arcade tasks. To manage all this, we use a comprehensive admin interface. Scaling for performance is relatively straightforward: our current hardware can comfortably support 150,000 users, and if we reach CPU limitations, we can add servers to the cluster. Our PostgreSQL setup might be the first bottleneck in terms of scalability, but deferred synchronization with Hazelcast can alleviate this issue.

Now let's review our blue-green deployment process. So why did we choose to employ a blue-green deployment strategy for our operations? The decision to go with this approach wasn't taken lightly, as it does come with its own set of architectural and operational costs. However, in our specific context, the advantages clearly outweighed the expenses. There are two main reasons behind this decision. First and foremost, downtime is not just inconvenient, it's expensive. Even a minute of downtime can have significant financial implications for us. The second reason is unique to our focus on mobile games. When we publish a new version of our mobile game client, it needs to go through a store review process, which isn't instantaneous and can take several days. This means we absolutely need the ability to support multiple game server instances concurrently to align with the release cycles of mobile app stores. So the blue-green deployment strategy provides us with the flexibility and reliability we need to meet our business requirements.

Our setup involves three main elements: the client and two servers named Alpha and Beta. The objective is to transition game traffic seamlessly from Alpha to Beta, all while ensuring that players experience zero interruptions. This migration process involves not just the game servers and the clients, but also a specialized account server. The sole role of this account server is to provide a client with the appropriate game server address for connection.
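The mechanism about to be described can be condensed into a few lines: the account server knows each game server's status and always hands out the live one. This is an illustrative sketch with made-up names, not the actual service.

```kotlin
// Minimal model of the blue-green pair plus the account server's address resolution.
enum class ServerStatus { LIVE, STOPPED }

data class GameServer(val name: String, val address: String, var status: ServerStatus)

class AccountServer(private val servers: List<GameServer>) {
    // The client asks which game server to use; the answer is whichever is currently live.
    fun resolveServer(): String = servers.first { it.status == ServerStatus.LIVE }.address
}

fun main() {
    val alpha = GameServer("alpha", "alpha.game.example:7777", ServerStatus.LIVE)
    val beta = GameServer("beta", "beta.game.example:7777", ServerStatus.STOPPED)
    val account = AccountServer(listOf(alpha, beta))

    println(account.resolveServer()) // players connect to alpha

    // Update time: alpha is declared stopped, beta becomes live, alpha broadcasts "reconnect".
    alpha.status = ServerStatus.STOPPED
    beta.status = ServerStatus.LIVE
    println(account.resolveServer()) // reconnecting clients are now routed to beta
}
```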
The account server also keeps track of the servers' statuses, which is essential meta information that helps coordinate the switch. The goal is to make this transition as smooth as possible, so players remain blissfully unaware that any change has even occurred. Let's walk through how our game update system achieves zero downtime, ensuring an uninterrupted game experience for our players. Initially, the Alpha server is live while Beta is stopped. When a player enters the game, the client contacts the account server to find out which game server is currently active. The account server responds with the address of Alpha and the client connects accordingly. Now, when it's time to update, Alpha is declared as stopped and Beta is set to live. Alpha then sends a reconnect broadcast to all its connected clients. On receiving this, the clients re-establish their contact with the account server, which now provides the address of Beta. The client switches its connection to Beta seamlessly, all without the player noticing any disruption. Through this coordinated dance between the account server, Alpha and Beta, we effectively achieve zero downtime during server updates.

There are some areas for further enhancement. First, our quality assurance specialists have expressed the need for a final testing phase on the new version of the game server before players are allowed to join. Second, we want to allow the client to complete certain activities on the same game server where they started. To facilitate these improvements, we introduced a new server status called staging. During this staging phase, access to the game server is granted to a select group: our QA specialists for organizing final testing, and ordinary players who specify the preferred game server during the login request. This added layer of sophistication ensures both quality control and an enhanced user experience.

This is how our enhanced game update mechanism works. As illustrated in the given example, initially Alpha is live and Beta is stopped, with clients connected to Alpha. The first change occurs when Alpha remains live but Beta transitions to the staging status. This allows our QA team to perform final tests on Beta while keeping the bulk of the player traffic on Alpha. Once Beta clears QA, it becomes live and Alpha switches to staging. At this juncture, Alpha sends out a broadcast reconnect event. However, if a player is engaged in an activity like a battle, the client has the option to ignore the reconnect signal and stay on Alpha. Finally, when Alpha transitions to the stopped status, any new login attempts are directed towards Beta. This scheme offers flexibility for various update scenarios, whether we are rolling out a completely new game version or simply pushing updates to fix bugs in the existing version. This ensures both robust quality assurance and an uninterrupted gaming experience for our players.

While having multiple versions of servers running could technically allow players who haven't updated their game to continue playing, it introduces complexity: the need to maintain both forward and backward compatibility across different system components, like databases or inter-server interactions. To simplify this, we've adopted a strict versioning policy: a client of version X will only connect to a server of the same version X, and similarly for version Y. This approach eliminates the need for double work in maintaining protocol compatibility within the same version.
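Pulling together the staging status and the version policy just described, the account server's selection logic might look roughly like this. It is a condensed sketch with illustrative names, not the actual service code.

```kotlin
enum class ServerStatus { LIVE, STAGING, STOPPED }

data class GameServer(
    val name: String,
    val address: String,
    val version: String,
    var status: ServerStatus,
)

class AccountServer(private val servers: List<GameServer>) {
    fun resolveServer(clientVersion: String, preferredServer: String? = null): String? {
        // A staging server is handed out only when explicitly requested in the login,
        // which is how QA specialists and players finishing an activity opt in.
        val requested = servers.find {
            it.name == preferredServer && it.version == clientVersion && it.status != ServerStatus.STOPPED
        }
        if (requested != null) return requested.address

        // Everyone else is routed to the live server of the matching version, if any.
        return servers.find { it.version == clientVersion && it.status == ServerStatus.LIVE }?.address
    }
}

fun main() {
    val alpha = GameServer("alpha", "alpha.game.example:7777", "1.0", ServerStatus.LIVE)
    val beta = GameServer("beta", "beta.game.example:7777", "1.0", ServerStatus.STAGING)
    val account = AccountServer(listOf(alpha, beta))

    println(account.resolveServer("1.0"))          // regular login -> alpha
    println(account.resolveServer("1.0", "beta"))  // QA or opted-in player -> beta
}
```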
Server changes are permissible as long as they don't affect the client interaction protocol, giving us room for operational flexibility. As a result of this, the account server now needs to be aware of each game server's version, and the client is required to specify which version of the game server it wishes to connect to. This streamlines the system while allowing us ample space for ongoing improvements.

So version 2.0 is slated to replace the existing 1.0. Once our QA specialists have given Beta the green light, we initiate what we call a soft update. During this phase, Beta goes live and a fraction of players gain access to the 2.0 client via their respective app stores. If all goes well, with no critical bugs, we expand this to 100% of the player base. Contrary to the blue-green deployment strategy, the server of the previous version doesn't initiate any reconnections when a new version is rolled out. Now, if any issues surface, we employ the blue-green deployment process to transition players to a third server, Gamma, which contains the necessary fixes. Meanwhile, players on the 1.0 client can continue their session on Alpha. Ultimately, we initiate a hard update, shutting down Alpha and halting all 1.0 login attempts. Players are then prompted to update their clients to continue playing. This nuanced approach not only ensures a smooth transition, but also incorporates contingency plans for unexpected hiccups.

Our server update process has been streamlined to such an extent that it's entirely managed by our QA specialists using a straightforward Game Tools interface. Here is how it works in a nutshell: the account server, which holds its complete state in a database, is entirely stateless, making it highly robust and flexible. A QA specialist uses the Game Tools to instruct the account server to change Beta's status from stopped to staging. This is where a final check takes place, and once Beta is confirmed to be live, the same QA specialist uses the Game Tools to prompt Alpha to send a reconnect signal to all connected clients, initiating their migration to Beta. This approach offers a simplified, user-friendly method for QA specialists to manage the complex process of server updates, ensuring a seamless player experience while maintaining the integrity of our game service.

This approach not only allows us to roll out game service updates without downtime, but also enables us to quickly address game mechanics bugs and optimizations. Imagine a scenario where a critical error occurs during a particular game activity. Players aren't left in the lurch; they can still enjoy other aspects of the game while we rapidly deploy a fix. This ensures that their next attempt at the original game activity is likely to be error-free. Another advantage that deserves special mention is our ability to fix client-side bugs through the server. This is critical because updating the mobile client through the App Store takes time, making client-side bugs potentially more damaging than server-side ones. There have been instances where a minor adjustment to server responses effectively convinced the client to behave. While we can't always count on such fortunate outcomes, our blue-green deployment system remains our safety net even in the most unexpected of situations.
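Before wrapping up, the update flow that the QA specialist drives through Game Tools can be recapped as a short sequence of status changes plus one reconnect broadcast. This is a toy sketch with made-up helpers; in the real system the statuses live in the account server's database and Game Tools merely flips them.

```kotlin
enum class ServerStatus { STOPPED, STAGING, LIVE }

// Stand-in for the database that the stateless account server reads on every login request.
class StatusStore {
    private val statuses = mutableMapOf<String, ServerStatus>()
    fun set(server: String, status: ServerStatus) {
        statuses[server] = status
        println("$server -> $status")
    }
}

// Stand-in for the Game Tools action that asks a game server to broadcast "reconnect".
fun broadcastReconnect(server: String) = println("$server broadcasts reconnect to its clients")

fun main() {
    val store = StatusStore()
    store.set("alpha", ServerStatus.LIVE)
    store.set("beta", ServerStatus.STOPPED)

    // 1. QA moves beta to staging and runs the final checks there.
    store.set("beta", ServerStatus.STAGING)

    // 2. Checks passed: beta goes live, alpha is wound down to staging.
    store.set("beta", ServerStatus.LIVE)
    store.set("alpha", ServerStatus.STAGING)

    // 3. Alpha asks its clients to reconnect; the account server re-routes them to beta.
    broadcastReconnect("alpha")

    // 4. Finally alpha is stopped; remaining old-version logins are prompted to update.
    store.set("alpha", ServerStatus.STOPPED)
}
```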
So let's draw some conclusions. In summary, these three pillars of our tech stack work in unison to create a highly effective, agile, and dependable ecosystem for both deployment and development. First, our CI/CD organization acts as the spine of our development structure. It not only integrates a large team, but also allows for seamless updates without affecting end users. Second, our backend infrastructure is flexible and explicitly engineered for scalability, to meet the demands of a growing user base without sacrificing performance. And finally, our blue-green deployment strategy ensures zero downtime during software updates, giving our users a seamless and reliable experience. Collectively, these pillars establish a technology environment that is both state of the art and extraordinarily reliable. Thank you for your attention and see you next time.

Dmitrii Ivashchenko

Lead Unity Developer @ MY.GAMES
