Conf42 Site Reliability Engineering (SRE) 2025 - Online

- premiere 5PM GMT

SRE at Scale: Building Resilient Cloud Infrastructure for Higher Education ERPs

Abstract

Learn how we transformed brittle university ERPs into resilient cloud systems with 99.99% uptime and 78% fewer security incidents. Get battle-tested SRE strategies for handling extreme traffic spikes and building infrastructure that never fails—even during finals week.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, and welcome to SRE 2025. The topic I'll be talking about today is SRE at scale: building resilient cloud infrastructure for higher education enterprise resource planning systems. Welcome to this technical deep dive into site reliability engineering (SRE) practices for cloud-based enterprise resource planning systems in higher education. Today we'll explore how applying sophisticated reliability principles can transform fragmented legacy systems into resilient cloud architectures that deliver consistent performance during critical periods. Drawing from real-world implementations across various institutional settings, I will share the strategies that achieved 99.99% uptime during peak registration periods and reduced security incidents by 78%. Let's explore the technical foundations of reliable operations in educational environments.
Before we get into the details, let me give you some background on myself. My name is Sanjiv Bhagat. I studied at Visvesvaraya Technological University, and I have 21 years of experience working on projects around the globe. Currently I'm working as a program manager at Amazon in the United States, in Washington State. Prior to that I worked on many projects for the state in the higher education domain, and the experience I gathered there is what I'm going to share in this presentation.
Next, we are talking about the higher education ERP landscape. It can be divided into three major areas: cloud-based systems, which are the modern containerized solutions; hybrid architectures, which bridge cloud and on-premises; and legacy systems, which are the fragmented institutional databases that we have seen over the years.
We have seen different challenges with the legacy systems, which are fragmented across different campuses and don't share data with other campuses. Higher education institutions face unique challenges with their ERP implementations. Many begin with siloed departmental systems, complex integration needs, and technical debt. Legacy systems often struggle with modern security requirements and scaling demands. Security is a big concern in any industry, and especially in higher education, because educational institutions collect so much sensitive information from students and staff. The trend towards cloud-based ERP solutions offers tremendous benefits, but it requires careful architectural planning. Institutions must balance accessibility, compliance requirements, and the seasonal nature of academic workloads, all while maintaining constant uptime for critical services. That is what makes the higher education landscape challenging for implementing any ERP solution.
In this slide we are going to talk about the microservices architecture for educational workloads. What is this architecture? We are looking at containerization, then orchestration, then the service mesh, and then the security layer that is applied. What is containerization? It is Docker-based services with education-specific configuration, which is done so that users can consume the information being sent. So if we are working for an educational institute, it covers things like the logo of the institute and what information needs to be shared with a particular user.
Orchestration is Kubernetes clusters with automated scaling policies. Then we have the service mesh, which is Istio-powered traffic management with encrypted east-west traffic. How we manage traffic during peak periods is what the service mesh addresses. Then we come to the security layer, which is zero-trust network policy with identity verification, which is very critical to higher education.
Breaking monolithic ERP systems into containerized microservices has proven transformative for educational institutions. This architecture enables independent scaling of critical components during registration periods and financial aid processing seasons. What we have seen is that during the registration period, more resources need to be allocated so that students can register, and during financial aid processing we also need more resources allocated to those functions. That is how this architecture helps manage the workload during peak periods. We have implemented automatic failover configurations that maintain service continuity even when primary systems experience issues. These are challenges we have seen in monolithic ERP systems that can be overcome with a microservices architecture.
Next we will look at how we can establish education-specific SLOs. The first SLO we need for any higher education ERP implementation is student portal availability, which is key to the business of higher education: 99.99% uptime during registration periods with a 500-millisecond response time SLO, and lower thresholds during off-peak periods to optimize resource utilization and cost. The next is financial system reliability: 99.999% transaction consistency with two-phase commit protocols.
These financial services are prioritized during financial aid disbursement cycles and given dedicated infrastructure resources. The next is faculty service performance: grade submissions and course management operations with a 1.5-second maximum response time and 99.9% availability, with extended thresholds during non-teaching periods. Service level objectives for educational ERP systems must reflect the cyclic nature of academic operations. We developed custom error budgets that vary throughout the semester, allowing for more experimental deployments during quiet periods while enforcing strict change freezes during critical academic events. It is very important to have change freeze times so that changes do not impact the peak period. This approach has allowed IT teams to balance innovation with stability, and our dashboards provide real-time visibility into remaining error budgets, influencing deployment decisions and maintenance scheduling across multiple institutional stakeholders.
Moving on to the next one, which is the comprehensive observability stack. What are the components we need to look into? Metric collection: Prometheus-based telemetry with education-specific gauges. Log aggregation: a centralized ELK stack with PII-aware filtering. Alerting systems: PagerDuty integration with role-based escalation paths. Distributed tracing: a Jaeger implementation tracking cross-service transactions. Effective observability proved critical for maintaining ERP reliability. We implemented custom instrumentation for student-facing services, capturing not just technical metrics but also user experience indicators like registration completion rates and form submission successes. Our observability stack includes specialized monitoring for authentication frameworks.
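As an illustration of the semester-varying error budgets and change freezes described in the talk, here is a minimal Python sketch. It is not the speaker's implementation. The `PeriodSLO` class, the `deployment_allowed` helper, and the 99.9% off-peak target are assumptions made for the example; only the 99.99% registration target comes from the talk.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class PeriodSLO:
    """Availability objective for one part of the academic calendar (hypothetical type)."""
    name: str
    availability_target: float  # fraction of time the service must be up, e.g. 0.9999
    change_freeze: bool         # True when deployments are blocked outright

    def error_budget_minutes(self, days: int) -> float:
        # Allowed downtime is the complement of the availability target.
        return days * MINUTES_PER_DAY * (1.0 - self.availability_target)


def deployment_allowed(slo: PeriodSLO, days: int, downtime_minutes: float) -> bool:
    """Permit a deployment only outside freezes and while budget remains."""
    if slo.change_freeze:
        return False
    return downtime_minutes < slo.error_budget_minutes(days)


# 99.99% during registration is from the talk; the off-peak value is an assumed example.
REGISTRATION = PeriodSLO("registration", 0.9999, change_freeze=True)
OFF_PEAK = PeriodSLO("off_peak", 0.999, change_freeze=False)
```

Under these assumptions a 14-day registration window leaves roughly two minutes of allowed downtime, which shows why a freeze is the simpler rule for that period, while a 30-day off-peak window leaves about 43 minutes to spend on experimental deployments.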
Given the complex identity management needs in educational settings, we track federation performance across multiple identity providers and maintain visibility into single sign-on performance, with alerts triggered when authentication latency exceeds predefined thresholds. It's very important to have thresholds against which to measure those SLOs and SLAs.
Next, we are looking at chaos engineering in the educational context. First, game day exercises: structured failure simulation with cross-functional teams during maintenance windows. Next is automated failure injection: programmatic infrastructure and network degradation during non-critical periods. Next is scheduled resilience testing: quarterly comprehensive system tests with increasing complexity. Next is documentation updates: runbooks and recovery procedures refined after each exercise. These steps are critical because, even after implementation, these tests need to be done to make sure that when we go through the peak, the critical period of registration and financial aid awarding, the system is able to meet the needs of the educational institution.
Chaos engineering practices have proven invaluable for validating ERP resilience. We designed experiments that simulate realistic failure scenarios, including database unavailability during registration periods, identity provider outages, and network partitions between services. These controlled exercises reveal subtle dependencies and failure modes that weren't apparent in architectural reviews. By systematically injecting failures, we identified and remediated critical vulnerabilities before they affected production systems. Each chaos experiment resulted in improved recovery procedures and more resilient system design. That is why chaos engineering in the educational context is very relevant.
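The automated failure injection idea above, limited to non-critical periods, can be sketched in Python as follows. This is a simplified in-process illustration, not the tooling used in the talk. `FaultInjector`, `InjectedFailure`, and the `critical_period` flag are hypothetical names introduced for the example.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFailure(RuntimeError):
    """Raised in place of the real call when a fault is injected."""


class FaultInjector:
    """Wraps calls to a dependency and fails a fraction of them on purpose."""

    def __init__(self, failure_rate: float, critical_period: bool,
                 seed: Optional[int] = None) -> None:
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        self.critical_period = critical_period  # e.g. registration week
        self._rng = random.Random(seed)
        self.injected = 0  # count of simulated failures, useful for reports

    def call(self, operation: Callable[[], T]) -> T:
        # Never inject during critical academic periods.
        if not self.critical_period and self._rng.random() < self.failure_rate:
            self.injected += 1
            raise InjectedFailure("simulated dependency outage")
        return operation()
```

A game day run might wrap a database or identity provider client in `FaultInjector(failure_rate=0.2, critical_period=False)` and check that callers fall back as the runbook expects.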
Next is automated remediation workflows. Anomaly detection is the first step, then automated diagnosis, and then self-healing actions. Anomaly detection is machine-learning-powered identification of deviations from baseline. Automated diagnosis is pattern recognition with contextual analysis. Self-healing actions are Kubernetes operators implementing recovery patterns. Our implementation of automated remediation workflows reduced mean time to recovery (MTTR) by 65%. We developed a tiered approach to automation: level one remediations execute automatically without human intervention, while more complex level two and level three actions require progressive levels of approval and oversight. The system leverages machine learning to detect anomalies, specific to educational workloads, that precede incidents. When potential issues are detected, self-healing mechanisms can execute predefined playbooks, from simple pod restarts to complex database failovers and data integrity verifications.
Next is the resilient API gateway architecture, starting with legacy system integration. Our API gateway implementation includes specialized adapters for legacy systems with protocol translation and data transformation capabilities. These adapters maintain backward compatibility while enabling modern security practices. Circuit breakers prevent cascading failures when legacy components experience issues, gracefully degrading functionality rather than allowing complete outages.
Next is authentication and authorization. The gateway handles the complex authentication flows common in educational settings, including SAML federation with research institutions, OAuth integration with cloud providers, and legacy authentication methods. Fine-grained authorization policies enforce appropriate access controls while maintaining audit trails for compliance requirements.
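The circuit breaker behavior mentioned for the legacy adapters can be illustrated with a short Python sketch. It is a generic version of the pattern, not the gateway's actual code. The class name, the thresholds, and the `fallback` callable that supplies the degraded response are assumptions for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing dependency and serves a fallback instead."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                return fallback()  # open: skip the legacy call entirely
            # Timeout elapsed: half-open, let one trial call through below.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In a gateway, `operation` would be the call into the legacy adapter and `fallback` might return cached data or a reduced-functionality response, so that a struggling legacy component degrades one feature instead of the whole portal.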
So authentication and authorization play a critical role when we are developing a resilient API gateway architecture.
Next is traffic management, which helps make sure that there are no delays and that we are meeting the SLAs. Sophisticated rate limiting protects backend services during peak periods, implementing fair scheduling algorithms that prioritize critical operations during high-demand events like course registration. Dynamic routing capabilities enable gradual rollouts of new services with canary deployments and A/B testing configurations. The API gateway serves as a critical reliability component, providing a consistent interface while abstracting the complexity of the underlying systems. This architecture proved particularly valuable during migrations, allowing incremental modernization without disrupting service availability. For any institution this is a large project, and it is better approached in an agile way, implemented in different iterations and phases. When we implement in iterations and phases, it is very important to follow the resilient API gateway architecture, because some systems have been modernized while others are still on legacy platforms. That is why the resilient API gateway architecture becomes very critical during implementation.
Next, moving on to immutable infrastructure for scaling. Infrastructure as code: Terraform modules with specialized configurations for educational workloads, version controlled and peer reviewed, which is critical for infrastructure as code. Automated AMI creation: Packer pipelines producing hardened images with pre-installed monitoring agents and security controls. Pre-deployment validation: automated compliance and vulnerability scanning integrated into CI/CD.
Continuous integration and continuous delivery pipelines are very important. Blue-green deployments: zero-downtime implementation with automated rollback capabilities based on health metrics. This is very important because, during implementation, if anything goes wrong there should be a way to look at the metrics to see whether the implementation was successful or not. If it is successful, that gives you confirmation to roll it out to all users and have them start using the new ERP system. If it does not go right, it is very important that we have a rollback capability and a rollback plan as well. Implementing immutable infrastructure principles transformed our ability to scale during peak periods. Rather than patching or modifying running systems, we deploy new instances from validated images and templates, ensuring consistent configuration and eliminating configuration drift. This approach enables rapid horizontal scaling during high-demand periods like registration and financial aid deadlines. Pre-provisioned capacity plans based on historical usage patterns ensure resources are available when needed, while allowing for efficient cost management during quieter periods.
Now moving on to the next one, which is security as code with continuous compliance. Automated vulnerability scanning is the first item: continuous assessment of infrastructure and application components against education sector threat models and compliance frameworks. Next is policy as code: OPA-based guardrails enforcing security requirements across container orchestration and cloud resources. The third is secrets management: a Vault implementation with dynamic credential generation and automatic rotation for sensitive systems. Next is compliance reporting: automated evidence collection and documentation generation for FERPA, GLBA, and other regulatory requirements.
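To show the general shape of a policy-as-code guardrail, here is a small Python sketch. OPA policies are normally written in Rego, so this is a stand-in in Python rather than the OPA setup described in the talk. The rule names, the resource fields (`encrypted`, `public`, `tags`), and the `admit` helper are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Resource = Dict[str, Any]
Rule = Callable[[Resource], bool]  # a rule returns True when the resource complies

# Hypothetical guardrails; a real deployment would express these in its policy engine.
RULES: Dict[str, Rule] = {
    "storage must be encrypted at rest": lambda r: r.get("encrypted", False) is True,
    "no public network exposure": lambda r: r.get("public", False) is False,
    "data classification tag required": lambda r: bool(r.get("tags", {}).get("data_class")),
}


def violations(resource: Resource) -> List[str]:
    """Return the name of every rule the resource fails, in rule order."""
    return [name for name, rule in RULES.items() if not rule(resource)]


def admit(resource: Resource) -> bool:
    """Admission decision: deploy only when no rule is violated."""
    return not violations(resource)
```

Because the rules live in version control next to the infrastructure code, a provisioning pipeline can call something like `admit` before applying a change and attach the `violations` list to the compliance evidence it collects.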
Security as code practices have fundamentally changed how we approach ERP protection. By codifying security policies and compliance requirements, we have integrated them directly into infrastructure provisioning and application deployment processes. This approach reduced security incidents by 78% while streamlining compliance processes. Continuous validation against frameworks like NIST 800-171 and FERPA requirements ensures systems maintain compliance throughout their lifecycle, with automated remediation of deviations from security baselines.
That was the end of the session. Thank you for joining, and have a wonderful day.

Sanjiv Bhagat

@ Visvesvaraya Technological University


