Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone and welcome to the SRE at 2025.
Today, the topic that I'll be talking about is SRE at scale: building resilient
cloud infrastructure for higher education enterprise resource planning systems.
Welcome to this technical deep dive into site reliability engineering
(SRE) practices for cloud-based enterprise resource planning (ERP) systems
in higher education. Today, we'll explore how applying sophisticated reliability
principles can transform fragmented legacy systems into resilient cloud
architectures that deliver consistent performance during critical periods.
Drawing from real-world implementations across various institutional settings, I
will share the strategies that achieved 99.99% uptime during peak registration
periods and reduced security incidents by 78%.
Let's explore the technical foundations of reliable operations
in educational environments.
So, before we get into the details, let me tell you a little about my background.
My name is Jeev pga.
I studied at Visvesvaraya Technological University.
I have 21 years of experience working on multiple projects around the globe.
Currently, I'm working as a program manager at Amazon in the United
States, in Washington State.
Prior to that, I worked on many projects for the state,
in the higher education domain.
The experience that I have gathered is what I'm going to
share in this presentation.
Next, we are talking about the higher education ERP landscape.
The higher education enterprise resource planning landscape
can be divided into three major areas. First, there are cloud-based systems, which
are the modern containerized solutions.
Then we have hybrid architectures, which bridge
cloud and on-premises.
And then we have the legacy systems, which are the fragmented institutional databases
that we have seen over the years.
We have seen different challenges with
legacy systems: they are fragmented across different campuses, and they
don't share data with other campuses.
Higher education institutions face unique challenges with
their ERP implementations.
Many begin with siloed departmental systems, complex integration needs, and technical debt.
Legacy systems often struggle with modern security requirements and
scaling demands. Security is a big concern in any industry,
but especially in higher education, because
educational institutions collect so much sensitive information
from students and staff.
The trend towards cloud-based ERP solutions offers tremendous
benefits, but it requires careful architectural planning.
Institutions must balance accessibility, compliance requirements, and the
seasonal nature of academic workloads, all while maintaining constant
uptime for critical services.
That is what makes the higher education ERP landscape
challenging for implementing any ERP solution.
In this slide, we are going to talk about the microservices
architecture for educational workloads.
What does this microservices architecture look like?
On the left-hand side, we start with
containerization.
Then we look at orchestration, then
the service mesh, and then the security
layer that has been applied.
Now, what is containerization?
It is Docker-based services with education-specific
configuration, which is done so that users can consume
the information that is being sent.
So if we are working for an educational institution, that configuration covers
things like the logo of the institution and what information needs to be
shared with a particular user.
Orchestration is Kubernetes clusters with automated scaling policies.
Then we have the service mesh, which is Istio-powered traffic management
with encrypted east-west traffic.
How we manage the traffic during peak periods
is what the service mesh handles.
And then we come to the security layer, which is a zero-trust network
policy with identity verification, which is very critical
to higher education.
Breaking monolithic ERP systems into containerized microservices
has proven transformative for educational institutions.
This architecture enables independent scaling of
critical components during registration periods and financial aid processing
season. What we have seen is that during the registration period,
more resources need to be allocated so that students can register.
During financial aid processing,
we also need more resources allocated to those functions.
That is how this architecture helps manage the workload during those
peak periods. We have also implemented
automatic failover configurations that maintain service continuity
even when primary systems experience issues.
So these are the challenges that we have seen in monolithic ERP systems,
and they can be overcome by using this microservices architecture.
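To make the independent scaling idea concrete, here is a minimal Python sketch of the replica calculation that the Kubernetes Horizontal Pod Autoscaler documents: desired replicas equal the current replicas multiplied by the ratio of the observed metric to the target metric, rounded up. The service names, CPU figures, and replica limits are assumptions for illustration, not values from the implementations described in this talk, and the sketch leaves out details such as the autoscaler's tolerance band.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Replica count from the HPA rule, clamped to [min_replicas, max_replicas].

    desired = ceil(current_replicas * current_metric / target_metric)
    """
    if current_replicas <= 0 or target_metric <= 0:
        raise ValueError("current_replicas and target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical registration service: 4 pods at 90% CPU against a 60% target.
print(desired_replicas(4, 90.0, 60.0, min_replicas=2, max_replicas=20))  # 6
# Hypothetical financial aid service in a quiet week: scale down to the floor.
print(desired_replicas(3, 20.0, 60.0, min_replicas=2, max_replicas=20))  # 2
```

Because each service gets its own calculation, registration can grow during its peak while other services stay at their minimum.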
Next, we will look into how we can establish education-specific SLOs.
The SLOs that we need to implement for any higher
education ERP system start with student portal availability.
Student portal availability is key to the business of higher education:
99.99% uptime during registration periods with a 500-millisecond response
time SLO, and lower thresholds during off-peak periods to optimize
resource utilization and cost.
The next is financial system reliability: 99.999%
transaction consistency with dual-phase
commit protocols,
prioritized during financial aid disbursement cycles with
dedicated infrastructure resources.
The next is faculty service performance: grade submissions and
course management operations with a 1.5-second maximum response time
and 99.9% availability, with extended thresholds during non-teaching periods.
Service level objectives for educational ERP systems must reflect the cyclical
nature of academic operations.
We developed custom error budgets that vary throughout the semester,
allowing for more experimental deployments during quiet periods while
enforcing strict change freezes during critical academic events.
It is very important that there are change freeze times so
that changes do not impact the peak periods.
This approach has allowed IT teams to balance innovation with stability,
and our dashboards provide real-time visibility into remaining error budgets,
influencing deployment decisions and maintenance scheduling across
multiple institutional stakeholders.
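As a rough illustration of a semester-aware error budget, the Python sketch below computes the allowed downtime for a window as (1 - target) multiplied by the window length, and gates deployments on the remaining budget and on a change freeze flag. The 99.99% registration target comes from the talk; the 99.9% off-peak target, the window lengths, and the 50% deployment cutoff are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    name: str
    target: float        # availability target, e.g. 0.9999
    minutes: int         # length of the window in minutes
    change_freeze: bool  # True during critical academic events


def error_budget_minutes(window: SloWindow) -> float:
    """Allowed downtime in the window: (1 - target) * window length."""
    return (1.0 - window.target) * window.minutes


def budget_remaining(window: SloWindow, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    budget = error_budget_minutes(window)
    if budget <= 0:
        raise ValueError("target must be below 100% to have an error budget")
    return 1.0 - downtime_minutes / budget


def deploy_allowed(window: SloWindow, downtime_minutes: float,
                   min_remaining: float = 0.5) -> bool:
    """Block deploys during a freeze or when too much budget is spent."""
    if window.change_freeze:
        return False
    return budget_remaining(window, downtime_minutes) >= min_remaining


# Assumed windows: a two-week registration period and a 30-day quiet period.
registration = SloWindow("registration", 0.9999, 14 * 24 * 60, change_freeze=True)
off_peak = SloWindow("off-peak", 0.999, 30 * 24 * 60, change_freeze=False)

for w in (registration, off_peak):
    print(f"{w.name}: {error_budget_minutes(w):.2f} minutes of budget")
print(deploy_allowed(registration, 0.0))  # False: change freeze
print(deploy_allowed(off_peak, 10.0))     # True: most of the budget is left
```

The tighter target leaves only about two minutes of budget for the whole registration window, which is why experimental deployments are pushed into the quiet periods.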
Moving on to the next one, which is the comprehensive observability stack.
What are the different components of the observability stack that we need to look into?
Metric collection: Prometheus-based telemetry with education-specific gauges.
Log aggregation: a centralized ELK stack with PII-aware
filtering. Alerting systems: PagerDuty
integration with role-based escalation paths. Distributed tracing:
a Jaeger implementation tracking cross-service transactions.
Effective observability proved critical for maintaining ERP reliability.
We implemented custom instrumentation for student-facing services,
capturing not just technical metrics but also user experience indicators
like registration completion rates and form submission successes.
Our observability stack
includes specialized monitoring for authentication frameworks.
Given the complex identity management needs in educational settings,
we track federation performance across multiple identity providers
and maintain visibility into single sign-on performance, with
alerts triggered when authentication latency exceeds predefined thresholds.
So it's very important to have thresholds to measure those service SLOs and SLAs.
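The threshold idea for authentication latency can be sketched in a few lines of plain Python. This example computes a nearest-rank 95th percentile per identity provider and reports the providers that exceed a threshold. The provider names, sample values, and the 500 ms threshold are assumptions, and a real deployment would express this as an alerting rule in the monitoring system rather than in application code.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct / 100 * n) in integers
    return ordered[rank - 1]


def latency_alerts(samples_by_idp: dict[str, list[float]],
                   threshold_ms: float, pct: int = 95) -> dict[str, float]:
    """Return identity providers whose pct-th percentile exceeds the threshold."""
    alerts = {}
    for idp, samples in samples_by_idp.items():
        observed = percentile(samples, pct)
        if observed > threshold_ms:
            alerts[idp] = observed
    return alerts


# Hypothetical login latencies in milliseconds for two identity providers.
samples = {
    "campus-sso": [100.0] * 19 + [900.0],               # one slow login in 20
    "research-federation": [100.0] * 18 + [900.0] * 2,  # two slow logins in 20
}
print(latency_alerts(samples, threshold_ms=500.0))
# {'research-federation': 900.0}
```

Using a percentile rather than an average keeps a single slow login from paging anyone while still catching a provider that is slow for a meaningful share of users.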
Next, we are looking into chaos engineering in the educational context.
First are the game day exercises: structured failure
simulations with cross-functional teams during maintenance windows.
Next is automated failure injection: programmatic infrastructure and network
degradation during non-critical periods.
Next is scheduled resilience testing: quarterly comprehensive system
tests with increasing complexity.
Next is documentation updates: runbooks and recovery procedures
refined after each exercise.
These steps are critical because, even after
implementation, these tests need to be done to make sure
that when we are going through the peak, the critical period of
registration and financial aid awarding,
the system is able to meet the needs of
the educational institutions.
Chaos engineering practices have proven invaluable for
validating ERP resilience.
We designed experiments that simulate realistic failure
scenarios, including database unavailability during registration
periods, identity provider outages,
and network partitions between services.
These controlled exercises revealed subtle dependencies and failure modes that
weren't apparent in architectural reviews.
By systematically injecting failures, we identified and remediated
critical vulnerabilities before they affected production systems.
Each chaos experiment resulted in improved recovery procedures and
more resilient system design.
So that is why chaos engineering in the educational context
is very relevant as well.
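A very small version of programmatic failure injection can be sketched in Python as a wrapper that fails a call with a configured probability and records whether the caller fell back gracefully. Everything here is hypothetical (the `FaultInjector` class, the enrollment functions, the 30% failure rate); real experiments of the kind described in the talk would inject faults at the infrastructure or network level rather than inside application code.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""


class FaultInjector:
    def __init__(self, failure_rate: float, enabled: bool, seed=None):
        self.failure_rate = failure_rate
        self.enabled = enabled          # keep False outside approved windows
        self._rng = random.Random(seed)

    def call(self, func, *args, **kwargs):
        """Run func, or raise InjectedFailure with probability failure_rate."""
        if self.enabled and self._rng.random() < self.failure_rate:
            raise InjectedFailure(func.__name__)
        return func(*args, **kwargs)


def run_experiment(injector, operation, fallback, attempts: int) -> dict:
    """Count how often the caller got a real answer versus the fallback."""
    outcome = {"ok": 0, "degraded": 0}
    for _ in range(attempts):
        try:
            injector.call(operation)
            outcome["ok"] += 1
        except InjectedFailure:
            fallback()
            outcome["degraded"] += 1
    return outcome


def lookup_enrollment():        # hypothetical database-backed call
    return {"student": "s-001", "status": "enrolled"}


def cached_enrollment():        # hypothetical read-only fallback
    return {"student": "s-001", "status": "unknown (cached)"}


game_day = FaultInjector(failure_rate=0.3, enabled=True, seed=42)
print(run_experiment(game_day, lookup_enrollment, cached_enrollment, attempts=100))
```

The `enabled` flag mirrors the point about running these exercises only during maintenance windows and non-critical periods.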
Next is automated remediation workflows.
Anomaly detection is the first step.
Next is automated diagnosis, and then self-healing actions.
What is anomaly detection?
It is machine-learning-powered identification of deviations from a baseline.
Automated diagnosis is pattern recognition with contextual analysis,
and the self-healing actions are Kubernetes operators
implementing recovery patterns.
Our implementation of automated remediation workflows reduced mean
time to recovery (MTTR) by 65%.
We developed a tiered approach to automation.
Level one remediations execute automatically without human intervention,
while more complex level two and level three actions require progressive
levels of approval and oversight.
The system leverages machine learning to detect anomalies specific to educational
workload patterns that precede incidents. When potential issues are detected,
self-healing mechanisms can execute predefined playbooks, from simple pod
restarts to complex database failovers and data integrity verifications.
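The two ideas in this part, baseline deviation and tiered remediation, can be sketched together in Python. Note the substitution: a plain z-score against recent history stands in for the machine learning models mentioned in the talk, and the playbook names, their levels, and the rule of one approval per level above one are assumptions made for the example.

```python
from statistics import mean, stdev


def is_anomalous(baseline: list[float], value: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag value if it is more than z_threshold deviations from the baseline.

    Needs at least two baseline points, because stdev is undefined for one.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Hypothetical playbooks mapped to the remediation level they belong to.
PLAYBOOKS = {
    "pod_restart": 1,            # level one: runs without human intervention
    "database_failover": 2,      # level two: needs one approval
    "data_integrity_repair": 3,  # level three: needs two approvals
}


def remediate(action: str, approvals: int) -> str:
    """Execute an action only if it has the approvals its level requires."""
    required = PLAYBOOKS[action] - 1
    return "executed" if approvals >= required else "pending_approval"


# Requests per second over recent minutes, then a sudden spike.
history = [100.0, 102.0, 98.0, 101.0, 99.0]
if is_anomalous(history, 150.0):
    print(remediate("pod_restart", approvals=0))        # executed
    print(remediate("database_failover", approvals=0))  # pending_approval
```

Keeping the approval requirement in data rather than in code makes it easy to review which actions are allowed to run unattended.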
Next is the resilient API gateway architecture.
First, legacy system integration.
Our API gateway implementation includes specialized adapters for legacy
systems with protocol translation and data transformation capabilities.
These adapters maintain backward compatibility while enabling
modern security practices.
Circuit breakers prevent cascading failures when legacy components experience
issues, gracefully degrading functionality rather than allowing complete outages.
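Here is a minimal sketch of the circuit breaker pattern in Python, assuming a failure threshold of three consecutive errors and a 30-second reset timeout (both illustrative values). The `legacy_transcript_lookup` and `degraded_response` functions are hypothetical stand-ins for a legacy adapter call and its degraded fallback.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, fallback):
        """Call func unless the circuit is open; use fallback on failure."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()           # open: skip the legacy call
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            return fallback()
        self.failures = 0
        self.opened_at = None                   # success closes the circuit
        return result


def legacy_transcript_lookup():     # hypothetical legacy adapter call
    raise ConnectionError("legacy system unavailable")


def degraded_response():            # hypothetical degraded fallback
    return {"transcript": None, "notice": "Transcripts are temporarily unavailable"}


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
for _ in range(5):
    print(breaker.call(legacy_transcript_lookup, degraded_response))
```

After the third failure the breaker stops calling the legacy system at all, so the remaining requests get the degraded response immediately instead of piling up behind a struggling backend.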
Next is authentication and authorization.
The gateway handles the complex authentication flows common in
educational settings, including SAML federations with research institutions,
OAuth integration with cloud providers, and legacy authentication methods.
Fine-grained authorization policies enforce appropriate access
controls while maintaining audit trails for compliance requirements.
So authentication and authorization play a critical
role when we are developing a resilient API gateway architecture.
Next is traffic management,
which helps make sure that there are no delays and that we are meeting the SLAs.
Sophisticated rate limiting protects backend services during peak periods,
implementing fair scheduling algorithms that prioritize
critical operations during high-demand
events like course registration. Dynamic routing capabilities enable gradual
rollouts of new services with canary deployments and A/B testing configurations.
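One simple way to prioritize critical operations under a rate limit is a token bucket that holds back part of its capacity for critical requests. The Python sketch below shows that idea; it is a simplification, not a full fair scheduling algorithm, and the capacity, refill rate, and reserve size are assumed values.

```python
import time


class PriorityTokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float,
                 reserved_for_critical: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.reserved = reserved_for_critical
        self.clock = clock
        self.tokens = capacity
        self._last = clock()

    def _refill(self) -> None:
        """Add tokens for the time elapsed since the last call, up to capacity."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self._last) * self.refill_per_sec)
        self._last = now

    def allow(self, critical: bool) -> bool:
        """Spend one token; non-critical calls cannot dip into the reserve."""
        self._refill()
        floor = 0.0 if critical else self.reserved
        if self.tokens - 1.0 >= floor:
            self.tokens -= 1.0
            return True
        return False


# A fixed clock keeps the demo deterministic (no refill happens).
bucket = PriorityTokenBucket(capacity=5, refill_per_sec=1.0,
                             reserved_for_critical=2, clock=lambda: 0.0)
print([bucket.allow(critical=False) for _ in range(4)])  # [True, True, True, False]
print([bucket.allow(critical=True) for _ in range(3)])   # [True, True, False]
```

In this setup a burst of catalog browsing can use at most three of the five tokens, so course registration requests still get through when the bucket is under pressure.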
The API gateway serves as a critical reliability component,
providing a consistent interface while abstracting the complexity
of underlying systems.
This architecture proved particularly valuable during migrations, allowing
incremental modernization without disrupting service availability.
For any institution, an ERP migration is
a large project, and it is better approached in an agile way.
With an agile approach, the migration is implemented in
different iterations and phases.
When we are implementing in iterations and phases, it is very
important that we follow the resilient API gateway architecture, because some
of the systems have been modernized while some are still on legacy systems.
That is why this resilient API gateway architecture
becomes very critical during implementation.
Next, moving on to immutable infrastructure for scaling.
Infrastructure as code: Terraform modules with specialized
configurations for educational workloads, version controlled and
peer reviewed, which is critical for infrastructure as
code. Automated AMI creation:
Packer pipelines producing hardened images with pre-installed monitoring agents and
security controls.
Pre-deployment validation: automated compliance and vulnerability
scanning integrated into CI/CD
(continuous integration and continuous delivery)
pipelines.
Blue-green deployments:
zero-downtime implementation with automated rollback capabilities
based on health metrics.
This is very
important because, during implementation, if anything goes wrong,
there should be a way to look into the different metrics to see whether the
implementation was successful or not.
If it is successful, that gives you the confirmation to roll it out to
all the users and have them start using the new ERP system.
If it does not go right, then it's very important that we have a rollback
capability and a rollback plan as well.
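The blue-green decision described here, promote if the health metrics look good and otherwise roll back, can be sketched in Python as follows. The 500 ms latency limit echoes the student portal SLO mentioned earlier in the talk; the 1% error rate limit, the environment names, and the `probe` callback are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HealthMetrics:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float


def is_healthy(m: HealthMetrics, max_error_rate: float = 0.01,
               max_p95_latency_ms: float = 500.0) -> bool:
    """Healthy means both the error rate and the latency are within limits."""
    return m.error_rate <= max_error_rate and m.p95_latency_ms <= max_p95_latency_ms


def cut_over(live: str, candidate: str,
             probe: Callable[[str], HealthMetrics]) -> tuple[str, str]:
    """Route traffic to candidate, then keep it or roll back based on health."""
    previous, live = live, candidate      # switch traffic to the new stack
    if is_healthy(probe(live)):
        return live, "promoted"
    return previous, "rolled_back"        # the old stack is untouched: switch back


# Hypothetical probes returning metrics observed after the switch.
print(cut_over("blue", "green", lambda env: HealthMetrics(0.001, 320.0)))
# ('green', 'promoted')
print(cut_over("blue", "green", lambda env: HealthMetrics(0.08, 320.0)))
# ('blue', 'rolled_back')
```

Rollback is cheap in this model because the previous environment is never modified; it only stops receiving traffic.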
Implementing immutable infrastructure principles transformed our ability
to scale during peak periods.
Rather than patching or modifying running systems, we deploy new instances
from validated images and templates, ensuring consistent configuration
and eliminating configuration drift.
This approach enables rapid horizontal scaling
during high-demand periods like registration and financial aid deadlines.
Pre-provisioned capacity plans based on historical usage patterns ensure
resources are available when needed, while allowing for efficient
cost management during quieter periods.
Now moving on to the next one, which is security as code
with continuous compliance.
Automated vulnerability scanning is the first one: continuous
assessment of infrastructure and application components
against education sector threat models and compliance frameworks.
Next is policy as code: OPA-based guardrails enforcing security
requirements across container orchestration and cloud resources.
The third one is secrets management: a Vault implementation with
dynamic credential generation and automatic rotation for sensitive systems.
Next is compliance reporting: automated evidence collection and
documentation generation for FERPA, GLBA, and other regulatory requirements.
Security-as-code practices have fundamentally changed how we approach
ERP protection. By codifying security policies and compliance requirements,
we have integrated them directly into infrastructure provisioning and
application deployment processes.
This approach reduced security incidents by 78% while streamlining compliance
processes. Continuous validation against frameworks like NIST 800-171
and FERPA requirements
ensures systems maintain compliance throughout their lifecycle,
with automated remediation of deviations from security baselines.
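To give a feel for policy as code, here is a small Python sketch of a guardrail check that could run before a deployment is admitted. It is not OPA and not written in Rego; it only illustrates the idea of expressing rules as reviewable, testable code. The four rules, the `data-classification` label, and the example image names are all assumptions.

```python
def check_deployment(spec: dict) -> list[str]:
    """Return guardrail violations for a deployment spec (empty list means pass)."""
    violations = []
    if "data-classification" not in spec.get("labels", {}):
        violations.append("missing data-classification label")
    for c in spec.get("containers", []):
        name = c.get("name", "<unnamed>")
        # Look for a tag only after the last "/", so a registry port is ignored.
        tag_part = c.get("image", "").rsplit("/", 1)[-1]
        if ":" not in tag_part or tag_part.endswith(":latest"):
            violations.append(f"{name}: image must be pinned to a specific tag")
        if c.get("privileged", False):
            violations.append(f"{name}: privileged containers are not allowed")
        if not c.get("resources", {}).get("limits"):
            violations.append(f"{name}: resource limits are required")
    return violations


# Hypothetical specs: one compliant, one that breaks every rule.
compliant = {
    "labels": {"data-classification": "student-records"},
    "containers": [{
        "name": "registration",
        "image": "registry.example.edu/sis/registration:1.4.2",
        "resources": {"limits": {"cpu": "500m", "memory": "512Mi"}},
    }],
}
risky = {
    "labels": {},
    "containers": [{"name": "debug", "image": "debug-tools:latest", "privileged": True}],
}

print(check_deployment(compliant))  # []
for violation in check_deployment(risky):
    print(violation)
```

Because the rules are plain code, they can be version controlled and peer reviewed in the same way as the infrastructure modules.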
That brings us to the end of the session.
Thank you for joining this session.
Have a wonderful day.
Thank you.