Conf42: Site Reliability Engineering 2022


Transformation & Cultural Shift using Site Reliability Engineering (SRE) & Data Science

Rohit Sinha
Director, Cloud Application Services @ T-Systems International

Transforming an elephant-style monolithic organization is a difficult task. In my current role, leading the transformation to an SRE culture is a challenge; it is a journey, and as an organization we will continue to evolve. The SRE implementation did not focus only on creating a single SRE team for the organization. Instead, a bootstrap concept was introduced: a central SRE Bootstrap group guides the SREs who sit in each and every unit (every team). This has been effective in embedding SRE in the day-to-day working of the organization, and that is when the wheel starts turning. As a result, every tribe across the organization now has a designated SRE setup.

AI/ML is used to analyze ITSM data as well as infrastructure log data, so that SREs are armed with the best available predictions for the infrastructure and its components and can take the best decisions. This has reduced cost and has also helped ensure that legacy tooling is slowly shaved away. It has been a data-driven approach to a systematic rollout of the SRE setup.

An in-house tooling culture was made part of the release train to keep the demand and the interest from the operations team intact. The tooling could not be changed overnight; it had to accommodate the old and the new to ensure no service disruption. This was achieved with an architecture that combines old and new, ensuring that the old automation setup keeps working while new-age DevOps-based tooling and API-based solutions are integrated as well. The prediction-based setup would definitely change how ITIL looks at operations forever, as in future there would be no static thresholds and the incident-based setup would be gone. (More about it later ;))

Problems are plenty, especially for an organization that provides services: revenue is defined by each big customer, and sometimes the units that cater to a particular customer start acting like individual companies and try to define their own standards and ways of working.

While an SRE rollout is not unique in the industry, this setup is. Its uniqueness lies in the use of AI/ML on ITSM and infrastructure data: prediction from the logs of different system parameters, combined with a correlation matrix of close to 30 parameters of a machine. This in-house-built setup, combined with other parameters, gives SREs an edge to do what they are good at. The analysis is there; it is about applying the principles. It was also unique because monolithic organizations look for an immediate benefit, and we were able to provide one by combining a restructuring of the internal services with ensuring that SRE was not just introduced as a top-down concept but established as a designated SRE culture at every tribe level. This was orchestrated by a group of the best SREs (the SRE Bootstrap), who provided constant guidance to the tribe-designated SREs and also led the bigger tooling and AI-based initiatives.

The prediction-based setup would definitely change how ITIL looks at operations forever, as in future there would be no static thresholds and the incident-based setup would be gone. This setup works well because, culturally, we are changing and embedding SRE at every team level. It had its own set of challenges and pushback, which were part of the learning.

Modernization using SRE - The detailed setup is an amalgamation of SRE and an AIOps-based structure. While the SRE part is more about following the culture and principles, the analytics-based technical setup provides SREs with the ammunition to be proactive and predictive in problem solving. This thought process covers the two most important principles: monitoring (advanced) and simplicity (an underlying horizontal). The emphasis is on visibility engineering, to ensure everything can be seen, tracked, predicted, and controlled. This is not limited to the landscapes and the ecosystem; it is being used for ITSM components as well. Process simplification has been kick-started based on data analysis of where time is spent and how it can be reduced, which ties back to another principle, elimination of toil.

Toolchain Setup Architecture: The toolchain is a unique amalgamation of the following areas

  1. ITSM automation -> Analytics on ITSM/ITIL-based processes
  2. Monitoring (analysis of data) -> Visibility engineering
  3. Advanced monitoring (AIOps)
  4. Automation modernization

Analytics is a great weapon for an ITIL-based setup, for example to see where the service spends most of its time, or why a particular type of change gets stuck in a particular stage. An incident-based, machine-learning-based setup can provide trends, patterns, correlation, and automatic error categorization based on historic data; this helps to resolve issues faster and to know what is coming and what pattern is being followed. For us this was the first stage to reach, as the correlation and the detail give instant benefits. An element of visibility engineering was added to ensure SREs benefit from it and are not always drowning in Excel sheets and reports. This also helped them in determining error budgets and embracing risk, as they have first-hand, clear-cut data analysis available.
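
As an illustration of the kind of ITSM analysis described here, the following is a minimal sketch that totals the hours change tickets spend in each stage and reports the slowest one. The ticket IDs, stage names, timestamps, and function names are all hypothetical; a real setup would read this history from an export of the ITSM tool.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical change-ticket history: (ticket id, stage, entered, left).
TRANSITIONS = [
    ("CHG-1", "review", "2022-03-01T09:00", "2022-03-01T11:00"),
    ("CHG-1", "approval", "2022-03-01T11:00", "2022-03-02T11:00"),
    ("CHG-1", "implementation", "2022-03-02T11:00", "2022-03-02T14:00"),
    ("CHG-2", "review", "2022-03-03T08:00", "2022-03-03T09:00"),
    ("CHG-2", "approval", "2022-03-03T09:00", "2022-03-04T15:00"),
]


def hours_per_stage(transitions):
    """Sum the hours spent in each stage across all tickets."""
    totals = defaultdict(float)
    for _ticket, stage, entered, left in transitions:
        delta = datetime.fromisoformat(left) - datetime.fromisoformat(entered)
        totals[stage] += delta.total_seconds() / 3600
    return dict(totals)


def slowest_stage(transitions):
    """Return the stage with the largest total time, i.e. where changes get stuck."""
    totals = hours_per_stage(transitions)
    return max(totals, key=totals.get)


if __name__ == "__main__":
    for stage, hours in sorted(hours_per_stage(TRANSITIONS).items()):
        print(f"{stage}: {hours:.1f} h")
    print("slowest stage:", slowest_stage(TRANSITIONS))
```

In this made-up data the approval stage dominates, which is the kind of finding that would point process simplification at a specific step.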

Visibility engineering takes a further step when we take it in the direction of log analytics: the power to use ARIMA (just as an example) for prediction on infrastructure data. For example, from the logs of memory utilization for a server, what would happen in the next 3-5 days? The data is captured at 5-minute intervals, and around 20 parameters are correlated to arrive at a matrix to predict failure.
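
To make the forecasting idea concrete, here is a minimal sketch of one of the simplest ARIMA-style models, ARIMA(1,1,0) with a constant term, fitted by least squares on the first differences of a memory-utilization series. It is an illustration under assumptions, not the model used in the setup described here: the function names and sample values are hypothetical, and a production setup would more likely use a statistics library with a properly selected model order.

```python
from statistics import fmean


def fit_ar1_on_differences(series):
    """Fit d_t = c + phi * d_(t-1) by least squares, where d is the first difference."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    if len(diffs) < 3:
        raise ValueError("need at least 4 observations")
    x, y = diffs[:-1], diffs[1:]
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        # Constant differences: a pure linear trend, no autoregressive part.
        return mean_y, 0.0
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = cov_xy / var_x
    c = mean_y - phi * mean_x
    return c, phi


def forecast(series, steps):
    """Forecast the next `steps` values by iterating the fitted difference model."""
    c, phi = fit_ar1_on_differences(series)
    level = series[-1]
    last_diff = series[-1] - series[-2]
    predictions = []
    for _ in range(steps):
        last_diff = c + phi * last_diff
        level += last_diff
        predictions.append(level)
    return predictions


if __name__ == "__main__":
    # Hypothetical memory utilization (%) sampled every 5 minutes.
    memory_pct = [61.0, 61.5, 62.5, 63.0, 64.0, 64.5, 65.5, 66.0]
    # With 5-minute samples, 3 days ahead would be 3 * 24 * 12 = 864 steps.
    print(forecast(memory_pct, steps=6))
```

A forecast like this only becomes useful once it is compared against a capacity limit, for example by flagging the first forecast step at which memory utilization would cross a chosen ceiling.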

A key element is automation, which is considered the last part of the chain, but the setup cannot be purely legacy if it is to attain the best result. It has to be a combination: the entire old house cannot be torn down to build a new one, as that is too much disruption, so a design has to be put in place to accommodate old and new. This can be seen in the use of modern open-source tools while we still ensure the setup supports the old tooling. This makes the setup extremely integrable.

Data Science - AI/ML/IoT: The SRE principles revolve around monitoring, automation, toil reduction, and so on, and one of the potent weapons in that respect is advanced monitoring using an analytics-based setup, together with automatic decision making (a self-healing scenario), all of which contribute to toil reduction and further streamlining of the service. A lot of data is generated by configuration items (CIs) in the IT ecosystem; while some of it is used for finding the root cause, a lot of it can be used for prediction, finding areas of improvement, and so on. The last thing we want is for SREs to aim in the dark by going through a plethora of data and scratching through a number of Excel sheets. The data they analyze can be further simplified by using AI/ML and data-science-based solutions, which not only save time for SREs but also ensure that SREs focus on real problem solving rather than on searching for what the actual problem is.

The AIOps-based setup ensures SREs get a lot of information beforehand so they can make quick decisions. While trends and patterns can certainly be derived from the historic data, a key aspect is predictability. The proactiveness comes if we provide out-of-the-box solutions to SREs that help them predict problems and show them the trends and patterns they are looking for.

One basic example is infrastructure failure prediction. This can be based on log-based prediction analysis combined with a correlation matrix of each and every parameter coming out of the data. This way, with a certain accuracy, the prediction mechanism can show whether the infrastructure would fail in the near future. This cannot be achieved by incident-based and alert-based analysis alone, as these rely on static thresholds; a more dynamic, mathematical, moving-average-based threshold needs to be defined, which can predict the breakdown with more accuracy.
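
The contrast between a static and a moving-average-based threshold can be sketched as follows. This is a minimal illustration under assumptions, not the setup's actual detection logic: the window size, the multiplier `k`, the `min_band` floor, and the sample values are all hypothetical choices.

```python
from collections import deque
from statistics import fmean, pstdev


def static_alerts(values, limit):
    """Classic static threshold: alert whenever a value exceeds a fixed limit."""
    return [i for i, v in enumerate(values) if v > limit]


def dynamic_alerts(values, window=12, k=3.0, min_band=1.0):
    """Alert when a value exceeds the rolling mean plus k rolling standard deviations.

    The threshold is recomputed from the previous `window` samples, so it follows
    the normal level of the metric. `min_band` keeps the band from collapsing to
    zero when the recent history is perfectly flat.
    """
    history = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(values):
        if len(history) == window:
            upper = fmean(history) + max(k * pstdev(history), min_band)
            if value > upper:
                alerts.append((i, value, upper))
        history.append(value)
    return alerts


if __name__ == "__main__":
    # Hypothetical CPU utilization (%): a stable pattern around 51, then a jump to 70.
    cpu_pct = [50, 52, 50, 52, 50, 52, 50, 52, 50, 52, 50, 52, 51, 70, 51]
    print("static (limit 80):", static_alerts(cpu_pct, limit=80))
    print("dynamic:", dynamic_alerts(cpu_pct))
```

In this made-up series the jump to 70 never reaches a static limit of 80, so a static rule stays silent, while the dynamic rule flags it because it is far outside the recent behavior of the metric.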

In the long run this would modify the way operations is done, as the entire ITIL process would require a change. The concept of static thresholds (the alert-to-incident mechanism) would be gone, and there would be no incidents, as prediction would take care of a lot of problems. Even more, with elastic and hyperscale infrastructure setups, moving averages and dynamic thresholds would be the future. This would change the way operations is done today by moving away from an incident-based setup. This adoption will take time, as it requires changing the way we work in the kitchen, but some useful use cases (UCs), like triaging based on machine learning, are definitely being implemented.

New addition: the definition and usage of error budgets, and how a proper, guideline-based error budget can help.
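
As a worked illustration of the error budget idea, the sketch below uses the common SRE definition that the error budget is the unreliability an SLO allows (1 minus the SLO) over a time window. The 99.9% SLO, the 30-day window, the function names, and the freeze rule are hypothetical examples, not guidelines taken from the setup described here.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO given as a fraction."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def release_decision(slo, downtime_minutes, window_days=30):
    """Hypothetical guideline: freeze risky releases once the budget is used up."""
    if budget_remaining(slo, downtime_minutes, window_days) <= 0:
        return "freeze"
    return "release"


if __name__ == "__main__":
    slo = 0.999  # 99.9% availability over 30 days
    print(f"budget: {error_budget_minutes(slo):.1f} minutes")
    print(f"remaining after 21.6 min downtime: {budget_remaining(slo, 21.6):.0%}")
    print("decision after 50 min downtime:", release_decision(slo, 50))
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime; tying release decisions to how much of that is left gives teams a shared, data-based rule for embracing risk.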
