Conf42 Site Reliability Engineering 2021 - Online

Microservices above the Cloud - Designing the International Space Station for Reliability


The International Space Station has been orbiting the Earth for over 20 years. It was not launched fully formed, as a monolith in space. Instead, it is built out of dozens of individual modules, each with a dedicated role - life support, engineering, science, commercial applications and more. Each module (or container) functions as a microservice, adding additional capabilities to the whole. Not only do the modules need to function together, delivering both functional and non-functional capabilities, they were designed, developed and built by different countries on Earth and once launched into space (deployed in multiple different ways), had to work together - perfectly.

Despite the many (minor) reliability issues which have occurred over the decades, the ISS remains a highly reliable platform for cutting edge scientific and engineering research.

In this session I will describe the way the space station was developed and the lessons Site Reliability and DevOps Engineers can learn from it.


  • Robert Barron is an AIOps and site reliability engineering solution architect at IBM. He explains how the International Space Station is similar to the microservices we are developing today.
  • In the early days of space exploration, missions went out into space to see what it was like. Modern space stations are modular and are constructed in stages. The 1998 blueprint shows where the International Space Station started out.
  • CIMON is an independently flying assistant that can keep up with an astronaut and assist him or her with whatever they're doing. The International Space Station didn't actually start in 1998. International cooperation between countries was the thing that made the space station happen. The smallest things can cause the largest headaches.
  • Monoliths are simpler, even if they might be wasteful and more expensive in the long term. Technical debt is the biggest problem that we have in the industry. Technology is cool, but the business and the politics of the business are vital.
  • IBM is working together with NASA, HP, and other partners to deploy a unique version of cloud computing far, far above the cloud. Two links lead down the rabbit hole into a lot of further information. Thank you and enjoy the rest of the conference.


This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud. Hello everyone, and thank you for joining me here today at Conf42 Site Reliability Engineering. My name is Robert Barron, and I'm an AIOps and site reliability engineering solution architect at IBM. What that means is basically that I help IBM's clients adopt SRE for themselves. Now, it's not very often that one's hobbies align with one's professional interests. And in this case, I'm very lucky, because the history of space exploration is something that I've always been fascinated with. What I'm going to do is take you on a little historical storytelling expedition to show you how the International Space Station is actually very similar to the microservices that we are developing today, except that some of them are in the cloud and some of them are high above the cloud. So let's start off by thinking about why we need a space station in the first place. In the early days of space exploration, the 60s and early 70s, we were basically going out into space, seeing what it's like, seeing what we can do there, seeing how we could function in space. Only later did we start actually working in space, doing not merely exploration but also productive activities that would bring resources back to Earth. So this is very similar to the early days of development versus actually getting into a production environment and generating additional value beyond what we've put in. So another way of looking at it is a space mission.
For example, going to the moon is basically launching the spacecraft, going through a process of deployment, reaching our target, doing whatever we want to do there momentarily, and returning. It's very much a CI/CD pipeline of deploying a new feature, where what we're concentrating on is the process of deployment itself and not so much on what we're doing with what we've deployed, because in the context of a space flight, it's done. So the big difference between a spacecraft and a space station is that a spacecraft is a temporary activity, whereas a space station is a permanent presence. We've got multiple crews doing multiple things in the space station over time, replacing themselves and modifying the space station itself, as opposed to the spacecraft, which is one thing that we develop, deploy, and get back. So we can look at a spacecraft as a stateless process. You can either look at it as a CI/CD process, or you can look at it as a function that is doing something. But a space station is a full application. It's got a lot of data in it. It's very stateful. It's something where, if there's a problem with it, we can't just say, okay, we're going to try again next time, because we've invested so much in it. We need it to work. We're going to retry after our failures, not retry from the start as we did with a failed spacecraft mission. For example, in the famous Apollo 13 disaster, an explosion on the way to the moon, they didn't recover Apollo 13 itself. They replicated its mission in a future Apollo mission, Apollo 14. Now, if we look at the space stations, we can see that at least three generations of space stations were developed. The first ones, in the 1970s, were monolithic space stations. The entire station was launched at once into space. In many cases, it couldn't be reprovisioned, and once a few missions were performed, that was the end of the space station.
Salyut 6 and 7 were transitional space stations, where a central station was launched and various sidecar components were added, which gave additional capabilities, especially in engineering experimentation and scientific collection. More modern space stations, beginning with Mir from the 1980s and continuing with the ISS and Tiangong today, are very modular space stations, which you construct in stages. There were nearly 50 flights, both of the space shuttle and of regular rockets, which launched various components into space to build up the International Space Station. The modules have been moved around to recalibrate them and put them in better positions for whatever work they need to do, and sometimes modules become obsolete and are replaced. Now, let's look at America's first space station, Skylab. Skylab was launched using the same technology that got the Americans to the moon, the Saturn V. It was actually the top third of the Saturn V that was transformed into the Skylab space station. It was so large that they actually had space inside to test a jetpack. The entire Skylab was launched at once, with all the scientific equipment and all the supplies that they needed, everything in one large monolith. Just to illustrate the internal size of Skylab, you can see the astronauts exercising, running on a treadmill that was the inside of the space station itself. The problem is, of course, that there's a lot of empty space in a space station like this. Whereas if you look at the International Space Station, while it is much larger overall than Skylab, famously compared to the size of a football field, you can see that each of its components is actually much smaller than the large mass, the large monolith, that Skylab was. These pieces fit together, each of them with its own role, its own goal, its own targeted mission. But each of them is, in itself, much smaller than Skylab was. While the station is larger, there's a lot less open space.
It's a lot less roomy than Skylab was, and that's because it was developed in a modular fashion to be brought up piece by piece, starting off with the engineering components, then adding more and more scientific and engineering exploration capabilities as time goes by. So this is the blueprint, number one, from 1998, where the space station started out. The first component was launched in 1998, and the station was only completed in 2011. This short film shows us the various components. Each additional component that you see here is another launch of the space shuttle or another launch of a rocket. You can see that pieces are being added step by step. Pieces are being moved from location to location because, for example, the solar panels start off in the center of the space station when there's not a lot of requirement for power. But as we need more power, more solar panels are added, and they are reconfigured into different places so that the station remains balanced. And if you have time to read the names of these components, you can see that we have more and more scientific components being added. We have more and more components which have commercial applications, allowing ground-based companies to add their own exploration payloads to the space station over time. The first components, the original core of the space station, were all the life support and engineering components that were required. Unlike the monolith of Skylab, each of the components you see here has a dedicated goal. It can be the Zvezda service module, which holds much of the engineering, life support, and functional capabilities of the space station. Or it can be the Destiny or Columbus scientific laboratories, which perform scientific experiments. Some components are laser-focused on specific things, such as the solar panels, the robotic arms, or the airlocks, which cannot be repurposed for anything else.
But other components do have flexibility, especially since the station is filled with standard payload racks, which means that new scientific experiments or technical tests can be brought up on spacecraft to the station and replace the older ones. It's quite remarkable how similar the space station is today to the design of over 20 years ago. Most of the components which were decided on in 1998 do exist in some form or another. Others, such as a dedicated living area, do not. Along the way, they decided that there was no necessity for an entire component just for astronauts to sleep in, and the astronauts sleep in various areas that they found within the space station. I'd like to go into a number of resiliency use cases so we can see how the station operates day to day, and what can be more natural than the oxygen that the astronauts breathe? Just to be on the safe side, there are multiple redundant and complementary oxygen solutions. The first one, Elektron, which is what the station started with in 1998, was based on the system from the 1980s Mir space station, which preceded the International Space Station. It converts water into oxygen. However, it does have a technical byproduct, which can cause clogging and other issues in the system. This is technical debt that has been plaguing the station since the very beginning. In 2006, another system was added, called the Oxygen Generation System, which uses the same general idea to convert water to oxygen, but the byproduct that's created requires a lot less maintenance. And a new system from 2018 uses a completely different solution, converting carbon dioxide to oxygen. Not only that, it can also create more water for Elektron and the Oxygen Generation System.
So we actually see here a progression: starting off with a system that we know works but has technical debt, then another system which improves on it, and a third system which is now eliminating the technical debt completely, not solving the problem by creating a better or simpler byproduct, but completely changing the mechanism used to create oxygen. That means that not only will the problem be solved more easily, but it won't come up in the first place. When there are problems and these systems don't work, there are emergency oxygen sources. You can see on the right here chemical bottles that are used to create oxygen, or even simple bottled oxygen, which is found in the station or in docked spacecraft. Despite a number of issues with the oxygen generation systems, primarily with Elektron, because it's based on the oldest technology, there's never been a severe problem with the oxygen, with the health and the breathing of the astronauts, throughout the more than 20 years that the station has been working. However, there are technical debts in the system. Elektron is supposed to generate over half the oxygen for the space station, and it is very old technology. It's very difficult to find experts on Earth who are still familiar with the system. Also, due to the design of the Russian part of the space station, where the pieces are less modular than on the American side, it's much more difficult to replace the components, which is why the new solutions, especially the ESA solution, are coming in and will generate more and more of the station's oxygen as time goes by. So here's an interesting edge case: the spacesuits used to walk in space. Every spacewalk is pre-planned to the very last detail, including which astronauts are going to be on the spacewalk. One of the reasons for this is that you need to customize the two-piece suit to the size of the astronaut.
An astronaut might want a medium upper and a large lower, or a small lower and a large upper, or any other combination that will suit their size. Now, the ISS only has a limited set of pieces of these different spacesuits. In 2019, there was a launch failure, which meant that the astronaut who was planned to go on the spacewalk didn't reach the space station in time. They still wanted to do the spacewalk, but then they discovered that the scheduled astronauts, two women, would not be able to build two spacesuits in the sizes that they needed. So the spacewalk was postponed again until the right size of spacesuit could be sent up to the space station for them to build two spacesuits which suited them. The fact of the matter is that because most of the astronauts were men, most of the spare pieces of a spacesuit were sized larger than the two astronauts who were then scheduled to do the spacewalk. While the image we have of an astronaut is that of a superhuman who can do anything, we would like to give them a hand. One of the most interesting components on the ISS is CIMON, an independently flying assistant that can keep up with an astronaut and assist him or her with whatever they're doing. This can range from showing documentation or a troubleshooting manual to broadcasting music for the astronaut. CIMON can keep track of the astronaut and position itself so that it's easy for the astronaut to read the document CIMON is displaying. Over the years, the computers we've been able to launch into space have become more powerful, and the network speeds are faster. In fact, while CIMON has a powerful processor of its own, most of the work, especially the AI analysis, is offloaded and executed by Watson on the IBM Cloud, hundreds of kilometers below the station. While we've discussed a number of the technical things which happen in the space station, there are also a couple of procedures that we should be aware of.
The space station didn't actually start in 1998. It was first proposed in 1969, but it got bogged down in budgetary and political issues; it was announced in 1984 and canceled in 1993. Nothing actually happened with the space station for decades, except a lot of talking and a lot of money wasted on designing in place instead of constructing. What did work? The International Space Station. Adding the twist of international cooperation between countries, especially the United States and Russia, is the thing that made the space station happen. It wasn't the exploration, it wasn't the scientific advancements, it wasn't the engineering capabilities, it wasn't the commercial aspects and possibilities. No, it was the politics of countries working together, cooperating and creating something jointly. So to a great extent, the business of the space station is being an international space station. In the same way, when we go into creating any application, we need to understand what it is we're trying to do. We're not always trying to sell the newest widget at the lowest price. We might want to do something that is politically more complex, which means that we need to be able to align our reliability goals to this target. For a long time, the space station was basically supporting itself but wasn't doing much experimentation, because those components had not yet been launched. But still, humans started being in the space station, working in the space station, as early as possible, because there was value simply in being there. The smallest things can cause the largest headaches. As site reliability engineers, we're always conscious of the fact that we want to learn from mistakes, not just to find someone to blame, but to understand the underlying reason that the problem occurred. Well, here's one example of why this is sometimes difficult. In 2018, an air leak was detected in the space station.
After lengthy examinations, the source of the leak was found: a hole in the side of one of the spacecraft which had recently docked with the station. Now, the immediate suspect in the case of a small hole in a spacecraft is a meteorite or another piece of space junk hitting it. Just a case of bad luck and statistics. That's why the station can survive multiple such strikes, and the astronauts can patch up any hole quite quickly. However, in this case, it quite obviously was not a random piece of metal which punched the hole; it was drilled. But how could a spacecraft fly into space with a hole drilled into it? There are basically two possibilities. The first is that after the spacecraft docked with the space station, an astronaut took a drill and drilled a hole in the spacecraft. The second is that an engineer did the same thing on the ground and applied a patch which passed the pressure tests on the ground and failed a few weeks later up in space. But why would either of them do something like this? It's hard to say. Perhaps it was sabotage. Perhaps it was user error, a slipped drill, and a cover-up instead of a proper fix. In any case, no public summary of the cause of the issue has ever been published. While there has been a certain amount of blame game going around in the press, I'm not going to go into any details. I just wanted to remind you that while we should always try to remain technical and detached and blameless, sometimes we won't be able to remain as detached as we'd like from the political processes which are hovering above us. Here are some of the lessons which I hope you've seen during this session. The first one, which we learned from Skylab, is that monoliths are simpler, even if they might be wasteful and more expensive in the long term. When you choose your MVP, it might be a spacecraft, a small, stateless solution. It might be a monolithic space station, or it might be a modular space station. Don't decide your technology before you decide what you want to do with it.
Technical debt is the biggest problem that we have in the industry. It's crippling. You have to be sure that you know how to transfer your knowledge. Don't get into the situation that the Russian space agency is in today, when they have virtually no one with the skills to support the old Elektron oxygen system. Remove old technology when you can and replace it with new technology. If you can avoid problems instead of solving them again and again, bring in something else that will make the problem disappear completely, like the Europeans are doing with the new oxygen system, which does not require water at all. Lower the cost of learning. Technology is going forward at an ever-increasing rate, and we can't all hire only astronauts to solve our problems. This is where AI can help, by pointing out what the right documentation is and what the right troubleshooting procedures are, helping us find how to solve problems faster. The topology of our systems is ever changing. No matter which diagram of the International Space Station I show you, chances are it's going to be a wrong diagram, because something has happened in the last few weeks. And in cloud native development, these last few weeks could be the last few seconds. Make sure that you have redundant solutions and backups for cases when you can't get rid of your technical debt. Be ready for something to fail and have a solution in place to solve it. Make sure you've got good resource management. You never know when you might need a new size of spacesuit. You never know when you might need a new node for your Kubernetes cluster or a new runtime for your system. It's impossible to have a completely blameless post-incident analysis, simply because we're humans and simply because politics is part of technology. But try not to blame astronauts, not to blame people directly.
Keep it as process-driven as possible, and remember that the technology is cool and the deployment is where the fun is, but operations and production are what keep the business going, get the money coming in, make our clients happy, and give us support to go on for another day and a new version of the product. Technology is cool, but the business and the politics of the business are vital. Keep up with the technologies and adopt the new things that you can, but don't make it your goal. The space station before the International Space Station was constantly reinventing itself using the latest and greatest technologies, but it never got off the ground. So make sure that your solutions can get to the cloud and beyond. Now, if you enjoyed this session, and I really hope you did, I've collected some links to further reading which might interest you. I didn't really want to get into all the gory details of each and every component in the space station or all the flights which were made in order to build it up piece by piece. If you're interested in that, then you can go read more about it. The Reference Guide to the International Space Station is published by NASA. It's available online. Just google it; the link is very long. I write a blog about these and similar lessons, from the lunar landing and the shuttle to site reliability engineering. I think there are a lot of things that NASA learned which are relevant to the work that we do as site reliability engineers today. There's a lot we can learn from them and a lot we can be inspired by, and this is my collection of such lessons. NASA has actually created its own public database of significant incidents in human spaceflight. Again, there's a link down here.
From the perspective of what IBM is doing in this domain, here are two links which will lead you down the rabbit hole into a lot of further information about modern service management and operations, site reliability engineering, AI operations, ChatOps, which is my favorite, and so on. And one last link is about Kubernetes on the space station: IBM is working together with NASA, HP, and other partners to deploy a unique version of cloud computing far, far above the cloud. Thank you and enjoy the rest of the conference.
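
One lesson from the talk is to keep several complementary systems and an emergency fallback ready for when something fails. The following is a minimal Python sketch of that idea, not anything that runs on the station. `OxygenSource`, `first_working`, `clogged`, and the sample sources are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class OxygenSource:
    """A hypothetical supplier: `produce` returns an amount or raises RuntimeError."""
    name: str
    produce: Callable[[], float]


def first_working(sources: List[OxygenSource]) -> Tuple[str, float]:
    """Try each source in priority order and return the first one that works."""
    errors = []
    for source in sources:
        try:
            return source.name, source.produce()
        except RuntimeError as exc:
            # Record the failure and fall through to the next redundant source.
            errors.append(f"{source.name}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


def clogged() -> float:
    """Simulates a primary system that is down for maintenance."""
    raise RuntimeError("byproduct clogged the line")


# Priority order: the primary system, an improved secondary, an emergency fallback.
sources = [
    OxygenSource("primary", clogged),
    OxygenSource("secondary", lambda: 5.0),
    OxygenSource("bottled", lambda: 1.0),
]

name, amount = first_working(sources)
print(name, amount)  # secondary 5.0
```

The same shape applies to a service with a primary backend, a replica, and a cached or degraded response: the caller only sees a failure when every layer has failed, and the collected error messages give the post-incident analysis something process-driven to start from.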

Robert Barron

AIOps, ChatOps & SRE @ IBM
