Fault tolerance and high availability are of no use if we lose application state during rescheduling.
Having state is unavoidable, and we need to preserve it no matter what happens to our applications, servers, or even a whole datacenter.
The way to preserve the state of our applications depends on their architecture. Some store data in memory and rely on periodic backups. Others are capable of synchronizing data between multiple replicas, so that the loss of one instance does not result in loss of data. Most, however, rely on disk to store their state. We’ll focus on that group of stateful applications.
If we are to build fault-tolerant systems, we need to make sure that failure of any part of the system is recoverable. Since speed is of the essence, we cannot rely on manual operations to recover from failures. Even if we could, no one wants to be the person sitting in front of a screen, waiting for something to fail, only to bring it back to its previous state.