Sunday, October 11, 2015

Redundancy, Fault Tolerance, and High Availability

Whenever we think about keeping all of our systems up and running in an environment, we often think about what can happen if we lose a server, a router, or another component among our devices. So we have to think about redundancy and fault tolerance. These are very similar ideas, and the goal of both is to maintain uptime. We want to be sure that all of the systems on our network are usable, that all of the resources are available to us, and that our company continues to function the way it should.
So we need to make sure, for instance, that we don’t have a hardware failure. We may want to have redundant servers. Or within a single server, we may want to have redundant power supplies. And so by keeping those redundancies of those systems, if we happen to lose a power supply or we happen to lose a motherboard in a server, we’ve got another one sitting right there, ready to take its place so that we can keep things up and running.
We also need to think about the software that we’re running on these systems. We may want to get software that’s able to notify us whenever there’s a problem, or work in conjunction with other pieces of software that might be running, perhaps in a cluster, so that if one particular piece of software fails, you’ve got other pieces of software running on the same network that are able to pick up the slack should that problem occur.
And we also want to be sure we don’t have any major system problems. Maybe we would like to have redundant routers, redundant firewalls, or redundant wide-area network links to the internet. You can apply different types of redundancy and fault tolerance to many environments. By having these extra systems in place, we can keep our systems available and up and running as close to 100% of the time as possible.
Now just because you have multiple servers or multiple systems– you’ve got that redundancy– doesn’t necessarily mean that your environment is highly available. High availability means that the systems will always be available regardless of what happens. With redundancy, you may have to flip a switch to move from one server to the other, or you may have to power up a new system to be able to have that system available. High availability is generally considered to be always on, always available.
If you have multiple high availability systems and you lose one, it doesn’t matter. Everybody continues to run because you’ve got an extra system ready to take up the extra slack, the extra load associated with that resource. There may be many different components working together to have this happen.
You may have multiple wide-area network connections with multiple routers, multiple firewalls, and multiple switches going to multiple servers, all working together in conjunction. Each one of those sections would be set up to have high availability, so that if any particular one of them failed, all of the other components could work together to keep the resources up and running in your organization.
Now, redundancy and fault tolerance mean that we’re going to need redundant hardware components. You can already think about having multiple power supplies, or multiple devices available for us to use. We might also want to have multiple disks. Within a single server, in fact, you can have something called RAID, a Redundant Array of Independent Disks. The RAID methodology means that if we lose one disk, we have options to keep the system up and running without anybody ever knowing that there was a problem with that piece of hardware.
Another piece of hardware we may want to have– because we’re never quite certain how reliable the power in our environment will be– is something called an uninterruptible power supply. You’ll hear this referred to as a UPS. If we ever lose power, these UPS systems have batteries and other methods inside of them to keep things up and running.
And those UPS systems can be extremely valuable, especially if you’re in an environment where power is always a little sketchy. You may be in the southern United States during the summer where there are a lot of thunderstorms. Power goes on and off all the time. You almost require a UPS on your system to make sure things are available to you.
If you want to be sure that resources running on a server are available, you may want to consider clustering a number of servers together. That way, if you lose a motherboard, if a system becomes unplugged, or if a piece of software on a system fails, you have these extra systems in your cluster to keep everything up and running. And since all of those clustered machines are talking to each other, they know if there’s an outage, and they’ll be able to take over those resources and make sure that everybody is able to run all of the systems they need to run.
You’ll also very often see these systems load balancing. It’s very important: if you have multiple systems in place, you want to have all of them running all the time so that you’re balancing the load between them. And if you lose one, everybody will fail over to the others. Because the load is being balanced, you’ll want to make sure that the remaining machines have additional resources available so they’re able to keep up with the load. It’s a lot like having multiple engines on a plane. If you lose one engine, the extra engine is designed to keep that plane in the air until you’re able to get it down on the ground safely.
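To make the idea concrete, here’s a minimal sketch of round-robin load balancing with failover. The server names and the health-check function are hypothetical, purely for illustration:

```python
# A toy round-robin load balancer. Server names and the is_healthy()
# check are hypothetical -- real balancers probe health continuously.
from itertools import cycle

class LoadBalancer:
    def __init__(self, servers):
        self.servers = list(servers)

    def route_requests(self, requests, is_healthy):
        """Spread requests across the healthy servers in round-robin order."""
        pool = [s for s in self.servers if is_healthy(s)]
        if not pool:
            raise RuntimeError("no servers available")
        rotation = cycle(pool)
        return [(req, next(rotation)) for req in requests]

lb = LoadBalancer(["web1", "web2", "web3"])

# All servers healthy: the load is spread across all three.
assignments = lb.route_requests(["r1", "r2", "r3", "r4"], lambda s: True)

# If web2 fails, the survivors absorb its share of the load.
assignments_after_failure = lb.route_requests(
    ["r1", "r2", "r3", "r4"], lambda s: s != "web2")
```

Notice that the surviving servers each carry more requests after the failure, which is why the text above recommends keeping spare capacity on every machine in the pool.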
I mentioned the Redundant Array of Independent Disks that you might have inside of a single server. There are different types of RAID out there, and this chart shows you the primary kinds that you’ll run into. RAID 0, for instance, is a method called striping without parity. What that means is you have multiple disks, and parts of each file are written across those multiple disks– only part of the file to each– which gives very high performance, because we’re writing small pieces to many different disks at the same time. The problem is, there’s no parity, which means if we lose any one of those disks, the entire array is unavailable to us. So there’s no fault tolerance associated with that at all.
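Here’s a toy illustration of that striping idea. The block size and disk count are arbitrary assumptions for the example:

```python
# A toy model of RAID 0 striping: data is split into fixed-size blocks
# distributed round-robin across disks, with no parity anywhere.
def stripe(data, num_disks, block_size=4):
    """Distribute blocks of `data` across `num_disks` disks round-robin."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    disks = [[] for _ in range(num_disks)]
    for i, block in enumerate(blocks):
        disks[i % num_disks].append(block)
    return disks

disks = stripe(b"ABCDEFGHIJKLMNOP", num_disks=2)
# disks[0] holds blocks 0 and 2; disks[1] holds blocks 1 and 3.
# Losing either disk loses half of every large file -- no fault tolerance.
```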
Another RAID type is RAID 1, or mirroring, where we are exactly duplicating this information across multiple disks. So if I have a 2 terabyte disk, I’ll have a duplicate 2 terabyte disk that has exactly the same information on it. If I lose the first disk, it continues to run, because now we’re fault tolerant. I can use the exact copy of that disk in RAID 1.
RAID 5 is very similar to RAID 0. It is striping, but it includes an extra drive for parity data, which means I’m not keeping an exact duplicate of the data, but if I lose any one of those drives, I still have a way to reconstruct all of the data from the remaining disks. It’s a fairly advanced system to be able to do something like that, but it means that if I lose any single physical drive, I’m still up and running. And I’m not storing an exact duplicate of all my data, as I would be with RAID 1. So we’ve got some efficiencies there in the amount of storage in our systems.
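The parity trick can be sketched in a few lines. This is a simplified model– real RAID 5 rotates the parity block across all of the drives– but the core idea is that the parity is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors:

```python
# Simplified RAID 5 parity: parity = XOR of all data blocks, so any one
# missing block equals the XOR of the remaining blocks plus the parity.
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*blocks))

data_blocks = [b"ABCD", b"EFGH", b"IJKL"]   # blocks striped across 3 disks
parity = xor_blocks(data_blocks)            # stored on a 4th disk

# Simulate losing the second disk, then rebuild it from survivors + parity.
surviving = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(surviving)             # recovers b"EFGH"
```

This is also why RAID 5 is more storage-efficient than mirroring: one parity drive protects several data drives, instead of duplicating every drive.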
Occasionally, you’ll see these RAID levels combined with each other. You might have striping without parity, but mirror that stripe set. Or you might stripe data with parity and then mirror the whole array. So you’ve got different options where you can combine these things together. You’ll often see RAID 0+1, where you are striping and mirroring at the same time, or RAID 5+1, where you are striping with parity and mirroring at the same time. There’s a lot of flexibility there, and if you’re building these file systems in your servers, you’ll want to check and see what RAID options might be available to you.
I mentioned server clustering. That’s a really useful way to keep systems up and running, and to provide availability 100% of the time. In an active/active server cluster, all of your end users are out here accessing different servers in your environment. And these servers are always active with each other. They’re constantly communicating between each other so that the two systems know if they’re available and running. And then you have behind-the-scenes storage that both of these systems will share. The idea is that if you lose one section of this cluster, everybody can still go right to the other active side of the cluster to be able to use those resources.
An active/passive cluster is a little bit different. In active/passive, you have one system that is always active and one system that is always passive. The passive system is sitting there and doing nothing. It is waiting for a problem to occur. These clusters are always talking to each other and making sure they’re up and running. And if Node 2 notices that Node 1 has disappeared, that the active system is no longer there, it automatically makes itself available to the world. And now all of the clients begin using the backup or the passive system to be able to perform whatever function they need across this network.
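The heartbeat logic described above can be sketched very simply. The node names, timeout value, and method names here are hypothetical, just to show the shape of the mechanism:

```python
# A minimal sketch of active/passive failover driven by heartbeats.
# The 3-second timeout is an arbitrary assumption for the example.
class PassiveNode:
    def __init__(self, timeout=3.0):
        self.timeout = timeout       # seconds of silence before takeover
        self.last_heartbeat = 0.0
        self.active = False          # passive until the primary disappears

    def on_heartbeat(self, now):
        """Called each time the active node checks in."""
        self.last_heartbeat = now

    def check(self, now):
        """Promote this node if the active node has gone quiet."""
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.active = True       # begin serving clients ourselves
        return self.active

node2 = PassiveNode(timeout=3.0)
node2.on_heartbeat(now=10.0)
node2.check(now=11.0)   # heartbeat is recent: node2 stays passive
node2.check(now=14.5)   # 4.5s of silence: node2 promotes itself
```

A real cluster would also have to guard against both nodes believing they are active at once (a "split-brain" condition), which is why production clustering software is considerably more involved than this sketch.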
Active/passive systems are generally much easier to implement, because both nodes are exactly the same type of system and only one serves clients at a time. Active/active tends to be a little more complex to implement, because you now have multiple clients talking to multiple servers simultaneously, and there has to be a way to keep track of that and make sure everybody’s talking to the right system at any one time. But whether you’re using active/active or active/passive, you have systems that are redundant and available should there be any problems on your network.
If you’re planning to have redundant systems, you may not have them all running the same way. You may have cold spares, which means you’ve bought an additional server, but you’re keeping it in a box in a storeroom somewhere. You may have 10 servers sitting in the rack, and if any of those 10 servers fails, you can go to the storeroom, pull your one spare out of there– your cold spare– put that in the rack. And then of course, you have to configure it, because this is a fresh configuration.
You may want to have something called a warm spare, which means that spare is something that you might have even put into the rack. You’ll occasionally have it turned on. You may have it updated with the latest software, updated with your configurations. That way if you do have a problem, you simply flip a switch, turn it on, or plug it in. And now that warm spare is ready to go. You don’t have to now perform any additional configurations or load any additional software to get that running.
And obviously, your last option is a hot spare. It’s always on, always updated, and in many cases, it’s designed to automatically take over should there be a problem. So if a system does go down, you can immediately move to the hot spare– an exact, up-to-date duplicate that everybody can now use to perform the functions they need on your network.

