Netskope Mlm with Steve - Health checking for cloud performance

Hi. My name is Steve Malmskog with Netskope, and I'd like to welcome you to another Movie Line Monday. Today's topic is health checking in the cloud. And today's quote is, "It just so happens that your friend here is only mostly dead." That line comes from the movie The Princess Bride. It's spoken by one of the characters, Miracle Max, who's played by Billy Crystal, as he tries to evaluate whether the protagonist, Westley, is actually dead, mostly dead, or actually alive. We're going to talk about something very similar here as we talk about health checking in the cloud.
I want to talk about four basic areas. The first is the idea of in-band health checking. The second is the idea of out-of-band health checking. The third is the kind, or the nature, of the health checking. And lastly, we want to talk a little bit about collecting stats around the health checking. So let's get started.
So say you have traffic coming in to your cloud, and you have several nodes in the cloud that you're distributing that traffic across for a particular request. Maybe you have a load balancer on the front, and you're going to web servers. And as you fill out the page that's going to be returned, you're making several requests out to different resources. These could be internal APIs. Maybe those APIs are talking back to data stores, et cetera. But as that traffic flows through your cloud, there's information that you can gather about the health of each of the nodes based on just the traffic itself. So for example, if you were making a request out to a particular server here that may be an API, normally you would get an API response back. But in some cases, you might get something like a 501 error, or something like that.
Instead of just simply propagating that error back to the client, the idea behind in-band health checking is that this node here is recording the statuses of the responses coming back, based on the normal flow of traffic that's initiated from a client. So the first time an error happens, you may still consider the server acceptable. You consider it a transitory situation. And if the next time things are normal, you don't worry about it. But eventually there might be some threshold where, maybe after three times, you mark that server as being down, because there's an actual problem. At the end of the day, in this model, you're using the traffic driven by the client to determine the health of the nodes.
In the second model, you have what's called out-of-band health checking. Out-of-band health checking, which I'll use a different color for here, is the idea that outside of the normal traffic that's flowing, you're actually making requests to these services, getting results back, and collecting those on some periodic basis. So for example, every three, four, five seconds, maybe every ten seconds, some kind of network activity is occurring between these two nodes. You're checking whether or not that activity was successful. And if it was, you're fine. Typically, that type of health checking, the out-of-band health checking, is used when the traffic that's flowing is not uniform across the cloud. You might have some services that are used less than others. You might have traffic patterns where, at certain times of the day, there's very little traffic, so the servers aren't all always being utilized. In those kinds of cases, out-of-band health checking becomes very useful.
But in addition to the health checking techniques themselves, the other thing that's important to consider is the type of health checking you're doing.
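As a rough illustration of the two models just described, here is a minimal Python sketch. Everything in it is an assumption made for illustration, not something shown in the talk: the `NodeHealth` class, the `poll` function, treating any 5xx status as a failure, the threshold of three (taken from the "maybe after three times" remark), and the choice to mark a node up again on the first good response.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class NodeHealth:
    """Hypothetical tracker for one upstream node."""
    failure_threshold: int = 3      # assumed: "maybe after three times"
    consecutive_failures: int = 0
    up: bool = True

    def record(self, status_code: int) -> bool:
        """Record one response status and return whether the node counts as up."""
        if status_code >= 500:      # assumed: any 5xx counts as a failure
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.up = False
        else:
            # A normal response means the earlier error was transitory.
            self.consecutive_failures = 0
            self.up = True
        return self.up


def poll(health: NodeHealth, probe: Callable[[], int],
         rounds: int, interval_s: float = 5.0) -> bool:
    """Out-of-band variant: feed a periodic probe's status into the tracker."""
    for _ in range(rounds):
        health.record(probe())
        time.sleep(interval_s)
    return health.up
```

For in-band checking, the node that makes the upstream call would call `record` with the status of every real, client-driven response. For out-of-band checking, `poll` drives the same tracker on a timer, whether or not any client traffic is flowing.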
What kinds of health checks are you making? What often happens is that someone asks, oh, is that server up? And the first thing someone says is, I don't know. Did you ping the server? And so you jump on the server and you do a ping. But a ping actually tells you very little about the server itself. If we draw up here a quick stack of things that might be going on in, say, an API service or some other application service, the typical line runs somewhere here, where the kernel is down at this place and user space is up here. If you're making a ping request out to the server, you're actually only going up to the ICMP layer. So if your ping responds back, essentially all you have determined is that the kernel can respond with a nice [INAUDIBLE] packet. But it's very possible that this server can't serve applications. In that case, you have a server that's mostly dead. And what you need to know is, is the server actually fully alive?
So to that end, if you're designing health checking systems, do not use ping as a mechanism for doing health checks. What you want is a full mechanism that goes all the way up the stack to the application. And as you encounter each of the logical layers moving up, you're recording information. For example, how long did it take to actually make the TCP connect? How long did the SSL handshake take to finish? And then how long did the application take to actually respond? If you incorporate all of this information, your health check is much more valid, and you know whether something is really alive versus just mostly dead.
Lastly, we have this cloud with all kinds of nodes inside of it, and you need to think about how to act on the information that you're gathering in the cloud here. So there are different nodes that you're making requests to.
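The full-stack check described above, timing the TCP connect, the SSL handshake, and the application response, might look something like the following Python sketch. It uses only the standard library. The function name `probe_layers`, the `/healthz` path, the timeout, and the `use_tls` switch are all assumptions for illustration and do not come from the talk.

```python
import socket
import ssl
import time


def probe_layers(host: str, port: int = 443, path: str = "/healthz",
                 use_tls: bool = True, timeout: float = 5.0) -> dict:
    """Time each layer of one HTTP(S) request.

    Any exception (refused connection, handshake failure, timeout,
    malformed reply) should be treated by the caller as a failed check.
    """
    timings = {}

    # Layer 1: TCP connect.
    start = time.monotonic()
    conn = socket.create_connection((host, port), timeout=timeout)
    timings["tcp_connect_s"] = time.monotonic() - start
    try:
        # Layer 2: TLS handshake (performed inside wrap_socket).
        if use_tls:
            ctx = ssl.create_default_context()
            start = time.monotonic()
            conn = ctx.wrap_socket(conn, server_hostname=host)
            timings["tls_handshake_s"] = time.monotonic() - start

        # Layer 3: the application itself has to answer.
        request = (f"GET {path} HTTP/1.1\r\n"
                   f"Host: {host}\r\nConnection: close\r\n\r\n")
        start = time.monotonic()
        conn.sendall(request.encode("ascii"))
        with conn.makefile("rb") as reader:
            status_line = reader.readline()
        timings["app_response_s"] = time.monotonic() - start
        timings["status"] = int(status_line.split()[1].decode("ascii"))
    finally:
        conn.close()
    return timings
```

A server that answers ping but fails or stalls at any of these three steps would be the "mostly dead" case, and the per-layer timings show which layer is the slow one.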
These are gathering stats about the health inside the cloud. But as these are gathering information, you need that information to be published to some central location, for example, some kind of stats collection. We tend to use StatsD, for example, as an internal technology. It's very, very successful. That gathers the information about what's going on and allows us to attach alerting, dashboards, emails, et cetera, all from a central system here.
So that wraps up another Movie Line Monday. If you have any questions about health checking in the cloud, or another topic that you'd like to see, feel free to email us at MovieLineMonday@netskope.com. And again, I'm Steve Malmskog with Netskope. Thanks for watching.
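As a footnote to the stats collection point, here is a minimal Python sketch of how a node might publish its health check results to StatsD. It relies on the plain-text StatsD format (`name:value|type`, sent over UDP, conventionally on port 8125). The metric prefix, the naming scheme, and the function names are assumptions for illustration. The talk does not describe how Netskope names or sends its metrics.

```python
import socket
from typing import Dict, List


def statsd_lines(node: str, timings_s: Dict[str, float], up: bool,
                 prefix: str = "healthcheck") -> List[str]:
    """Format one health check result as StatsD lines (hypothetical naming)."""
    safe_node = node.replace(".", "_")   # dots separate StatsD name segments
    lines = [f"{prefix}.{safe_node}.up:{1 if up else 0}|g"]   # gauge
    for name, seconds in timings_s.items():
        # Timers are reported in milliseconds.
        lines.append(f"{prefix}.{safe_node}.{name}:{seconds * 1000:.0f}|ms")
    return lines


def publish(lines: List[str], host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send each line as its own UDP datagram; delivery is fire-and-forget."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))
```

Because the sends are UDP, a slow or missing collector does not block the health checker itself. Alerting and dashboards would then hang off the central system rather than off each node.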