Hi.
My name is Steve Malmskog with Netskope.
And I'd like to welcome you to another Movie Line Monday.
Today's topic is health checking in the cloud.
And today's quote is, "It just so happens that your friend
here is only mostly dead."
That line comes from the movie The Princess Bride.
And is from one of the characters, Miracle Max, who's
played by Billy Crystal, trying to evaluate the
protagonist, Westley, as to whether he's actually dead,
mostly dead, or actually alive.
And so we're going to talk about something very similar
here as we talk about health checking in the cloud.
And I want to talk about four basic areas.
One is the idea of in-band health checking.
The other is the idea of out of band health checking.
The third is the kind of health checking, or the nature of
health checking.
And then lastly, we want to talk a little bit about
collecting stats around the health checking.
So let's get started.
So if you have traffic coming in to your cloud, you have
several nodes in the cloud.
You're distributing that traffic across for a
particular request.
Maybe you have a load balancer on the front.
You're going to web servers.
And as you fill out the page that's going to be returned,
you're making several requests out to different resources.
So these could be internal APIs.
Maybe those APIs are talking back to
data stores, et cetera.
But as that traffic flows through your cloud, there's
information that you can gather about the health of
each of the nodes based on just the traffic itself.
So for example, if you were making a request out to a
particular server here that may be an API, normally you
would get an API response back.
But in some cases, you might get something like a 501
error, or something like that.
Instead of just simply propagating that error back to
the client, one of the things that in-band health checking
does, and the idea behind it, is that this node here is
recording the statuses of things that are returning back
based on the normal flow of traffic that's
initiated from a client.
And so the first time that happens, you may still
consider the server as acceptable.
You consider it a transitory situation.
And if the next time things are normal, you
don't worry about it.
But eventually there might be some threshold where, maybe
after three times, you suddenly mark that server as
being down, and there's an actual problem.
But at the end of the day, in this model, you're using the
traffic driven by the client to determine the
health of the nodes.
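As a rough illustration of that in-band idea, here is a minimal Python sketch. The class name, the node name "api-1", and the choice to treat any 5xx status as a failure are assumptions made for the example; the threshold of three follows the "maybe after three times" mentioned above, and resetting the count on a normal response is one possible reading of "you don't worry about it."

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class InBandTracker:
    """Tracks node health using only the results of normal client traffic."""

    # Assumed threshold: mark a node down after three failures in a row.
    failure_threshold: int = 3
    consecutive_failures: Dict[str, int] = field(default_factory=dict)

    def record(self, node: str, status_code: int) -> None:
        """Record the status a node returned for a real client request."""
        if status_code >= 500:
            # Server-side error: count it against the node.
            self.consecutive_failures[node] = self.consecutive_failures.get(node, 0) + 1
        else:
            # A normal response means the earlier errors were transitory.
            self.consecutive_failures[node] = 0

    def is_up(self, node: str) -> bool:
        """A node is considered up until it reaches the failure threshold."""
        return self.consecutive_failures.get(node, 0) < self.failure_threshold


if __name__ == "__main__":
    tracker = InBandTracker()
    for status in (501, 501, 501):
        tracker.record("api-1", status)
    print(tracker.is_up("api-1"))
```

The caller would record every response as it proxies traffic, and skip nodes where is_up returns False.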
In the second model, you have what's called out of band
health checking.
So out of band health checking, which I'll use a
different color here, is the idea that outside of the
normal traffic that's flowing, you're actually making
requests to these services and getting results back and
collecting those on some periodic basis.
So for example, every three, four, five, seconds, maybe
every ten seconds.
Some kind of network activity is occurring
between these two nodes.
You're checking whether or not that activity was successful.
And if it was, you're fine.
And typically, that type of health checking, the out of
band health checking, occurs when you have traffic that's
flowing that's not uniform across the cloud.
So you might have some services that are used less
than others.
You might have situations where you have traffic
patterns where certain times of the day, there are very low
traffic patterns.
And so all the servers aren't always being utilized.
In those kinds of cases, out of band health checking
becomes very useful.
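A minimal sketch of an out of band checker might look like the following. The function names, the five-second default interval, and the use of a plain TCP connect as the probe are assumptions for illustration only; the probe is passed in as a parameter so that a deeper check can be swapped in.

```python
import socket
import time
from typing import Callable, Dict, Sequence, Tuple

Target = Tuple[str, int]  # (host, port)


def tcp_probe(target: Target, timeout: float = 2.0) -> bool:
    """Shallow probe: succeeds if a TCP connection can be opened."""
    try:
        with socket.create_connection(target, timeout=timeout):
            return True
    except OSError:
        return False


def poll(
    targets: Sequence[Target],
    probe: Callable[[Target], bool] = tcp_probe,
    interval_seconds: float = 5.0,
    rounds: int = 1,
) -> Dict[Target, bool]:
    """Probe every target on a fixed period, independent of client traffic.

    Returns the most recent result for each target.
    """
    latest: Dict[Target, bool] = {}
    for round_number in range(rounds):
        for target in targets:
            latest[target] = probe(target)
        if round_number < rounds - 1:
            time.sleep(interval_seconds)
    return latest
```

In a real system the loop would run continuously in the background rather than for a fixed number of rounds; the rounds parameter just keeps the sketch easy to run and stop.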
But in addition to the health checking techniques
themselves, the other thing that's important to know or to
consider is the type of health checking you're doing.
What kinds of health checks are you making?
And so what's often done is that when a health check is
made, if someone asks, oh, is that server up, for example.
The first thing someone says is, I don't know.
Did you ping the server?
And so you jump on the server.
You do a ping.
But a ping actually tells you very
little about the server itself.
If we draw up here a quick stack of things that might be
going on in, say, an API service or some other
application service, the typical line runs somewhere
here, where you're talking about the kernel
being at this place.
And then user space is up here.
If you're making a ping request out to the server,
actually you're only going up to the ICMP layer.
And so if your ping responds back, essentially all you have
determined is that the kernel can respond with a nice
[INAUDIBLE] packet.
But it's very possible that this server can't serve
applications.
In that case, you have a server that's mostly dead.
And what you need is to know, is the server
actually fully alive?
And so to that end, if you're designing health checking
systems, do not use ping as a mechanism for
doing health checks.
What you want is you want a full mechanism that's going
all the way up the stack to the application.
And as you encounter each of the layers, these logical
layers that are moving up, you're recording information.
For example, how long did it take to actually
make the TCP connect?
How long did the SSL handshake take to finish?
And then how long did the application
take to actually respond?
So if you incorporate all of this information, your health
check is much more valid.
And you know if something is really alive versus just
mostly dead.
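To make that concrete, here is a hedged Python sketch of a check that goes all the way up the stack and times each layer. The function name, the "/health" path, the default port 443, and the rule that only a 2xx status counts as healthy are assumptions for the example, not something specified in this talk.

```python
import socket
import ssl
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    healthy: bool
    tcp_connect_seconds: Optional[float] = None
    tls_handshake_seconds: Optional[float] = None
    app_response_seconds: Optional[float] = None
    status_code: Optional[int] = None
    error: Optional[str] = None


def deep_check(host: str, port: int = 443, path: str = "/health",
               timeout: float = 5.0) -> CheckResult:
    """Check a service through TCP, TLS, and the application, timing each layer."""
    result = CheckResult(healthy=False)
    context = ssl.create_default_context()
    try:
        # Layer 1: TCP connect (the kernel alone can answer this).
        start = time.monotonic()
        raw = socket.create_connection((host, port), timeout=timeout)
        result.tcp_connect_seconds = time.monotonic() - start
        with raw:
            # Layer 2: SSL/TLS handshake (runs inside wrap_socket by default).
            start = time.monotonic()
            tls = context.wrap_socket(raw, server_hostname=host)
            result.tls_handshake_seconds = time.monotonic() - start
            with tls:
                # Layer 3: the application has to produce an HTTP response.
                start = time.monotonic()
                request = (f"GET {path} HTTP/1.1\r\n"
                           f"Host: {host}\r\nConnection: close\r\n\r\n")
                tls.sendall(request.encode("ascii"))
                status_line = tls.makefile("rb").readline()
                result.app_response_seconds = time.monotonic() - start
                result.status_code = int(status_line.split()[1])
                result.healthy = 200 <= result.status_code < 300
    except (OSError, ValueError, IndexError) as exc:
        # Whichever timings are still None show how far up the stack we got.
        result.error = f"{type(exc).__name__}: {exc}"
    return result
```

A server that accepts the TCP connection but never completes the handshake or never answers the request shows up here with a connect time and nothing else, which is the "mostly dead" case.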
Lastly, you need to think about, we have this cloud with
all kinds of nodes inside of it.
And you need to think about how to act on the information
that you're gathering in the cloud here.
So there's different nodes that you're
making requests to.
These are gathering stats about the
health inside the cloud.
But what you want to do is, as these are gathering
information, you need that information to be published to
some central location--
for example, some kind of stats collection.
We tend to use StatsD, for example, as an internal
technology.
It's very, very successful.
That gathers the information of what's going on and allows
us to attach alerting, dashboards, emails, et cetera,
all out from a central system here.
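As a sketch of what publishing might look like, the following Python sends health check results using the StatsD plain-text line format over UDP. The metric names and helper function names are made up for illustration, and the address assumes a StatsD daemon listening locally on its default port of 8125.

```python
import socket
from typing import Iterable


def format_timing(name: str, seconds: float) -> str:
    """StatsD timer line; StatsD expects timings in milliseconds."""
    return f"{name}:{round(seconds * 1000)}|ms"


def format_counter(name: str, count: int = 1) -> str:
    """StatsD counter line."""
    return f"{name}:{count}|c"


def send_metrics(lines: Iterable[str], host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send newline-separated metric lines in a single UDP datagram.

    UDP is fire-and-forget, so a slow or absent collector does not
    block the health checker itself.
    """
    payload = "\n".join(lines).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Hypothetical metric names for one health check of a node called api1.
    send_metrics([
        format_timing("healthcheck.api1.tcp_connect", 0.0423),
        format_counter("healthcheck.api1.failure"),
    ])
```

Alerting and dashboards would then be attached to the central collector rather than to each node doing the checks.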
So that wraps up another Movie Line Monday.
If you have any questions about health checking in the
cloud, or another topic that you'd like to see, feel free
to email us at MovieLineMonday@netskope.com.
And again, I'm Steve Malmskog with Netskope.
And thanks for watching.