Hi and good afternoon. This is Bill Alderson, thank you for joining us. I'm going to do a presentation on Network Management in the Theater of War. This is a presentation that I did at the MilCIS conference last year in Australia. It was a lot of fun going Down Under and talking to some of our military counterparts down there about network management. Had a good time! Anyway, I appreciate you joining us today.
So a little bit about me: I consider myself an application and network performance analysis advocate. I know that end users are the most important folks in my career, and I focus on working toward their benefit in everything that I do. I started out at Lockheed in Sunnyvale, California, the heart of Silicon Valley, and was there from 1978 to 1984 as a communications analyst. I was a young guy who loved communications and was playing with every kind of computer that was coming out at that time: IBM, Apple, others.
I was right in the middle of it. It was a lot of fun. A little later on, I got working with Network General and the Network General Sniffer, at the start-up. Then I took my experience with the Sniffer, founded Pine Mountain Group, and created Network General Sniffer Training, which Network General licensed from me in perpetuity. At Pine Mountain Group we trained about 50,000 people in 22 countries and certified over 3,000 network forensic professionals.
I've done a lot of work with 75 of the Fortune 100 and with federal and state government agencies. I did a large event for NetWorld+Interop called Network Forensics Day. I sold Pine Mountain Group to NetQoS in 2005 and continued on as the Technology Consulting Officer. CA then acquired NetQoS, and I hung around as principal services architect. Then just this last year I founded Apalytics Corporation. I've got several customers at Apalytics. We did a subcontract to U.S. CENTCOM, where we went and did analysis of pretty much the entire war area.
That's what I'm talking about, and some of the things I learned through that experience of looking through every network control center throughout CENTCOM's AOR, in all those various countries, then working with those systems and helping to upgrade, architect, and document them so we could perform better network management. I'm also a consultant to the OSD-CIO at the Pentagon in the area of communications, and I provide services to the Department of Justice. I've got a pretty wide variety of experience. The talk today is about network management, but in order to understand network management, one must understand everything from end to end.
From the client, where you've got the user, the business transactions, the security access, the application interface, the operating system, the computing platform, there's the OSI model: the seven-layer model that brings it down to where you can take all of your data across disparate networks and systems, have it arrive at the server, and get it error checked and that sort of thing.
We're basically looking at the full spectrum of everything end to end, from client to server. Well, one of the things in the war area is that there are a lot of different people and a lot of different organizations that are in charge of different packets, from where the war fighter puts the packet on the wire until it gets to the server.
So consequently, in traversing all of these different network management zones and different countries and continents, going up to space and back, these packets are really mistreated. Everyone along the path has their own SLAs and their own network management systems, and they're all different. By golly, if you go through and you talk to anybody, all the contractors and everybody involved, everybody did a wonderful job and all met their SLAs. Everybody inside those silos is very happy, with one exception: the end user.
The end user is going to cross all these dissimilar systems, and when he's got a problem, where do you go to capture the packet if it doesn't get there? Who do you blame? Who do you call up, who do you talk to? The end-to-end analysis that I've done from all those areas back to CONUS, for different applications and that sort of thing, basically said, "Hey man, if you're an end user, you have no advocate. You have no centralized advocate." It's very, very difficult. You can call your local guy and he'll do local things, but he always comes back and says, "Must be somewhere else." Now your end user is in trouble. So those end users are usually the ones that contact me. Powerful, high-visibility, high-stakes end users usually call me up and say, "Hey." That's exactly what the military did. They called me up to have me help with a big application, and we went out and looked at all their network management and said, "Well, you don't know where any of your packets are going. You need some NetFlow information."
We implemented NetFlow across the entire AOR so we could see what applications were running and where. One biometric application in particular was very interesting; it was awesome to analyze. I helped the programmers actually recode several parts of their application to improve performance and that sort of thing. It was a really great thing. So network management does not just involve purchasing some network management tool, turning it on, and installing it. It involves understanding the clients that are accessing the applications that are going through it, so you can fine-tune it and nurture it, so you can find the signatures and the vital signs of those applications. When somebody tells me about network management, "Oh yeah, I'm a network management guy," well, do you know applications? Do you know the issues on the client? Do you know the issues on the server?
Do you see the virtualization issues? Do you see all the men-in-the-middle, all the firewalls, all the WAN optimizers? Do you see all the load balancers and that sort of thing? Do you understand how those things come into practice? Do you understand the quality of service across that end-to-end network? "Well, I know everything there is to know about this network management package. I can install it, and I've installed it." Yeah, but in order to install and use these systems, you must intrinsically understand what it is you're trying to help the clients achieve, and how to measure the performance of those systems all along the way. You've got to document your systems.
After we get off this slide, I'm going to talk to you about a number of things. We've got a lot of slides to cover today, but we're going to get there very quickly and give you a cursory overview. Here's your basic client to server: the client is separated from the server. We started doing that back in the late 80s. You take your client and your applications, you run them across various stacks, using HTTP or Java. You go across your transport network, you come up the other side, you go into your server. You pop back up your stack and into your respective upper-layer system. Then you go into your applications and processes on your server.
You have to draw this out. You have to know where your connection points are. You have to know where you are in the universe. If you're having a problem somewhere, you need to know exactly how you're going to hook in, where you're going to hook in, what kind of tool you're going to use, so that you can deconstruct the problem.
Now, people are going into the cloud, and that'll be another topic that we talk about. Cloud computing can be very problematic, because with the client and the server you own all of the infrastructure, but when you put that server over in the cloud you have no access to packets, network management information, statistics, etc., other than what that vendor gives you. It's going to be a cloud all right. So basically, I'm trying to help you understand that when you go in to analyze your environment, you need to know it from end to end.
Here's a client talking to an app server or an HTTP server. The back end is talking SQL or some other type of middleware or back-end process. Well, you have to know where to go to connect your network management systems for packet capture, for metrics, for analysis, so that you can understand. When anything goes wrong, which it will at some point or another, you'll need macro information about the use of the network, the use of the platform, the CPU, etc., of all the different systems. You need to see the entire system from end to end and understand all the various components, so that you can basically design the vital signs for the applications and systems. It's not as simple as just saying, okay, I'm monitoring layer three of the network. You have to be a little bit higher than that; you have to help your customers. That's my definition of network management.
Now, typically at one time or another, whether during implementation or later on, you're going to have a problem. You're going to be looking for that proverbial needle in a haystack. Here's your needle in a haystack. It's out there, it's somewhere; let's take a look where it might be. Hmm. It could be anywhere in Afghanistan, Iraq, or some point in between, or over somewhere in the United States, or down at Fort Huachuca. Your packets are traversing all of these various environments. Where is it? What's slowing it down, who's holding it up? Who's routing it? Who's misrouting it? Who's redirecting it? Who's changed its path because of a WAN optimizer or load balancer that you were unaware of? Those are your men-in-the-middle, and they're sitting out there impacting your environment. So you need to know your environment.
Then every once in a while you've got to gather some packet traces. You've got to gather them from somewhere between the user and their resultant service, so it could be anywhere. We usually gather packet traces based upon the macro information that we get from our network management system, which tells us and helps us triangulate where the problem might be. That's what we use network management for; then we'll go out and capture some traces and get down to the bottom line. I'll just tell you right here, and you can all argue with me, you can fight with me: if you are a network management person and you are implementing large-scale systems, and you have no ability to capture packets across your path, you are impotent. You can't do it, and you never will be able to do it. I'm telling you, if you don't have bottom-line, root-cause, deep packet analysis skills and capabilities, and the ability to capture those packets, you are impotent and you cannot do your job. Sooner or later you're going to come up against a problem that absolutely requires deep packet captures. You design your network management systems to help you find the area, and then zero in on it. Can you solve many problems with network management systems and capacity? Absolutely. Can you solve every one? No. It's not a very pretty picture when you're standing around with several million dollars' worth of network management tools giving you charts and graphs and all sorts of wonderful views of stuff, and you can't solve the problem.
It's because you don't have deep packet capture capability. You don't have the root cause analysis capabilities. That's what I've focused my whole career on: trained 50,000, and executed for 75 of the Fortune 100, to help them solve these types of problems. Then some people say, "I like packets so much I'm going to capture every single one and I'm going to store them forever." Then I ask, "Okay. You're going to store every packet everywhere? You've got unlimited amounts of money and resources and storage? Who's going to analyze those packets? Who's qualified within your organization? How are you going to filter in, and have the worst offending problems bubble up? What's your strategy for using these systems?" I know a lot of folks have a lot of this sort of stuff up in their environment, and they have all these tools and capabilities, but nobody ever goes and looks at them. "Oh, but they're there in case we need to do retrospectives."
That's good, and I'm not saying you shouldn't have any of it, but I'm saying that you'd better make certain, if you're going to buy this stuff, turn it on, and start caching packets all over the world, that you at least have somebody around who can help you take a look at them and know when and why. So when you finally get down to the stack with the problem, now you go to work in deep packet analysis, to deconstruct it and ultimately, boom, find the needle in the haystack. That's what it's all about: getting to the bottom line. Definitive results. Performance orientation. Getting the job done. No ifs, ands, wells, or buts about the situation. In order to get all that done, we use ITIL, we use people and processes, paradigms, tools, systems, platforms. We use all of these various systems. It's not just network management. It's not just deep packet analysis; it's not just application optimization. We have a whole portfolio of capabilities. As I'll talk about in a future session, I have a list of the things that I believe are the most important, and we'll talk about those at some point. Actually, on the 30th of the month we're going to talk about that.
It requires a seamless integration of people, processes, knowledge, and technology. It's a well-rounded organization. I'm going to talk to you a little bit about what I call my network management servo-loop. Over on the left, you've got a website, let's say. You've got an application, and you want it to perform. Let's just say that you want it to perform in a way that allows that webpage to come up within 15 seconds. So over on the left-hand side, you say: I want the maximum possible performance to accomplish this business objective, and I'm going to set my command, my desired response time, at 15 seconds. That goes into the network management servo-loop. This is a feedback servo-loop mechanism. If you haven't seen this before, you can look it up on the internet and see some others, but I have taken this and adapted it, because it's a feedback loop, a servo-loop, a closed-loop system, where you say: this is what I want and expect from the system. Then you have a resulting actual user experience that you compare to your objective, and you come up with a signal deviation. At the top left there, you have your desired time, 15 seconds. You then have your feedback signal, which, let's say, is 25 seconds. You have a signal deviation of ten seconds. You are not meeting your SLA or your desired performance by ten seconds.
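The deviation computation at the heart of that servo-loop is simple enough to sketch in a few lines; the 15-second target and 25-second measurement are the illustrative figures from the example above:

```python
# Minimal sketch of the servo-loop deviation described above.
# The target and measured values are illustrative, not real data.

def signal_deviation(target_s: float, measured_s: float) -> float:
    """Deviation between actual user experience and the desired response time."""
    return measured_s - target_s

target = 15.0    # desired page-load time set by management
measured = 25.0  # actual user experience fed back by monitoring

deviation = signal_deviation(target, measured)
if deviation > 0:
    print(f"Missing the SLA by {deviation:.0f} seconds")  # prints: Missing the SLA by 10 seconds
```

In a real deployment the measured value would come from a response time monitor, and the resulting deviation drives the executive and technologist actions described next.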
So what do you do? In most environments, executive management thinks all the technologists have it all figured out, and the technologists think that executive management has it all figured out. I've got news for you. I've been working in this industry for twenty-some-odd years. Neither one of them really knows, nor has a systematic approach, until you get them into the servo-loop. Let's get the executives in there. Let's show them some metrics, show them some business metrics.
When you talk to an executive about their business metrics and how they coincide with the network and application metrics, they start listening. You don't lose them in presentations; you have their full, undivided attention, because you're talking about deliverables for the business. I always like to get metrics, feedback sensors. Metrics are a vital part of the business. Performance indicators. End user reports. Right? You get your executives involved, because why? Because executives control the policy, the procedures, the resources, and the entire organization. They need to be involved in this. Okay?
You take a guy like Steve Jobs. He was involved in the process. He understood his company's technology. Then you take another guy like John Sculley, who had been selling sugar water for years, and you put him in there. He's a bottom-line dollars guy: how much can we make? But in technology, you have to have someone involved in the technology. If your executives are not involved in the technology, it's your job as the CIO, or the executive technologist, to get them involved in a meaningful way. The way that you do that is you pop in some business metrics and compare them to some of your technical metrics, and you'll bring them right in. Then they'll get involved and help you solve the problems that you have by allocating resources, priorities, focus, policy, and budget.
Then you've got the actions. The actions are carried out by the technologists and the technology folks. They go out and they apply. They're the tinkerers; they're the guys behind the curtain, changing this ***, changing that ***, optimizing and installing and improving the environment and the system. After executive management puts the controls on there, the technologists apply those priorities. Then we should measure the actual user experience again. We measure that actual user experience, it comes back, okay. Then little by little, between the executive function and the technology functions, we start to move the servo-loop toward the maximum possible performance. We keep tweaking it, but it does involve executives and management, and it does involve the technologists. If the executives are not involved, you won't have the type of priority, prioritization, and resources necessary.
Okay. Now let's talk performance indicators. These are the three main performance indicators that I advocate for. You've got network flows. With network flows, from the left you've got clients, you've got an IP address, and you've got a socket. Those are the application sockets. You know every packet that goes across your network, because every router in the network is reporting on these network flows. You don't need another gizmo to pop in there. Those routers perform the actual task of looking at all those *** connections, from the IP address on the left at the client to the IP address on the right at the server, and the application ports that they're using between the two. Then they tell you and record the rate and flow volume information. I can tell you if an application is or isn't working well, and how many people are using it. Why? Because I have rate and flow information, and it's coming from my routers and going into a database, and I can go back and tell you what applications are on your network. That was very interesting in the war environment, because there were certain applications that were consuming large amounts of bandwidth and others that were consuming hardly any bandwidth. Some of them were being accused of consuming bandwidth that they weren't. So when you really go out and get the facts of who's talking to whom through whom, that's bottom line. That's what NetFlow does. Who is talking to whom through whom?
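As a rough illustration of "who is talking to whom through whom," here is how flow records might be aggregated into per-conversation volume. The record layout here is a hypothetical simplification, not any vendor's actual NetFlow export schema:

```python
# Sketch: aggregating flow records into per-conversation byte totals.
# The tuple layout (client_ip, server_ip, server_port, bytes) is invented
# for illustration; real NetFlow records carry many more fields.
from collections import defaultdict

flows = [
    ("10.1.1.5", "172.16.0.10", 443, 120_000),
    ("10.1.1.5", "172.16.0.10", 443, 80_000),
    ("10.1.2.9", "172.16.0.20", 1433, 500_000),
]

volume = defaultdict(int)
for client, server, port, nbytes in flows:
    volume[(client, server, port)] += nbytes  # who talks to whom, on what port

# Largest conversations first: the applications consuming the bandwidth.
for (client, server, port), total in sorted(volume.items(), key=lambda kv: -kv[1]):
    print(f"{client} -> {server}:{port}  {total} bytes")
```

Sorting the totals is exactly the step that separated the applications actually consuming bandwidth in the AOR from the ones merely accused of it.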
Then you've got rate and volume information, you've got capacity planning, you've got all of those sorts of things available at your fingertips. NetFlow and network flows are a very important thing. They tell you where those applications are running around the world and who is running those applications, by location. And the volume and the rate can be compared to tell you what type of performance they're getting. If you understand the theory, you understand the network management. So again, this is not about turning up a product. This is not about clicking and installing a product. This is not about buying it, installing it, and having it there. It's about interpreting it, it's about looking at it, and it's about architecting it. It's about understanding the applications; it's about understanding the vital signs. And it's about trying to figure out what's going wrong and where, and being able to triangulate that and solve those typical problems.
Okay. So then you've got standard SNMP; that's device status. Standard SNMP just tells you a router is overloaded on CPU, or a circuit is overloaded, or an interface is overloaded. It's very good information. But what if you don't even know the path that your packets are taking? You've got to know the path your packets are taking. If you know your path, then you can exploit device status to find out what devices on your particular path are impacted by memory, by CPU, or by other capacity-limiting characteristics. The first thing is, you've got to know your path. Well, there are not very many tools or capabilities out there at layers two and three. You can get a layer three path, but it's hard to get a layer two path as well.
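The idea of knowing your path first and then exploiting device status along it can be sketched like this; the device names and CPU figures are invented for illustration, and a real system would pull them via SNMP polls rather than a hard-coded dictionary:

```python
# Sketch: given a known layer-3 path, flag devices whose polled status
# exceeds a threshold. Device names and CPU figures are made up.
path = ["edge-rtr-1", "core-rtr-3", "sat-gw-2", "core-rtr-7", "dc-rtr-1"]

# Pretend these came from SNMP polls (CPU utilization, percent).
cpu_util = {"edge-rtr-1": 22, "core-rtr-3": 41, "sat-gw-2": 96,
            "core-rtr-7": 35, "dc-rtr-1": 18}

CPU_THRESHOLD = 85
suspects = [dev for dev in path if cpu_util.get(dev, 0) > CPU_THRESHOLD]
print("Overloaded devices on the path:", suspects)  # ['sat-gw-2']
```

The point of the sketch is the filtering step: out of hundreds or thousands of devices, only the ones actually on your path matter, and only the ones over threshold deserve attention.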
You've basically got the network flows, you've got device status, and then third, response time. Response time monitors come in several different flavors. AuthNet has some really nice stuff. NetQoS, which was acquired by CA, has some nice stuff. There are a couple of other players in the field that I don't have a lot of confidence in quite yet, but it's growing. The performance marketplace is no longer about red, green, up, down. It's about 'it's slow', 'something's wrong', 'we don't know where; it's working, but not very well', and 'it's impacting our end users'. So you put in these response time monitors, which basically capture the packets or listen to the packets as they go into the server. They timestamp each one at the server, and they say this packet's response time took 100 milliseconds.
You can put timers and thresholds on these things, so that if your response time is slow, you can automatically trigger a trace file. You don't have to wait until 2:00 in the morning. You can have your response time monitor watching a bank of servers, and when the response time gets slow it automatically triggers a packet capture, so your technologists can go back in the next day and do retrospective analysis on the packets that were slow. So do we care if everything is fast? Do we want to capture all those packets? No. What do we want? We want to trigger on the events that are poor, and that's what these systems do. They record the response time; it's an awesome system. It also gives you the network round trip, because the [inaudible 26:54] come back and it records that so you can see.
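The trigger logic described above, capture only when response time crosses a threshold, can be sketched as follows; `start_capture` is a placeholder standing in for whatever your monitoring product actually provides, not a real API:

```python
# Sketch: trigger a packet capture only when response time is slow.
# start_capture() is a placeholder, not a real monitoring-product API.
SLOW_THRESHOLD_MS = 500

def start_capture(server: str) -> str:
    # Placeholder: a real monitor would start writing a trace file here.
    return f"trace-{server}.pcap"

def on_response(server: str, response_ms: float, captures: list) -> None:
    if response_ms > SLOW_THRESHOLD_MS:
        captures.append(start_capture(server))  # keep only the bad events

captures = []
for server, rt_ms in [("app-01", 100), ("app-02", 2100), ("app-01", 95)]:
    on_response(server, rt_ms, captures)

print(captures)  # only the slow event triggered a trace
```

Fast transactions generate no traces at all, which is the whole point: the technologists come in the next morning to a short list of captures that each correspond to a poor user experience.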
One of the problems that we had was satellite delay, and multiple router hops adding satellite delay. It's about 700 milliseconds round trip across a satellite. Well, when I see a response time of 2.1 seconds, I know that that 2.1 seconds is because they've gone across three satellite hops, at three times 700 milliseconds. You can figure out and start to surmise what's going on, and then you can look at optimizing your routing.
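That back-of-the-envelope reasoning, dividing the observed delay by the roughly 700 ms round trip per satellite hop, is simple enough to write down:

```python
# Sketch: estimate how many satellite round trips a response time implies,
# using the ~700 ms per-hop figure quoted in the talk.
SAT_RTT_MS = 700

def estimated_sat_hops(response_ms: float) -> int:
    """Rough number of satellite round trips hidden in a response time."""
    return round(response_ms / SAT_RTT_MS)

print(estimated_sat_hops(2100))  # the 2.1 s example works out to 3 hops
```

It is only a surmise, as the talk says, but it tells you immediately whether to go looking at the routing rather than at the server.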
These are your tools and your key issues. You've got your performance management system. You're getting your performance metrics, the feedback, and your business metrics. You're getting your trouble tickets; you're getting your information back. Some of the things that you need to look at are route analytics, to identify instability in routes; server response time; and NetFlow-based rate and volume information, which gives you hints on conditions limiting user experience. Also look at SNMP device status. Well, you've got to know the path in order to know what devices are in that path, because you may have hundreds or thousands of different devices. Knowing what your particular path is, and exploiting that path to see if anything in it is failing, those are value-added steps. Then of course packet capture at key locations, so you can get down to root cause analysis.
Here's a network management architecture diagram. There's a portfolio of tools. You've got Remedy trouble tickets; you've got Opsware, ArcSight. You've got a large number of different systems: Packet Design's route analytics, NetQoS's response time monitoring and NetFlow tools, OPNET's network diagramming capabilities. There's a portfolio. It's a Swiss army knife, all working together. You can't use just one company or one suite of products. You can reduce that set as much as you possibly can, but you always want to make certain that you've got every best-of-breed aspect of your system. This is where we went out and installed systems at 15 different sites across Afghanistan, Iraq, and areas in the United States, so that we could have NetFlow and response time monitoring. This is how long it took us: one year to install. The ability to get this system up and running, and actually do analysis on major applications in fewer than 12 months, inside a war environment, was pretty much unheard of. We had an incredible team down at CENTCOM out there. The war fighters, man, they were working their buns off to help us get all of this stuff in, so we could figure out what was wrong with this biometric application. This biometric application, by the way, takes fingerprints and iris scans, and it looks at latent prints of the guys who were building bombs and that sort of thing, those IEDs. That application was helping put those guys on an alert list.
A war fighter would have a bad guy walk through, or people would be walking through their security station going into Fallujah or into Baghdad or wherever, and they would immediately pick up on the fact that, of the thousands of people coming through, we just found the guy who had a latent print on an IED. This guy is now caught. That's how the surge actually worked, in my opinion: because we had electronic means to zero in on who's a bad guy and who's in proximity and who's not. This application, and the analysis and fixing of this application, really helped the effort over there. We found route changes due to packet loss, and slow server response time. We found problems with the TCP offload engine. TCP offload is where the TCP stack is moved from inside the server, where it uses the server's CPU and memory, off onto the network interface card. There were some problems, and I'll show you a couple of slides here with some examples of this sort of stuff. Of course, this took many weeks and months of analysis after we had all this stuff put together. But we found all these problems and started mitigating them, and things improved. And that application I was telling you about performed even more effectively.
Across the network there are quality of service incongruities, and network and application issues requiring packet-level capture; we analyzed a lot of stuff. Here are a few screenshots of some of it. Here's some route change analytics going on. Here's some satellite retransmission; it took three and a half seconds to retransmit across many of those satellite circuits. This is a detailed analysis of that sort of stuff. Here's a TCP offload engine recovery issue. These are all commercial, off-the-shelf systems. In the U.S. or other parts of the world, in a low-latency environment, these particular systems are not found wanting. But you put them in a satellite environment with high latency, and manifestations of severe problems start to show up that wouldn't show up otherwise. We found TCP offload engine problems. We also found WAN optimizers and load balancers and other things doing weird and funny things. I called them our own men-in-the-middle. I developed a method by which we could identify that man-in-the-middle, and this is some of the analysis associated with that.
Then there's processing analysis: how long does it take? Over on the left, you see processing, processing, processing. That was before we optimized the application. Over on the right, the same transaction took under two seconds instead of 18 seconds. You add those types of optimizations up and you've got a lot of optimization. Then there was data duplication. What does data duplication mean? Data duplication is when you have a problem with that TCP offload engine where you're sending the same data multiple times. This is an example of where the data is being sent across the network. So not only did you have clogged pipes, not only did you have networks that were saturated, but now you're sending multiple copies of the same data. Here is an example of that packet-loss-induced TCP offload problem and the wasted bandwidth associated with it. You'll see a red line and a green line. The red line is the wasted bandwidth; the green line is what was normally required. But in this particular environment, the TCP offload engine, with the packet loss that triggered it or was the catalyst, produced all of that waste.
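Detecting the kind of data duplication described above amounts to noticing TCP sequence ranges that cross the wire more than once. Here is a simplified sketch; the sequence numbers and segment sizes are invented, and a real analysis would work from a captured trace and handle partially overlapping ranges:

```python
# Sketch: count bytes sent more than once by tracking TCP sequence ranges.
# Packets are (sequence_number, length) pairs; the values are invented.
packets = [(1000, 500), (1500, 500), (1000, 500), (1000, 500), (2000, 500)]

seen = set()
wasted = 0
for seq, length in packets:
    span = (seq, length)
    if span in seen:
        wasted += length  # the same data crossing the network again
    else:
        seen.add(span)

print(f"wasted bytes: {wasted}")  # two duplicates of a 500-byte segment
```

The `wasted` total is what the red line on that slide represents: traffic the saturated links carried that delivered no new data to the application.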
See, if you have no packet loss, a lot of these problems don't manifest. When you do have packet loss, you end up with three-and-a-half-second response times. You end up with data duplication. There are a lot of things that are exacerbated, because in a war environment you are not dealing with perfect commercial communication systems. So, bottom line: multi-tier identification of your environment. You have to know and document your network, and then you set your test points up: your front-end tier, your middleware tier, your SQL tiers. You have to know how to place and design your network management, your response time monitoring, and your NetFlow systems. In order to do that, you have to document very effectively. Then you can instrument your front-end tier, your middleware tier, your back-end tier, and your mainframe tiers. You can then take and instrument your systems.
This is basically showing test point two, where we pulled the traffic over to a super-agent to do response time monitoring, with the ability to capture packets like I talked to you about: when response times get slow, it automatically captures packets at that location. Pull it all together, and what do you have? A user clicks on their screen at the top left. You see where it says 'user click'? Then it comes down. Over on the right, the different colors, red, green, blue, represent processing times, network serialization, and transport and switching queue times. Then, of course, security authentication: how long does it take to authenticate the user to be able to access that data? If you take a gander up at the top left, he clicked, and at the bottom right the user finally gets the information. Did that take 15 seconds or 25 seconds or 71 seconds, or what have you? Then you take it to the next step, and you bring it down to the various tiers. The process, where is it? We fixed the process problem in the application, reducing the response time. This is an example of what some of those response time monitors looked like over there. You've got your web tier monitored at the same time as your app tier, your SQL tier, and your mainframe tier. We found the process at the app tier that was causing a lot of issues, and we helped resolve those.
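The decomposition on that slide, total user wait as the sum of per-tier and network components, is just an additive budget. The component figures below are invented for illustration, not measurements from the slide:

```python
# Sketch: decompose end-to-end response time into additive components
# and find the biggest contributor. Figures are illustrative only.
components_ms = {
    "client processing": 300,
    "network serialization": 450,
    "transport + queuing": 1400,
    "authentication": 650,
    "app-tier processing": 9200,
    "SQL tier": 3000,
}

total_ms = sum(components_ms.values())
worst = max(components_ms, key=components_ms.get)
print(f"total {total_ms / 1000:.1f} s; biggest contributor: {worst}")
```

Laying the budget out this way is what pointed the CENTCOM analysis at the app tier: whichever component dominates the sum is where you fix the process.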
I know I'm going a couple minutes over today, but I just wanted to finish this out. The bottom line is, in order to find that needle in a haystack, in order to make certain that your users are productive, you have to design your systems not necessarily to find problems, but to nurture and build a system whereby you are obviating the problem. In other words, you are fixing the problem because you have such good intelligence on the information. In 'The Art of War,' the first thing that's said is: know yourself. Know yourself. Document. Know yourself, monitor yourself. Know where you are, so that when you do have a problem, you can go over and find that needle in a haystack using scientific, automated capabilities. The future of network-centric warfare is dependent upon having these types of capabilities.
Just buying a bunch of stuff and installing it is not what is needed. You have to have architects who are looking at this, who understand the big picture, and who can solve problems in all manners, from client to server and all points between. Anyway, if there's anything we can do to help you, we know how to do it, and we're at your service. It was good being with you today. This is Bill Alderson, signing off. Appreciate it.