Hi and good afternoon. This is Bill Alderson, thank you for joining us. I'm going to do a presentation on Network Management in the Theater of War. This is a presentation that I did at the MilCIS conference last year in Australia. It was a lot of fun going Down Under and talking to some of our military counterparts down there about network management. Had a good time! Anyway, I appreciate you joining us today.
So a little bit about me: I consider myself an application and network performance analysis advocate. I know that end users are the most important folks in my career, and I focus on working toward their benefit in everything that I do. I started out at Lockheed in Sunnyvale, California, the heart of Silicon Valley, and was there from 1978 to 1984 as a communications analyst. I was a young guy who loved communications and was playing with every kind of computer that was coming out at that time: IBM, Apple, others.
I was right in the middle of it. It was a lot of fun. A little later on, I got working with Network General and the Network General Sniffer, at the start-up. Then I took my experience with the Sniffer, founded Pine Mountain Group, and created Network General Sniffer Training, which Network General licensed from me in perpetuity. At Pine Mountain Group we trained about 50,000 people in 22 countries and certified over 3,000 network forensic professionals.
I've done a lot of work with 75 of the Fortune 100 and with federal and state government agencies. I did a large event for NetWorld+Interop called Network Forensics Day. I sold Pine Mountain Group to NetQoS in 2005 and continued on as the Technology Consulting Officer. CA then acquired NetQoS, and I hung around as principal services architect. Then just this last year I founded Apalytics Corporation. I've got several customers at Apalytics. We did a subcontract to U.S. CENTCOM, where we went and did analysis of pretty much the entire war area.
That's what I'm talking about, and some of the things I learned through that experience of looking through every network control center throughout CENTCOM's AOR, in all those various countries, then working with those systems and helping to upgrade, architect, and document them so we could perform better network management. I'm also a consultant to the OSD-CIO at the Pentagon in the area of communications, and I provide services to the Department of Justice. I've got a pretty wide variety of experience. The talk today is about network management, but in order to understand network management, one must understand everything from end to end.
From the client, where you've got the user, the business transactions, the security access, the application interface, the operating system, the computing platform, there's the OSI model: the seven-layer model that brings it down to where you can take all of your data across disparate networks and systems, have it arrive at the server, and get it error checked and that sort of thing.
We're basically looking at the full spectrum of everything end to end, from client to server. Well, one of the things in the war area is that there are a lot of different people and a lot of different organizations that are in charge of different packets, from where the war fighter puts the packet on the wire until it gets to the server.
So consequently, in traversing all of these different network management zones and different countries and continents, going up to space and back, these packets are really mistreated. Everyone along the path has their own SLAs and their own network management systems, and they're all different. By golly, if you go through and you talk to anybody, all the contractors and everybody involved, everybody did a wonderful job and all met their SLAs. Everybody inside those silos is very happy, with one exception: the end user.
The end user is going to cross all these dissimilar systems, and when he's got a problem, where do you go to capture the packet if it doesn't get there? Who do you blame? Who do you call up, who do you talk to? The end-to-end analysis that I've done from all those areas back to CONUS, for different applications and that sort of thing, basically said, "Hey man, if you're an end user, you have no advocate. You have no centralized advocate." It's very, very difficult. You can call your local guy and he'll do local things, but he always comes back and says, "Must be somewhere else." Now your end user is in trouble. So those end users are usually the ones that contact me. Powerful, high-visibility, high-stakes end users usually call me up and say, "Hey." That's exactly what the military did. They called me up to have me help with a big application, and we went out and looked at all their network management and said, "Well, you don't know where any of your packets are going. You need some NetFlow information."
We implemented NetFlow across the entire AOR so we could see what applications were running and where. One biometric application in particular was very interesting; it was awesome to analyze. I helped the programmers actually recode several parts of their application to improve performance and that sort of thing. It was a really great thing. So network management does not just involve purchasing some network management tool, turning it on, and installing it. It involves understanding the clients that are accessing the applications that are going through it, so you can fine-tune it and nurture it, so you can find the signatures and the vital signs of those applications. When somebody tells me about network management, "Oh yeah, I'm a network management guy," well, do you know applications? Do you know the issues on the client? Do you know the issues on the server?
Do you see the virtualization issues? Do you see all the men-in-the-middle, all the firewalls, all the WAN optimizers? Do you see all the load balancers and that sort of thing? Do you understand how those things come into practice? Do you understand the quality of service across that end-to-end network? "Well, I know everything there is to know about this network management package. I can install it, and I've installed it." Yeah, but in order to install and use these systems, you must intrinsically understand what it is you're trying to help the clients achieve, and how to measure the performance of those systems all along the way. You've got to document your systems.
After we get off this slide, I'm going to talk to you about a number of things. We've got a lot of slides to cover today, but we're going to get there very quickly and give you a cursory overview. Here's your basic client to server: the client is separated from the server. We started doing that back in the late 80s. You take your client and your applications, you run them across various stacks, using HTTP or Java. You go across your transport network, you come up the other side, you go into your server. You pop back up your stack and into your respective upper-layer system. Then you go into your applications and processes on your server.
You have to draw this out. You have to know where your connection points are. You have to know where you are in the universe. If you're having a problem somewhere, you need to know exactly how you're going to hook in, where you're going to hook in, what kind of tool you're going to use, so that you can deconstruct the problem.
Now, people are going into the cloud, and that'll be another topic that we talk about. Cloud computing can be very problematic, because with the client and the server you own all of the infrastructure, but when you put that server over in the cloud you have no access to packets, network management information, statistics, etc., other than what that vendor gives you. It's going to be a cloud all right. So basically, I'm trying to help you understand that when you go in to analyze your environment, you need to know it from end to end.
Here's a client talking to an app server or an HTTP server. The back end is talking SQL or some other type of middleware or back-end process. Well, you have to know where to go to connect your network management systems for packet capture, for metrics, for analysis, so that you can understand. When anything goes wrong, which it will at some point or another, you'll need macro information about the use of the network, the use of the platform, the CPU, etc., of all the different systems. You need to see the entire system from end to end and understand all the various components, so that you can basically design the vital signs for the applications and systems. It's not as simple as just saying, okay, I'm monitoring layer three of the network. You have to be a little bit higher than that; you have to help your customers. That's my definition of network management.
Now, typically at one time or another, whether during implementation or later on, you're going to have a problem. You're going to be looking for that proverbial needle in a haystack. Here's your needle in a haystack. It's out there, it's somewhere; let's take a look where it might be. Hmm. It could be anywhere in Afghanistan, Iraq, or some point in between, or over somewhere in the United States, or down at Fort Huachuca. Your packets are traversing all of these various environments. Where is it? What's slowing it down, who's holding it up? Who's routing it? Who's misrouting it? Who's redirecting it? Who's changed its path because of a WAN optimizer or load balancer that you were unaware of? Those are your men-in-the-middle, and they're sitting out there impacting your environment. So you need to know your environment.
Then every once in a while you've got to gather some packet traces. You've got to gather them from somewhere between the user and their resultant service, so it could be anywhere. We usually gather packet traces based upon the macro information that we get from our network management system, which tells us and helps us triangulate where the problem might be. That's what we use network management for; then we'll go out and capture some traces and get down to the bottom line. I'll just tell you right here, and you can all argue with me, you can fight with me: if you are a network management person and you are implementing large-scale systems, and you have no ability to capture packets across your path, you are impotent. You can't do it, and you never will be able to do it. I'm telling you, if you don't have bottom-line, root-cause, deep packet analysis skills and capabilities, and the ability to capture those packets, you are impotent and you cannot do your job. Sooner or later you're going to come up against a problem that absolutely requires deep packet captures. You design your network management systems to help you find the area, and then zero in on it. Can you solve many problems with network management systems and capacity? Absolutely. Can you solve every one? No. It's not a very pretty picture when you're standing around with several million dollars' worth of network management tools giving you charts and graphs and all sorts of wonderful views of stuff, and you can't solve the problem.
It's because you don't have deep packet capture capability. You don't have the root cause analysis capabilities. That's what I've focused my whole career on: trained 50,000, and executed for 75 of the Fortune 100, to help them solve these types of problems. Then some people say, "I like packets so much I'm going to capture every single one and I'm going to store them forever." Then I ask, "Okay. You're going to store every packet everywhere? You've got unlimited amounts of money and resources and storage? Who's going to analyze those packets? Who's qualified within your organization? How are you going to filter in, and have the worst offending problems bubble up? What's your strategy for using these systems?" I know a lot of folks have a lot of this sort of stuff up in their environment, and they have all these tools and capabilities, but nobody ever goes and looks at them. "Oh, but they're there in case we need to do retrospectives."
That's good, and I'm not saying you shouldn't have any of it, but I'm saying that you'd better make certain, if you're going to buy this stuff, turn it on, and start caching packets all over the world, that you at least have somebody around who can help you take a look at them and know when and why. So when you finally get down to the stack with the problem, now you go to work in deep packet analysis, to deconstruct it and ultimately, boom, find the needle in the haystack. That's what it's all about: getting to the bottom line. Definitive results. Performance orientation. Getting the job done. No ifs, ands, wells, or buts about the situation. In order to get all that done, we use ITIL, we use people and processes, paradigms, tools, systems, platforms. We use all of these various systems. It's not just network management. It's not just deep packet analysis; it's not just application optimization. We have a whole portfolio of capabilities. As I'll talk about in a future session, I have a list of the things that I believe are the most important, and we'll talk about those at some point. Actually, on the 30th of the month we're going to talk about that.
It requires a seamless integration of people, processes, knowledge, and technology. It's a well-rounded organization. I'm going to talk to you a little bit about what I call my network management servo-loop. Over on the left, you've got a website, let's say. You've got an application, and you want it to perform. Let's just say that you want it to perform in a way that allows that webpage to come up within 15 seconds. So over on the left-hand side, you say: I want the maximum possible performance to accomplish this business objective, and I'm going to set my command, my desired response time, at 15 seconds. That goes into the network management servo-loop. This is a feedback servo-loop mechanism. If you haven't seen this before, you can look it up on the internet and see some others, but I have taken this and adapted it, because it's a feedback loop, a servo-loop, a closed-loop system, where you say: this is what I want and expect from the system. Then you have a resulting actual user experience that you compare to your objective, and you come up with a signal deviation. At the top left there, you have your desired time, 15 seconds. You then have your feedback signal, which, let's say, is 25 seconds. You have a signal deviation of ten seconds. You are not meeting your SLA or your desired performance by ten seconds.
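The deviation computation at the heart of that servo-loop is simple enough to sketch in a few lines; the 15-second target and 25-second measurement are the illustrative figures from the example above:

```python
# Minimal sketch of the servo-loop deviation described above.
# The target and measured values are illustrative, not real data.

def signal_deviation(target_s: float, measured_s: float) -> float:
    """Deviation between actual user experience and the desired response time."""
    return measured_s - target_s

target = 15.0    # desired page-load time set by management
measured = 25.0  # actual user experience fed back by monitoring

deviation = signal_deviation(target, measured)
if deviation > 0:
    print(f"Missing the SLA by {deviation:.0f} seconds")  # prints: Missing the SLA by 10 seconds
```

In a real deployment the measured value would come from a response time monitor, and the resulting deviation drives the executive and technologist actions described next.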
So what do you do? In most environments, executive management thinks all the technologists have it all figured out, and the technologists think that executive management has it all figured out. I've got news for you. I've been working in this industry for twenty-some-odd years. Neither one of them really knows, nor has a systematic approach, until you get them into the servo-loop. Let's get the executives in there. Let's show them some metrics, show them some business metrics.
When you talk to an executive about their business metrics and how they coincide with the network and application metrics, they start listening. You don't lose them in presentations; you have their full, undivided attention, because you're talking about deliverables for the business. I always like to get metrics, feedback sensors. Metrics are a vital part of the business. Performance indicators. End user reports. Right? You get your executives involved, because why? Because executives control the policy, the procedures, the resources, and the entire organization. They need to be involved in this. Okay?
You take a guy like Steve Jobs. He was involved in the process. He understood his company's technology. Then you take another guy like John Sculley, who had been selling sugar water for years, and you put him in there. He's a bottom-line dollars guy: how much can we make? But in technology, you have to have someone involved in the technology. If your executives are not involved in the technology, it's your job as the CIO, or the executive technologist, to get them involved in a meaningful way. The way that you do that is you pop in some business metrics and compare them to some of your technical metrics, and you'll bring them right in. Then they'll get involved and help you solve the problems that you have by allocating resources, priorities, focus, policy, and budget.
Then you've got the actions. The actions are carried out by the technologists and the technology folks. They go out and they apply. They're the tinkerers; they're the guys behind the curtain, changing this ***, changing that ***, optimizing and installing and improving the environment and the system. After executive management puts the controls on there, the technologists apply those priorities. Then we should measure the actual user experience again. We measure that actual user experience, it comes back, okay. Then little by little, between the executive function and the technology functions, we start to move the servo-loop toward the maximum possible performance. We keep tweaking it, but it does involve executives and management, and it does involve the technologists. If the executives are not involved, you won't have the type of priority, prioritization, and resources necessary.
Okay. Now let's talk performance indicators. These are the three main performance indicators that I advocate for. You've got network flows. With network flows, from the left you've got clients, you've got an IP address, and you've got a socket. Those are the application sockets. You know every packet that goes across your network, because every router in the network is reporting on these network flows. You don't need another gizmo to pop in there. Those routers perform the actual task of looking at all those *** connections, from the IP address on the left at the client to the IP address on the right at the server, and the application ports that they're using between the two. Then they tell you and record the rate and flow volume information. I can tell you if an application is or isn't working well, and how many people are using it. Why? Because I have rate and flow information, and it's coming from my routers and going into a database, and I can go back and tell you what applications are on your network. That was very interesting in the war environment, because there were certain applications that were consuming large amounts of bandwidth and others that were consuming hardly any bandwidth. Some of them were being accused of consuming bandwidth that they weren't. So when you really go out and get the facts of who's talking to whom through whom, that's bottom line. That's what NetFlow does. Who is talking to whom through whom?
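As a rough illustration of "who is talking to whom through whom," here is how flow records might be aggregated into per-conversation volume. The record layout here is a hypothetical simplification, not any vendor's actual NetFlow export schema:

```python
# Sketch: aggregating flow records into per-conversation byte totals.
# The tuple layout (client_ip, server_ip, server_port, bytes) is invented
# for illustration; real NetFlow records carry many more fields.
from collections import defaultdict

flows = [
    ("10.1.1.5", "172.16.0.10", 443, 120_000),
    ("10.1.1.5", "172.16.0.10", 443, 80_000),
    ("10.1.2.9", "172.16.0.20", 1433, 500_000),
]

volume = defaultdict(int)
for client, server, port, nbytes in flows:
    volume[(client, server, port)] += nbytes  # who talks to whom, on what port

# Largest conversations first: the applications consuming the bandwidth.
for (client, server, port), total in sorted(volume.items(), key=lambda kv: -kv[1]):
    print(f"{client} -> {server}:{port}  {total} bytes")
```

Sorting the totals is exactly the step that separated the applications actually consuming bandwidth in the AOR from the ones merely accused of it.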
Then you've got rate and volume information, you've got capacity planning, you've got all of those sorts of things available at your fingertips. NetFlow and network flows are a very important thing. They tell you where those applications are running around the world and who is running those applications, by location. And the volume and the rate can be compared to tell you what type of performance they're getting. If you understand the theory, you understand the network management. So again, this is not about turning up a product. This is not about clicking and installing a product. This is not about buying it, installing it, and having it there. It's about interpreting it, it's about looking at it, and it's about architecting it. It's about understanding the applications; it's about understanding the vital signs. And it's about trying to figure out what's going wrong and where, and being able to triangulate that and solve those typical problems.
Okay. So then you've got standard SNMP; that's device status. Standard SNMP just tells you a router is overloaded on CPU, or a circuit is overloaded, or an interface is overloaded. It's very good information. But what if you don't even know the path that your packets are taking? You've got to know the path your packets are taking. If you know your path, then you can exploit device status to find out what devices on your particular path are impacted by memory, by CPU, or by other capacity-limiting characteristics. The first thing is, you've got to know your path. Well, there are not very many tools or capabilities out there at layers two and three. You can get a layer three path, but it's hard to get a layer two path as well.
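The idea of knowing your path first and then exploiting device status along it can be sketched like this; the device names and CPU figures are invented for illustration, and a real system would pull them via SNMP polls rather than a hard-coded dictionary:

```python
# Sketch: given a known layer-3 path, flag devices whose polled status
# exceeds a threshold. Device names and CPU figures are made up.
path = ["edge-rtr-1", "core-rtr-3", "sat-gw-2", "core-rtr-7", "dc-rtr-1"]

# Pretend these came from SNMP polls (CPU utilization, percent).
cpu_util = {"edge-rtr-1": 22, "core-rtr-3": 41, "sat-gw-2": 96,
            "core-rtr-7": 35, "dc-rtr-1": 18}

CPU_THRESHOLD = 85
suspects = [dev for dev in path if cpu_util.get(dev, 0) > CPU_THRESHOLD]
print("Overloaded devices on the path:", suspects)  # ['sat-gw-2']
```

The point of the sketch is the filtering step: out of hundreds or thousands of devices, only the ones actually on your path matter, and only the ones over threshold deserve attention.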
You've basically got the network flows, you've got device status, and then third, response time. Response time monitors come in several different flavors. AuthNet has some really nice stuff. NetQoS, which was acquired by CA, has some nice stuff. There are a couple of other players in the field that I don't have a lot of confidence in quite yet, but it's growing. The performance marketplace is no longer about red, green, up, down. It's about 'it's slow', 'something's wrong', 'we don't know where; it's working, but not very well', and 'it's impacting our end users'. So you put in these response time monitors, which basically capture the packets or listen to the packets as they go into the server. They timestamp each one at the server, and they say this packet's response time took 100 milliseconds.
You can put timers and thresholds on these things, so that if your response time is slow, you can automatically trigger a trace file. You don't have to wait until 2:00 in the morning. You can have your response time monitor watching a bank of servers, and when the response time gets slow it automatically triggers a packet capture, so your technologists can go back in the next day and do retrospective analysis on the packets that were slow. So do we care if everything is fast? Do we want to capture all those packets? No. What do we want? We want to trigger on the events that are poor, and that's what these systems do. They record the response time; it's an awesome system. It also gives you the network round trip, because the [inaudible 26:54] come back and it records that so you can see.
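The trigger logic described above, capture only when response time crosses a threshold, can be sketched as follows; `start_capture` is a placeholder standing in for whatever your monitoring product actually provides, not a real API:

```python
# Sketch: trigger a packet capture only when response time is slow.
# start_capture() is a placeholder, not a real monitoring-product API.
SLOW_THRESHOLD_MS = 500

def start_capture(server: str) -> str:
    # Placeholder: a real monitor would start writing a trace file here.
    return f"trace-{server}.pcap"

def on_response(server: str, response_ms: float, captures: list) -> None:
    if response_ms > SLOW_THRESHOLD_MS:
        captures.append(start_capture(server))  # keep only the bad events

captures = []
for server, rt_ms in [("app-01", 100), ("app-02", 2100), ("app-01", 95)]:
    on_response(server, rt_ms, captures)

print(captures)  # only the slow event triggered a trace
```

Fast transactions generate no traces at all, which is the whole point: the technologists come in the next morning to a short list of captures that each correspond to a poor user experience.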
One of the problems that we had was satellite delay, and multiple router hops adding satellite delay. It's about 700 milliseconds round trip across a satellite. Well, when I see a response time of 2.1 seconds, I know that that 2.1 seconds is because they've gone across three satellite hops, at three times 700 milliseconds. You can figure out and start to surmise what's going on, and then you can look at optimizing your routing.
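That back-of-the-envelope reasoning, dividing the observed delay by the roughly 700 ms round trip per satellite hop, is simple enough to write down:

```python
# Sketch: estimate how many satellite round trips a response time implies,
# using the ~700 ms per-hop figure quoted in the talk.
SAT_RTT_MS = 700

def estimated_sat_hops(response_ms: float) -> int:
    """Rough number of satellite round trips hidden in a response time."""
    return round(response_ms / SAT_RTT_MS)

print(estimated_sat_hops(2100))  # the 2.1 s example works out to 3 hops
```

It is only a surmise, as the talk says, but it tells you immediately whether to go looking at the routing rather than at the server.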
These are your tools and your key issues. You've got your performance management system. You're getting your performance metrics, the feedback, and your business metrics. You're getting your trouble tickets; you're getting your information back. Some of the things that you need to look at are route analytics, to identify instability in routes; server response time; and NetFlow-based rate and volume information, which gives you hints on conditions limiting user experience. Also look at SNMP device status. Well, you've got to know the path in order to know what devices are in that path, because you may have hundreds or thousands of different devices. Knowing what your particular path is, and exploiting that path to see if anything in it is failing, those are value-added steps. Then of course packet capture at key locations, so you can get down to root cause analysis.
Here's a network management architecture diagram. There's a portfolio of tools. You've got Remedy trouble tickets; you've got Opsware, ArcSight. You've got a large number of different systems: Packet Design's route analytics, NetQoS's response time monitoring and NetFlow tools, OPNET's network diagramming capabilities. There's a portfolio. It's a Swiss army knife, all working together. You can't use just one company or one suite of products. You can reduce that set as much as you possibly can, but you always want to make certain that you've got every best-of-breed aspect of your system. This is where we went out and installed systems at 15 different sites across Afghanistan, Iraq, and areas in the United States, so that we could have NetFlow and response time monitoring. This is how long it took us: one year to install. The ability to get this system up and running, and actually do analysis on major applications in fewer than 12 months, inside a war environment, was pretty much unheard of. We had an incredible team down at CENTCOM out there. The war fighters, man, they were working their buns off to help us get all of this stuff in, so we could figure out what was wrong with this biometric application. This biometric application, by the way, takes fingerprints and iris scans, and it looks at latent prints of the guys who were building bombs and that sort of thing, those IEDs. That application was helping put those guys on an alert list.
A war fighter would have a bad guy walk through, or people would be walking through their security station going into Fallujah or into Baghdad or wherever, and they would immediately pick up on the fact that, of the thousands of people coming through, we just found the guy who had a latent print on an IED. This guy is now caught. That's how the surge actually worked, in my opinion: because we had electronic means to zero in on who's a bad guy and who's in proximity and who's not. This application, and the analysis and fixing of this application, really helped the effort over there. We found route changes due to packet loss, and slow server response time. We found problems with the TCP offload engine. TCP offload is where the TCP stack is moved from inside the server, where it uses the server's CPU and memory, off onto the network interface card. There were some problems, and I'll show you a couple of slides here with some examples of this sort of stuff. Of course, this took many weeks and months of analysis after we had all this stuff put together. But we found all these problems and started mitigating them, and things improved. And that application I was telling you about performed even more effectively.
Across the network there are quality of service incongruities, and network and application issues requiring packet-level capture; we analyzed a lot of stuff. Here are a few screenshots of some of it. Here's some route change analytics going on. Here's some satellite retransmission; it took three and a half seconds to retransmit across many of those satellite circuits. This is a detailed analysis of that sort of stuff. Here's a TCP offload engine recovery issue. These are all commercial, off-the-shelf systems. In the U.S. or other parts of the world, in a low-latency environment, these particular systems are not found wanting. But you put them in a satellite environment with high latency, and manifestations of severe problems start to show up that wouldn't show up otherwise. We found TCP offload engine problems. We also found WAN optimizers and load balancers and other things doing weird and funny things. I called them our own men-in-the-middle. I developed a method by which we could identify that man-in-the-middle, and this is some of the analysis associated with that.
Then there's processing analysis: how long does it take? Over on the left, you see processing, processing, processing. That was before we optimized the application. Over on the right, the same transaction took under two seconds instead of 18 seconds. You add those types of optimizations up and you've got a lot of optimization. Then there was data duplication. What does data duplication mean? Data duplication is when you have a problem with that TCP offload engine where you're sending the same data multiple times. This is an example of where the data is being sent across the network. So not only did you have clogged pipes, not only did you have networks that were saturated, but now you're sending multiple copies of the same data. Here is an example of that packet-loss-induced TCP offload problem and the wasted bandwidth associated with it. You'll see a red line and a green line. The red line is the wasted bandwidth; the green line is what was normally required. But in this particular environment, the TCP offload engine, with the packet loss that triggered it or was the catalyst, produced all of that waste.
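Detecting the kind of data duplication described above amounts to noticing TCP sequence ranges that cross the wire more than once. Here is a simplified sketch; the sequence numbers and segment sizes are invented, and a real analysis would work from a captured trace and handle partially overlapping ranges:

```python
# Sketch: count bytes sent more than once by tracking TCP sequence ranges.
# Packets are (sequence_number, length) pairs; the values are invented.
packets = [(1000, 500), (1500, 500), (1000, 500), (1000, 500), (2000, 500)]

seen = set()
wasted = 0
for seq, length in packets:
    span = (seq, length)
    if span in seen:
        wasted += length  # the same data crossing the network again
    else:
        seen.add(span)

print(f"wasted bytes: {wasted}")  # two duplicates of a 500-byte segment
```

The `wasted` total is what the red line on that slide represents: traffic the saturated links carried that delivered no new data to the application.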
See, if you have no packet loss, a lot of these problems don't manifest. When you do have packet loss, you end up with three-and-a-half-second response times. You end up with data duplication. There are a lot of things that are exacerbated, because in a war environment you are not dealing with perfect commercial communication systems. So, bottom line: multi-tier identification of your environment. You have to know and document your network, and then you set your test points up: your front-end tier, your middleware tier, your SQL tiers. You have to know how to place and design your network management, your response time monitoring, and your NetFlow systems. In order to do that, you have to document very effectively. Then you can instrument your front-end tier, your middleware tier, your back-end tier, and your mainframe tiers. You can then take and instrument your systems.
This is basically showing test point two, where we pulled the traffic over to a super-agent to do response time monitoring, with the ability to capture packets like I talked to you about: when response times get slow, it automatically captures packets at that location. Pull it all together, and what do you have? A user clicks on their screen at the top left. You see where it says 'user click'? Then it comes down. Over on the right, the different colors, red, green, blue, represent processing times, network serialization, and transport and switching queue times. Then, of course, security authentication: how long does it take to authenticate the user to be able to access that data? If you take a gander up at the top left, he clicked, and at the bottom right the user finally gets the information. Did that take 15 seconds or 25 seconds or 71 seconds, or what have you? Then you take it to the next step, and you bring it down to the various tiers. The process, where is it? We fixed the process problem in the application, reducing the response time. This is an example of what some of those response time monitors looked like over there. You've got your web tier monitored at the same time as your app tier, your SQL tier, and your mainframe tier. We found the process at the app tier that was causing a lot of issues, and we helped resolve those.
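The decomposition on that slide, total user wait as the sum of per-tier and network components, is just an additive budget. The component figures below are invented for illustration, not measurements from the slide:

```python
# Sketch: decompose end-to-end response time into additive components
# and find the biggest contributor. Figures are illustrative only.
components_ms = {
    "client processing": 300,
    "network serialization": 450,
    "transport + queuing": 1400,
    "authentication": 650,
    "app-tier processing": 9200,
    "SQL tier": 3000,
}

total_ms = sum(components_ms.values())
worst = max(components_ms, key=components_ms.get)
print(f"total {total_ms / 1000:.1f} s; biggest contributor: {worst}")
```

Laying the budget out this way is what pointed the CENTCOM analysis at the app tier: whichever component dominates the sum is where you fix the process.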
I know I'm going a couple minutes over today, but I just wanted to finish this out. The bottom line is, in order to find that needle in a haystack, in order to make certain that your users are productive, you have to design your systems not necessarily to find problems, but to nurture and build a system whereby you are obviating the problem. In other words, you are fixing the problem because you have such good intelligence on the information. In 'The Art of War,' the first thing that's said is: know yourself. Know yourself. Document. Know yourself, monitor yourself. Know where you are, so that when you do have a problem, you can go over and find that needle in a haystack using scientific, automated capabilities. The future of network-centric warfare is dependent upon having these types of capabilities.
Just buying a bunch of stuff and installing it is not what is needed. You have to have architects who are looking at this, who understand the big picture, and who can solve problems in all manners, from client to server and all points between. Anyway, if there's anything we can do to help you, we know how to do it, and we're at your service. It was good being with you today. This is Bill Alderson, signing off. Appreciate it.