Network Management Operations - SolarWinds Certified Professional

Network management operations are an important part of the SCP program. This training video will introduce you to network management operations and explain the requirements of the SCP program on this subject. Within this section, you should expect to find things like device management, capacity planning, and service level agreement management. Enabling devices, and enabling the network itself, for management is an important part of managing any network. One consideration to keep in mind when enabling these devices for management by your NMS is, for instance, the network management protocols that you will be using. And not only do you have to think about which protocols, meaning SNMP vs. ICMP vs. WMI, you may also have to think about which versions and types of those protocols, meaning SNMP version 1 vs. version 2 vs. version 3. Now in the detailed videos for network management protocols, we go over each protocol in depth, so we won't do that again here. But you want to pay attention to that. You also want to pay attention to the options certain network management protocols have available. For instance, with NetFlow, you have the option of using traditional NetFlow, as in NetFlow version 5, vs. Flexible NetFlow, which is implemented with version 9 and with IPFIX. Something else to think about when it comes to enabling devices and enabling the network for management is event severity. You need to think about which types of events you want to send traps on and which levels of logging you want to enable, so that you can control how detailed your syslog messages are and how many messages you receive. Now obviously if you set the most detailed level of syslog logging, then you receive a tremendous number of messages, which can require a lot of disk space for storage and can place a lot of load on your NMS, so keep that in mind.
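To make the point about message volume concrete, here is a minimal Python sketch. The severity numbers and names follow the standard syslog definition (RFC 5424); the sample messages and the function name `messages_logged` are made up for illustration and are not part of any SolarWinds product.

```python
# Standard syslog severities: 0 is the most severe, 7 is the most detailed.
SEVERITIES = {
    0: "emergency",
    1: "alert",
    2: "critical",
    3: "error",
    4: "warning",
    5: "notice",
    6: "informational",
    7: "debug",
}


def messages_logged(messages, logging_level):
    """Return the messages a device would send at the given logging level.

    A device configured for level N sends every message whose severity
    number is N or lower, so a higher level means more messages.
    """
    return [m for m in messages if m[0] <= logging_level]


# Hypothetical sample of (severity, text) pairs, for illustration only.
sample = [
    (2, "power supply failure"),
    (3, "interface down"),
    (5, "configuration changed"),
    (6, "user logged in"),
    (7, "routing update trace"),
    (7, "packet trace"),
]

for level in (3, 5, 7):
    count = len(messages_logged(sample, level))
    print(f"logging level {level} ({SEVERITIES[level]}): {count} messages")
```

Even in this tiny sample, moving from the error level to the debug level triples the number of messages the NMS has to receive and store.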
Now with syslog, for instance, there are eight specific severity levels, numbered 0 through 7, and you want to understand what those are and what the most detailed level is. Access lists and firewalls are another thing you have to know about when it comes to enabling the devices and the network for management. Not only do most devices require that you update an access list or filter on the device itself to permit network management traffic, you also have to think about allowing those protocols through the network so that the traffic is not blocked en route to the device you are trying to poll. Another thing to keep in mind is that you will have to think about your capacity requirements and how to determine those when it comes to scoping and scaling your NMS. You will need to know which statistics mean what. In other words, whenever you have a circuit that is overloaded, you will want to check the interface statistics: look at bandwidth utilization, look at the bit rates in terms of bits per second, and look at any errors and discards on the interface. However, when a device is being overloaded, you want to look at statistics like CPU, memory, and buffer usage. Understanding which types of statistics to look for when you're doing capacity planning is really important. Now you also need to be able to understand how the statistics work and what they are used for. For example, SNMP tells you how much traffic is present on a network interface, but NetFlow tells you who and what: who is using the bandwidth, what they are using it for, what websites they are visiting, and what protocols and applications they are using. Now here is an example of how you calculate bandwidth utilization on a specific interface.
So we are going to cover the octets in and out, the bandwidth capacity, and the time span. These are the three things you need to know in order to calculate bandwidth utilization on a circuit: the octet values in and out, the bandwidth or capacity of the circuit, and the time frame you are talking about. Let's take this for example: assume you have a traditional T-1, which is 1.544 megabits per second. Now if you were to poll for the traffic going through it, typically that's done with either ifInOctets or ifOutOctets. And typically, that would give you a value back in octets, which is the same as bytes. So if I polled that interface, the first value I might get back would be one million. And if I poll it five minutes later and get a value of forty-five million, then I know that in that five-minute interval there were forty-four million octets on this interface: received if it's ifInOctets, or sent if it's ifOutOctets. Now to do the math to calculate the bit rate, what you will want to do first of all is multiply the forty-four million by eight. That takes you from octets to bits. Once you have done that, divide that number by the number of seconds between your polls. In this case we had an even three hundred seconds, or five minutes, and we get a little over one million bits per second. Now divide that by a thousand to get kilobits per second, and by a thousand again to get megabits per second, and you see we are at about 1.17 megabits per second. Divide that by the theoretical max, which is 1.544 for a T-1, and you end up with 76% utilization. And that is effectively what your NMS is doing all the time. Now this is important because, in the event that your NMS is unable to collect this data or the data that you are seeing in the NMS looks odd, you need to know how it works so you can do it manually and really understand it.
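The T-1 calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how any particular NMS implements it; the function name `utilization` is made up, and the sketch assumes the counter did not wrap or reset between the two polls.

```python
def utilization(first_octets, second_octets, interval_seconds, capacity_bps):
    """Return (bits per second, fraction of capacity used) between two polls.

    Assumes the octet counter only increased between the polls
    (no counter wrap or device reset).
    """
    delta_octets = second_octets - first_octets   # octets seen in the interval
    bits = delta_octets * 8                       # octets (bytes) to bits
    bps = bits / interval_seconds                 # average bit rate
    return bps, bps / capacity_bps


# Values from the example: polls of 1,000,000 and 45,000,000 octets,
# 300 seconds apart, on a 1.544 Mbps T-1.
bps, fraction = utilization(1_000_000, 45_000_000, 300, 1_544_000)
print(f"{bps / 1000 / 1000:.2f} Mbps, {fraction:.0%} utilization")
# prints "1.17 Mbps, 76% utilization"
```

Working through one interface by hand like this is a quick way to sanity-check a value in the NMS that looks odd.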
Now in some cases, people will divide by 1024 instead of 1000 to convert from bits per second to kilobits and up to megabits, but traditionally if it's a bit rate, you only need to divide by 1000, or 10^3, and if it is a byte rate, or if it is measured in bytes, then you divide by 1024. You also need to be able to understand trends and how to recognize those within your NMS, specifically within Orion, and when and how to use the Orion Report Writer. Now we will cover the Report Writer in one of the detailed videos on Orion NPM administration, and of course you can read about it in the Orion administrator's guide. Last but not least, you will want to understand service level agreement, or SLA, management: how to define and list common SLAs, when to use charting vs. reporting, and how to understand 95th percentile. Now common SLAs are typically either delivered by your service provider or carrier, or defined internally to give you a way of measuring the service you are providing to your internal customers. Now the two most common SLAs are around performance and availability. For performance, say you are paying a carrier for a 10 megabit metro Ethernet circuit. You want to be able to verify that you can actually get up to 10 megabits at peak load times. If you are paying for 10 megabits, but you notice in Orion that you never spike above five megabits, even at peak load times, then they may have misprovisioned your circuit, and they probably owe you some money back. You want to really monitor that and work with your carrier on it. Now the other common SLA is around availability. Availability simply means uptime. On a lot of networks people like to talk about five nines of availability, and what five nines means is simply ninety-nine point nine nine nine percent available, which roughly equates to five minutes of downtime per year.
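The five-minutes figure for five nines is easy to check. Here is a small Python sketch of that arithmetic, assuming a 365-day year; the function name `allowed_downtime_minutes` is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def allowed_downtime_minutes(availability_percent):
    """Return the minutes of downtime per year an availability target allows."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% available: about {minutes:.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes per year, while three nines allows almost nine hours, which is why the number of nines in an availability SLA matters so much.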
Now a lot of organizations really are managed on availability, and a lot of engineers I have worked with are bonused on it, so you want to be able to track that and use both charts and reporting in Orion to tell when and where you violated these SLAs. Charting is great when you want to look at a single interface or single entity, and you want to see the variations in a statistic over a specific amount of time. Reporting is more valuable when you want to see many different entities for a set amount of time, and you want to see an average for the entire time period. Now of course 95th percentile is something we get a lot of questions about. It's actually a very simple method of taking out the highest peaks in your collected traffic to understand what your true max has been over time. And 95th percentile simply means you order all of the samples you collected from highest to lowest, drop the top five percent of those samples, and the next sample down the list is your 95th percentile.
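The 95th percentile method just described can be sketched in Python as follows. This is a minimal illustration of the sort-and-drop procedure, not Orion's own implementation; the function name `percentile_95` and the sample values are made up, and the top five percent is rounded down to a whole number of samples.

```python
def percentile_95(samples):
    """Return the 95th percentile using the sort-and-drop method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples, reverse=True)   # highest to lowest
    drop = len(ordered) * 5 // 100            # top five percent, rounded down
    return ordered[drop]                      # next sample down the list


# Hypothetical bit-rate samples in Mbps: steady traffic plus one short spike.
samples = [4.0] * 19 + [9.8]
print(percentile_95(samples))  # prints 4.0
```

With twenty samples, the top five percent is the single spike of 9.8, so it is discarded and the 95th percentile reflects the steady 4.0 Mbps load.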