Welcome to the New Year - 2011!
At the start of a new year, there is always talk of new year resolutions. You've seen them in magazines and newspapers: eat healthier food, lose weight, begin an exercise program, tackle the CCIE test, learn a new technology, clean up the network. I can't say that I've seen the last three items in the popular press, but they are still good resolutions. There's an overabundance of advice for the first few items, and lots of material on the 'net about the CCIE test and about networking technology, so let's talk about running a clean network and what that means.
A clean network has fewer operational problems, and when a problem occurs, it is obvious. I force a problem to be more obvious by eliminating all the similar instances of the problem. I no longer have to look through a long list of similar problems that are unimportant. When something is broken, there are generally only a few instances, and I know that all of them need to be fixed before the network is operationally correct. Therefore, I have a list of a few things that should not appear in a clean network:
1. Configuration not saved
2. Router interface down
3. Switch Trunk interface down
4. HSRP/VRRP/GLBP group with only one router
Note: See the Network Health Metrics posting for a graphical display of the network issues that exist.
Starting with configuration maintenance, the NMS should notify me of any network devices whose configuration has not been saved to NVRAM. What will happen to its configuration at the next power outage? A little more information is important to answering this question. Cisco IOS exposes two SNMP values in the CISCO-CONFIG-MAN-MIB, ccmHistoryRunningLastChanged and ccmHistoryRunningLastSaved, which record (as sysUpTime timestamps) when the running config was last changed and when it was last saved. Unfortunately, just entering config mode will cause the ccmHistoryRunningLastChanged value to be updated, indicating that the config was changed. Even if all you did was enter config mode, issue a few '?' help commands, and exit, IOS still thinks you made a change. So the NMS must be smart about determining whether a real change exists. When it detects a potential configuration change, it needs to download the new configuration and compare it with the previous configuration. If they are different, then it should report a configuration change.
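The two-step check described above can be sketched in a few lines of Python. The MIB object names are real (CISCO-CONFIG-MAN-MIB), but the functions, the normalization rules, and the sample configs below are hypothetical stand-ins for what an NMS would do after polling the device and downloading its config:

```python
# Sketch of the NMS logic: first a cheap SNMP-based test, then a real
# config diff to weed out false positives from someone merely entering
# config mode. Helper names and sample data are illustrative only.

def config_possibly_unsaved(last_changed_ticks, last_saved_ticks):
    """ccmHistoryRunningLastChanged / ccmHistoryRunningLastSaved are
    sysUpTime timestamps (hundredths of a second since boot). If the
    running config changed after it was last saved, flag the device."""
    return last_changed_ticks > last_saved_ticks

def is_real_change(old_config, new_config):
    """Entering config mode bumps ccmHistoryRunningLastChanged even when
    nothing changed, so confirm by comparing the configs themselves,
    ignoring blank lines and '!' comment/banner lines."""
    def normalize(cfg):
        return [line.rstrip() for line in cfg.splitlines()
                if line.strip() and not line.startswith("!")]
    return normalize(old_config) != normalize(new_config)

# Changed at tick 500000, saved at tick 400000: config is unsaved.
print(config_possibly_unsaved(500000, 400000))   # -> True
# Same config modulo comment lines: no real change to report.
print(is_real_change("! banner\ninterface Gi0/1\n", "interface Gi0/1\n"))  # -> False
```

The point of the second function is exactly the caveat in the text: the timestamp alone over-reports, so the diff is what decides whether an issue is raised.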
The next three items on my list are all important for networks that contain redundant links. At Netcraftsmen, we're finding that redundant networks are increasingly commonplace. Regular business operations depend on the network, and any network outage has a negative impact on the business. So it is important to know when a redundant component has failed. Let's say that your network has 100 router interfaces that are admin up, but are operationally down (in what I'll call "up/down" state). If you have a failure in a redundant router link, the count of up/down interfaces will increase to 101, which is much more difficult to notice than if the count went from 0 to 1. The failure to notice means that the other link in the redundant configuration will eventually fail, creating an outage. Contrast that with a network in which there are normally no router interfaces in up/down state (any interfaces that aren't connected are configured admin down; i.e. "shutdown"). When a link fails, an issue is created that alerts you to the fact. Note that using syslog or SNMP Traps isn't sufficient because they are typically delivered via UDP (the "Unreliable Datagram Protocol" ;-)). By doing a 'shutdown' on all router interfaces that aren't used, you have a 'clean' set of router interfaces, and it becomes much easier to spot any failed links.
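The up/down test itself is simple: an interface that is administratively up (IF-MIB ifAdminStatus = 1) but operationally down (ifOperStatus = 2) is a candidate failure. Here's a minimal sketch; the interface table is hypothetical sample data standing in for an SNMP walk of the IF-MIB:

```python
# IF-MIB status enumerations: up(1), down(2). In a real NMS these values
# come from polling ifAdminStatus and ifOperStatus for each interface.
ADMIN_UP, OPER_DOWN = 1, 2

def up_down_interfaces(if_table):
    """Return names of interfaces in the 'up/down' state:
    admin up but operationally down."""
    return [name for name, admin, oper in if_table
            if admin == ADMIN_UP and oper == OPER_DOWN]

sample = [
    ("Gi0/0", 1, 1),   # up/up: healthy
    ("Gi0/1", 1, 2),   # up/down: admin up but link failed -- flag it
    ("Gi0/2", 2, 2),   # admin down ('shutdown'): ignored, as the text suggests
]
print(up_down_interfaces(sample))   # -> ['Gi0/1']
```

Note how the shutdown interface drops out of the analysis entirely, which is exactly why shutting down unused ports keeps the up/down count at zero in a clean network.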
The same principle applies to switch trunking interfaces. If you shutdown or remove trunking configurations from all unused switch ports, then it becomes easier to spot failed trunk interfaces that interconnect your switches or that connect from the switches to servers that support trunking. In either case, you have failed interfaces that are required to support the business. By shutting down unused interfaces, you tell the NMS to ignore the interface in its analysis of the network's operational state. Any trunking interface that is in up/down state is then a failure that must be corrected.
The proper operation of the redundancy protocols, HSRP/VRRP/GLBP, is another factor affecting the reliable operation of a network. If there is only one router in the redundancy group, then one of several things happened:
1. The redundant (second) device hasn't been installed yet.
2. The redundant device was installed, but its configuration is incomplete.
3. The redundant device used to work, but failed and you've not noticed.
4. The redundant device is working, but the interface on which the redundancy is configured has failed.
The NMS could probably figure out each of these scenarios, but there's little value in going that far with the analysis. All that's really needed is to draw your attention to the fact that you have a redundancy protocol configured, but there's only one device in the group.
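That single, simple check can be sketched as follows: collect the set of routers reporting membership in each redundancy group (for example, from CISCO-HSRP-MIB polls) and flag any group with fewer than two members. The group data here is hypothetical:

```python
# Sketch of the single-member redundancy check. Keys are group IDs,
# values are the routers the NMS has seen participating in that group.
# Which of the four failure scenarios applies doesn't matter -- one
# member is an issue either way.

def lone_redundancy_groups(groups):
    """Return IDs of HSRP/VRRP/GLBP groups with only one router."""
    return sorted(gid for gid, members in groups.items() if len(members) < 2)

hsrp_groups = {
    10: {"core-rtr-1", "core-rtr-2"},  # healthy pair
    20: {"edge-rtr-1"},                # only one router -- raise an issue
}
print(lone_redundancy_groups(hsrp_groups))  # -> [20]
```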
In all three of these situations, you've paid to implement redundant devices and links. Your customers have a valid point when they ask why connectivity failures occur. Note: Try working in the brokerage industry, where the traders can get very upset when the network goes down. By running a clean network, it is obvious when one of the redundant elements has failed. Identifying the failures and promptly correcting them allows you to spend less time fighting fires and more time working on productive, and potentially more interesting, projects.
I can hear someone now: "But we have hundreds of interfaces in up/down state! Do you expect us to research and correct them all?" Yes, I do. Tackle a few each work day and you'll eventually take care of them all. You may even find redundancy failures in the process. If you have a lot of WAN links, you'll probably also find some links that are still in service, but the remote connectivity is no longer needed. Turn off the service and you'll save money.
If you use the interface tagging mechanism I've described previously, you can easily use the NMS to create interface groups that give you better control over the analysis and reporting of interface state and performance. Think of having one interface group that's WAN links and another group that's trunking links between switches. Or you can do grouping by region. Use grouping capabilities to sub-divide the network into more manageable pieces.
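The grouping itself is just a mapping from a tag to the interfaces that carry it. A minimal sketch, with hypothetical tags and interface names (in practice the tag might come from the interface description field or an NMS attribute):

```python
# Group interfaces by tag so that up/down analysis and performance
# reports can be run per group (WAN links, switch trunks, regions, ...).
from collections import defaultdict

def group_by_tag(interfaces):
    """Map tag -> list of interface names, preserving discovery order."""
    groups = defaultdict(list)
    for name, tag in interfaces:
        groups[tag].append(name)
    return dict(groups)

tagged = [
    ("Se0/0/0", "wan"),
    ("Gi1/0/48", "switch-trunk"),
    ("Se0/0/1", "wan"),
]
print(group_by_tag(tagged))
# -> {'wan': ['Se0/0/0', 'Se0/0/1'], 'switch-trunk': ['Gi1/0/48']}
```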
If you've not already surmised, NetMRI checks for all of these items and supports device and interface groups. Other NMS tools have similar capabilities. Configure the NMS to send a daily or weekly report on the number of instances of each issue. Ideally, you would create a Network Health Metric plot that shows the quantity of each issue, allowing you to track the number of instances over time and to quickly determine if an issue exists.