|
Non-Invasive Network Management
A system of processing device health reporting using an extension of SNMP trapd
By: Bruce Bahlmann - Contributing Author
Created: March 24, 1999
Note: For help designing/implementing your non-invasive network management program or developing tools to help you improve or implement such a program contact Birds-Eye.Net.
Management Summary:
A broadband operator's current visibility into the health of Hybrid-Fiber Coax
(HFC) is extremely limited. Efforts to extend visibility have resorted to attempting to
monitor customer equipment (such as telephony RSUs, cable modems, and personal computers)
with traditional network management software. However, these efforts fail to achieve
reliable detection of partial node outages and are ill-equipped to scale with the growing
number of subscribers without significantly impacting network performance. A new method of
network management that provides a non-invasive approach to managing HFC is also the
first such system that can provide complete end-of-line monitoring. This system is
completely scalable, offers sophisticated location logic that enables one to quickly locate
the most common types of HFC outages, uses a mere fraction of the bandwidth required
by previous efforts, and will fully integrate into existing top-level network management
systems without additional workstations or monitors.
Introduction:
Overview:
Today's network management and troubleshooting tools rely on several different
types of data to effectively troubleshoot devices on a dynamic network. These data sources
include such things as Subscriber Management Systems (SMS), Operational Support Systems
(OSS), and on-line databases that contain information critical to the activation and
operation of customer devices (cable modems and customer computers). Unfortunately,
managing and troubleshooting devices in a dynamic realm involves creating relationships
between these different data sources and then presenting the product back to various users
of the system (with varying levels of detail). Maintaining these relationships as well as
enabling them to present a total picture of the device in question is extremely resource
intensive. In fact, as the volume of customers increases, this process begins to affect the
quality of service this data supports (e.g. it increases load on provisioning systems as well
as traffic on the network).
Non-Invasive Network Management (NINM) makes use of a listening application, also
known as a Network Status Collector (NSC), and several Network Status Generators (NSG).
Rather than actively polling all the known network devices and searching through various
data sources on-demand, an NSC consisting of a listening application paired with
intelligent analysis techniques and a collection of NSGs can provide unprecedented network
management for HFC. Interpretation and reporting of the relationships assembled by the NSC
paint a picture of the HFC operational landscape and assist plant operations personnel in
surgically correcting potential problems or failures. The use of NINM centers on the
following goals:
- Provide access to real-time network data with little or no impact on OSS and network
traffic
- Provide customizable reporting options with different exporting capability (email,
pager, web, news, chat, etc.)
- Provide a standard API along with Open Database Connectivity (ODBC) support for
integration with other network management solutions
- Provide the ability to learn "normal" operation and that which falls above and
below it
- Provide the ability to localize potential problems
- Provide the ability to display varying levels of detail
Use of NINM thus provides real-time data analysis and the ability to break down any
problem quickly and without using large chunks of network bandwidth or impacting OSS.
Background:
The reliability of any network management system is directly dependent on the extent
that it reaches out to all of its components. Customer Premises Equipment (CPE) and cable
modems (CM) represent an attractive option for network operations (NetOps) organizations
to reach out to all active components and verify HFC availability. Customers' installed
CPE and CMs are randomly placed throughout the network, represent no
additional cost, and provide increasingly useful information to traditional network
management software. However, traditional network management software relies on actively
polling devices to collect operational status of the network where these devices reside.
Thus, current efforts to use CPEs and CMs along with traditional network management
software fall short of the mark because they rely on active polling of these devices. The
method of active polling mainly suffers from a scalability issue along with the
unpredictable nature of a customer-controlled device. Essentially, the sheer number of
CPEs and CMs reaches a point where a single application can no longer
poll them all on a regular basis within a worthwhile timeframe. Another problem with
active polling is that unless similar devices are polled together (or within a reasonable
timeframe) the information gathered is useless. For example, all CMs on a network can
span several HFC nodes. Unless CMs are polled by HFC node and within a reasonable
timeframe, the information gathered may only indicate that something is potentially wrong
with one of the HFC nodes. However, since none of this information can be collated by node,
the resulting data is unreliable and only marginally useful. NetOps thus resorts to
managing only a subset of CPEs and CMs, if at all, so they can only look at a small
controlled sample of the customers. Moreover, because their core management applications are
best suited to manage devices with static addresses (and are not capable of managing
dynamic devices), NetOps must re-map the IP addresses of these devices with every renumbering of
the network.
Network bandwidth is also a concern of NetOps when it comes to managing large numbers
of devices. Too much management (or active polling) can impact their ability to manage
other (perhaps more important) network devices like routers, backbone links, etc. When
network management is impacting performance one is left with the following options:
- Increase the network capacity
- Reduce the polling frequency
Until now, the need to increase the number of managed devices while reducing the
bandwidth required to do so has been hampered by traditional monitoring
methods and software. However, the main reason for attempting to monitor multiple CPE and
CM devices is to see further into the HFC network. If one could truly monitor end-of-line,
one would be able to achieve the best possible HFC network management visibility without
being required to perform round-robin active polling of CPEs and CMs.
Non-Invasive Network Management (NINM) Architecture:
Design Goals & Hypothesis:
The goal of the NINM is to act like a giant information collector. Information
generated from the field (HFC plant) is directed to the NINM that in turn reads,
processes, and stores this data for follow-on analysis. In order to perform these tasks,
the NINM must be divided into two components. One component should perform the
"raw" data collection work quickly and efficiently (with little or no analysis
of the raw data collected). The second component performs detailed analysis on the raw
data, creates necessary relationships, and stores this data for off-line reporting and
further analysis.
The NINM should historically track information from devices by source. The number of
historical instances by source must be adjustable.
The NINM should keep summary statistics (by source) on such things as first contact,
instances, and total time between instances. Summary statistics must be organized by day,
month, year, and overall. This information will enable follow-on analysis to make
"rough" predictions on which devices it should hear from next and when
(approximately). One could also use this information to determine whether a device is
operating within, below, or above some average.
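The per-source summary statistics described above can be sketched as follows. This is a minimal illustration only: the class and field names are hypothetical, timestamps are assumed to be seconds, and only the "overall" bucket is kept (the day, month, and year buckets would repeat the same bookkeeping).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceStats:
    """Running summary for one source (hypothetical structure, not from the paper)."""
    first_contact: Optional[float] = None  # time of the first message heard
    last_contact: Optional[float] = None   # time of the most recent message
    instances: int = 0                     # number of messages heard
    total_gap: float = 0.0                 # total time between instances

    def record(self, when: float) -> None:
        """Fold one received message into the running totals."""
        if self.first_contact is None:
            self.first_contact = when
        else:
            self.total_gap += when - self.last_contact
        self.last_contact = when
        self.instances += 1

    def mean_gap(self) -> Optional[float]:
        """Average time between instances, or None with fewer than two."""
        if self.instances < 2:
            return None
        return self.total_gap / (self.instances - 1)

    def expected_next(self) -> Optional[float]:
        """Rough prediction of when this source should be heard from next."""
        gap = self.mean_gap()
        return None if gap is None else self.last_contact + gap

    def classify(self, observed_gap: float, tolerance: float = 0.25) -> str:
        """Place an observed gap within, below, or above the average gap.

        The 25% tolerance band is an arbitrary assumption for illustration.
        """
        gap = self.mean_gap()
        if gap is None:
            return "unknown"
        if observed_gap > gap * (1 + tolerance):
            return "above"
        if observed_gap < gap * (1 - tolerance):
            return "below"
        return "within"
```

For a source heard at 0, 60, 120, and 180 seconds, the average gap is 60 seconds, so the next message would be expected at roughly 240 seconds, and a silence of 120 seconds would be classified as above average.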
Proposed Architecture:
The operational components of NINM may consist of something similar to that shown in
Figure 1.0. In this figure, the NSG represents the element located farthest out in the HFC.
The NSG is responsible for reporting information back to the collector in a timely fashion.
The NSG is typically located at the end-of-line or at the last tap. One implementation may
place a single NSG at the end of each mainline downstream from the last active component
(i.e. a single NSG per HFC node). This implementation would provide end-of-line monitoring
for the most likely components to fail. Another, more complete implementation would place
an NSG at the end of each line* (multiple NSGs per HFC node). Since each fiber node may
have several end-of-lines (branches off the main line), this would result in more
complete end-of-line monitoring, although at the cost of many more NSGs.
* Note: this implementation brings to light an unfortunate fact about HFC design
practices. HFC is designed around "actives" such that each and every reference
to a physical location on HFC is relative to its closest active. A network management
system capable of providing visibility down to every end-of-line requires that each branch of
the HFC be labeled. This would allow the NSC analyst to pinpoint the branch where the
problem is located; no other HFC network management system can claim this level of
accuracy. However, in order to achieve this level of visibility one would need to further
label HFC nodes such that all branches were identified.
The Regional HSD Network would represent various connections between hybrid fiber coax
(HFC), local area networks within distribution hubs and headends, and wide area networks (WAN)
between headends and distribution hubs. Essentially, the Regional HSD Network represents
the entire distribution system for delivering HSD. The NSC represents one or more
computers tasked with providing the "collector" functionality. The NSC may use
a single computer or a cluster of computers located in a central location within reach of the
Regional HSD Network to collect its information. A more costly distributed architecture is also
possible depending on the capability of the NSC's database. In this case, the NSC may
be located within larger distribution hubs or headends to collect the information and
minimize WAN traffic. The NSC-SC is a variant of the NSC that performs speed checks. This
functionality allows the various HFC links to be periodically tested for link speed. The
results of these speed checks would be deposited in the raw datastore.

Figure 1.0 NINM Components
The raw datastore represents the NSC's primary storage facility. The raw datastore
is a high-performance database optimized to handle many different types of write
transactions. Here information collected by the NSC resides to save state for the NSC
regarding each NSG it tracks. This data is time-sensitive and provisioned with a physical
HFC location to allow other applications to report overall system status. NSC analysis* is
done on the raw datastore to determine alarm events and suggest possible operational
problems before they become outages.
Possible Database Enhancement:
If individual database records could age, and one could set up a trigger on certain age
thresholds, the process of searching through individual records for devices that are past due
could be avoided. This would eliminate the remote possibility of a scalability problem
associated with the NSC analysis function when dealing with extremely large numbers of
NSGs. In addition, if certain thresholds (greater/less than some global number) could be
placed on database values, these could also trigger alarm events. If this were possible, the
NSC analysis could react instantaneously to triggers rather than resorting to finding them
on its own.
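Where the database itself cannot age records, a similar effect can be approximated in the application. The sketch below is not the database feature described above but a swapped-in technique: a priority queue ordered by deadline, so that only records that have actually expired are ever touched. The class name, the `max_age` parameter, and the NSG identifiers are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


class DeadlineTracker:
    """Tracks when each NSG is next due, without scanning every record."""

    def __init__(self, max_age: float) -> None:
        self.max_age = max_age                    # allowed silence, in seconds
        self._deadline: Dict[str, float] = {}     # current deadline per NSG
        self._heap: List[Tuple[float, str]] = []  # (deadline, nsg_id), earliest first

    def heard_from(self, nsg_id: str, now: float) -> None:
        """Record a message and push the NSG's deadline forward."""
        deadline = now + self.max_age
        self._deadline[nsg_id] = deadline
        heapq.heappush(self._heap, (deadline, nsg_id))

    def pop_overdue(self, now: float) -> List[str]:
        """Return NSGs whose deadline has passed; these act as the 'triggers'."""
        overdue = []
        while self._heap and self._heap[0][0] <= now:
            deadline, nsg_id = heapq.heappop(self._heap)
            # Ignore stale heap entries superseded by a newer message.
            if self._deadline.get(nsg_id) == deadline:
                overdue.append(nsg_id)
                del self._deadline[nsg_id]
        return overdue
```

Each call to `pop_overdue` costs time proportional to the number of expired entries rather than to the total number of NSGs, which is the property the aging-record idea is after.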
The NSC analysis unit performs many functions including verification, status
processing, and data warehousing. All events (alarm and potential operational
deficiencies) are sent through a verification process whereby the status of the NSG is
confirmed via an active polling method. Active polling is necessary to ensure that an
escalated state was properly detected. If not, the NSC analysis unit attempts to figure
out why this event was improperly escalated, either correcting the raw datastore record
or alerting an admin to a possible configuration error (e.g. improper thresholds triggered
an alert). Once verification is completed, the NSC analysis updates the raw datastore and
copies the updated record to the reporting datastore. The NSC analysis also regularly
copies non-escalated records from the raw datastore to the reporting datastore to ensure
that this database mirrors the raw datastore as closely as possible.
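The verification step described above might look roughly like the following. This is a sketch under stated assumptions: both datastores are modeled as plain dictionaries, the `poll` callable stands in for whatever active polling method is used, and the status strings are invented for illustration.

```python
from typing import Callable, Dict


def verify_event(
    nsg_id: str,
    raw_store: Dict[str, dict],
    reporting_store: Dict[str, dict],
    poll: Callable[[str], bool],
) -> str:
    """Confirm or correct one escalated event, then mirror the record.

    `poll` is a hypothetical active-polling hook that returns True when
    the NSG answers.
    """
    record = raw_store[nsg_id]
    if poll(nsg_id):
        # The NSG is reachable, so the escalation was improper: correct the record.
        record["status"] = "ok"
        record["note"] = "escalation not confirmed by active poll"
        outcome = "corrected"
    else:
        # The NSG did not answer: the escalated state stands.
        record["status"] = "alarm"
        outcome = "confirmed"
    # Copy the updated record to the reporting datastore.
    reporting_store[nsg_id] = dict(record)
    return outcome
```

Only events that have already been escalated reach this function, so active polling stays limited to the small set of NSGs that appear to be in trouble.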
The reporting datastore serves as the NINM system's data warehouse. The data
warehouse allows ODBC access so network operations personnel can access raw data about the
HFC for reporting and trend analysis. This type of open access is preferred over offering
some custom interface with a limited scope of functionality. The reporting datastore does
however provide some defined functionality such as a basic API for use with standard
Top-Level network management systems, an alerts interface for escalation of alarms, and a
web interface for exporting various levels of information on internal web sites.
The alert function provides interfacing to a trouble ticketing system for escalation of
events to create a trouble ticket for tracking purposes. This interface may also be used
to send a page to the on-call pager specifying the type of problem, its physical location,
and the trouble ticket number it has been assigned.
The web server provides multi-level access to the information stored in the reporting
datastore along with on-demand access to individual records stored in the raw datastore.
This allows the web server to provide troubleshooting to network operations personnel as
well as technicians and plant operations personnel in the field. Multi-level access allows
individuals to drill down to the level they desire (See Figure 1.1) to determine the
source of some problem or verify a fix. The multi-level access also allows web sites to
refer to other web sites, perhaps located at the enterprise level or at other regional
offices. This allows one to traverse any regional broadband system within the internal
network.

Figure 1.1 Drill-Down Hierarchy
In summary, the NINM system provides end-of-line monitoring visibility while
integrating seamlessly into a top-level network management system. Ideally, the NINM
system would not require dedicated network management systems in the network operations
center. Rather, the NINM system would provide an element manager that could be loaded into
the top-level network management system to gain access to the information contained in the
reporting datastore.
Network Status Generator (NSG):
The NSG could be nothing more than a cable modem housed in a small enclosure (perhaps
similar to a filter). Certainly, more elaborate versions of the NSG could also be built
within the same basic operational framework as technology further reduces the
functionality of a cable modem onto silicon. The NSG would be plant powered (similar to
telephony units) and be able to connect via an RF fitting to a tap without restricting
other drop connections to dwellings served by the tap. The NSG would be provisioned like
any other cable modem; however, it would use a different boot file to configure its
operational behavior and require that its physical plant location be provisioned along
with its Media Access Control (MAC) address. The physical location information would be
stored in the NSC's raw datastore to allow any analysis to collate results in implementations
where several NSGs are deployed on a single HFC node. This information would also allow
alerts to direct plant operations personnel to the exact HFC node (or node branch) in
question.
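As a rough illustration of how provisioned locations let the analysis collate several NSGs on one node, the sketch below groups silent NSGs by node and branch. The function name, the MAC-to-location mapping, and the "full"/"partial" labels are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def localize(
    locations: Dict[str, Tuple[str, str]],
    silent: Iterable[str],
) -> Dict[str, dict]:
    """Summarize silent NSGs per HFC node.

    `locations` maps an NSG MAC address to its provisioned (node, branch);
    `silent` lists the MAC addresses of NSGs that are past due.
    """
    silent_set = set(silent)
    total: Dict[str, int] = defaultdict(int)        # NSGs deployed per node
    down: Dict[str, List[str]] = defaultdict(list)  # silent branches per node
    for mac, (node, branch) in locations.items():
        total[node] += 1
        if mac in silent_set:
            down[node].append(branch)
    report = {}
    for node, branches in down.items():
        # Every NSG on the node silent suggests a full node outage;
        # otherwise the problem is confined to the listed branches.
        kind = "full" if len(branches) == total[node] else "partial"
        report[node] = {"outage": kind, "branches": sorted(branches)}
    return report
```

With one NSG per node only the "full" case can be detected; deploying an NSG per branch is what makes partial node outages visible.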
The NSG is mainly a chatterbox: all it does is send timely messages about its
operational status to some IP address specified in the NSG's boot file. These
messages may contain some/all of the following information:
- IP address of the NSG
- Network and subnet mask of the NSG
- Tx/Rx power levels
- A file of specified size to perform periodic speed testing
- Time the file is sent (in UTC)
The NSG may receive the following configuration parameters via its DHCP request:
- IP address
- Gateway IP address
- Subnet mask
- TOD server IP address
- TFTP server IP address
- TFTP boot file name
The NSG may receive the following configuration parameters via its boot file:
- IP address of NSC and port* (optional)
- NSC callback interval (in seconds)
- NSC callback type (udp/tcpip)
- Retry interval (in seconds) for use with TCP/IP callback type only [0 = disable/none]
- IP address of NSC-SC and port* (optional)
- NSC-SC file upload interval (in seconds) [0 = disable/none]
- Recycle interval (optional) [0 = disable/none]
- Lower reporting band frequency for use with recycle interval [0 = disable/none]
- Upper reporting band frequency for use with recycle interval [0 = disable/none]
* Defining a port may provide ways to run multiple instances of the NSC/NSC-SC and
distribute the load of answering all the NSGs across various NSC/NSC-SC instances.
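The boot-file parameters listed above could be modeled as a simple record. The sketch below is illustrative only: the field names are hypothetical, the actual boot-file encoding is not specified in this paper, and the validation rules are assumptions drawn from the "[0 = disable/none]" annotations.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class NsgBootConfig:
    """Boot-file parameters for an NSG (field names are hypothetical)."""
    callback_interval_s: int                  # NSC callback interval, seconds
    callback_type: str = "udp"                # "udp" or "tcpip"
    nsc: Optional[Tuple[str, int]] = None     # NSC IP address and port
    retry_interval_s: int = 0                 # TCP/IP only; 0 disables
    nsc_sc: Optional[Tuple[str, int]] = None  # NSC-SC IP address and port
    upload_interval_s: int = 0                # speed-test upload; 0 disables
    recycle_interval_s: int = 0               # scheduled reset; 0 disables
    lower_band: float = 0.0                   # lower reporting band; 0 disables
    upper_band: float = 0.0                   # upper reporting band; 0 disables

    def validate(self) -> None:
        """Reject obviously inconsistent settings."""
        if self.callback_type not in ("udp", "tcpip"):
            raise ValueError("callback type must be 'udp' or 'tcpip'")
        if self.callback_interval_s <= 0:
            raise ValueError("callback interval must be positive")
        if self.lower_band and self.upper_band and self.lower_band >= self.upper_band:
            raise ValueError("lower band must be below upper band")

    @property
    def retry_enabled(self) -> bool:
        """Retries apply only to the TCP/IP callback type."""
        return self.callback_type == "tcpip" and self.retry_interval_s > 0

    @property
    def scan_enabled(self) -> bool:
        """A frequency scan needs a recycle interval and both band limits."""
        return (self.recycle_interval_s > 0
                and self.lower_band > 0 and self.upper_band > 0)
```

A retry interval supplied alongside the UDP callback type is simply ignored here, which mirrors the "for use with TCP/IP callback type only" note in the list.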
The preferred implementation of the NSG is that it would send a User Datagram Protocol
(UDP) packet to the NSC as a means of reporting its operational status. Using a
UDP message means less work for the NSC, as it does not have to acknowledge the
receipt of the packet. This, however, could create problems if the NSC were insufficiently
powered or clustered. In these cases, the Transmission Control Protocol/Internet
Protocol (TCP/IP) could be selected, which would provide an NSC acknowledgement of a
received NSG message.
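A fire-and-forget UDP status report could be sketched as below. The wire format is not defined in this paper, so the JSON encoding, the field names, and the function names are all assumptions chosen for readability; a real NSG might well use a more compact encoding.

```python
import json
import socket
from datetime import datetime, timezone
from typing import Optional


def build_status(
    ip: str,
    netmask: str,
    tx_power: float,
    rx_power: float,
    sent: Optional[datetime] = None,
) -> bytes:
    """Encode one NSG status message (hypothetical JSON layout)."""
    sent = sent or datetime.now(timezone.utc)
    message = {
        "ip": ip,
        "netmask": netmask,
        "tx_power": tx_power,
        "rx_power": rx_power,
        "sent_utc": sent.isoformat(),  # time the message is sent, in UTC
    }
    return json.dumps(message).encode("utf-8")


def send_status(payload: bytes, nsc_host: str, nsc_port: int) -> None:
    """Send the message as a single UDP datagram; no acknowledgement is awaited."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (nsc_host, nsc_port))
```

Because `send_status` never waits for a reply, a lost datagram goes unnoticed by the NSG; detecting the gap is left to the NSC analysis, as the following paragraph explains.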
It is important that the NSC process an extremely high percentage of the NSG messages
because the absence of an expected message from a particular NSG is interpreted by the NSC
analysis as a potential outage. Although the NSC analysis is equipped to correct for
dropped messages (through its verification process), the more efficiently the NSC handles
a high percentage of the NSG messages (or all of them), the less active polling is
initiated by the NSC analysis to verify potentially unreachable NSGs.
The NSG's file upload function is required to construct a sequence of packets to
form a file large enough to reasonably measure the file upload time. There are two
ways this could be implemented. The first requires the NSG and the NSC-SC to maintain
highly precise time (to at least hundredths of a second). If both the NSG and the NSC-SC
synchronized their time with the same TOD server, a smaller file (or single packet) could
serve as the basis for the test. The second implementation still requires both the NSC-SC
and the NSG to be synced to a TOD server. However, they would work with a larger upload file
that would take long enough to transmit to calculate a reliable transmission speed.
The NSG has the capability to perform a scheduled software reset. During the reset
operation, the NSG will perform a normal boot (i.e. perform a DHCP Discover, download TFTP
file, and configure itself for normal operation). If the boot file has a recycle interval
defined (other than zero) and the upper and lower reporting bands are something other than
zero, the NSG will go off-line and perform a frequency scan from the lower to upper range
defined. This information will be saved in memory and sent to the NSC upon completion of
the next callback interval. The information saved will be in the form of key-value pairs
associated with each frequency and power reading of the range defined. If no range is
defined only the transmit and receive frequency are saved.
Network Status Collector (NSC):
The NSC represents a robust multi-threaded listener capable of handling multiple
simultaneous requests. The NSC's job is to quickly and efficiently handle incoming
messages sent from NSGs. Since the NSGs are capable of sending to different instances of
NSCs (running on different ports) the NSC could represent one or more (e.g. a cluster)
similar applications all tasked with handling each and every NSG message.
The NSC saves information contained in every NSG message received in its raw datastore.
Unprocessed or dropped NSG messages must be corrected by the follow-on NSC analysis
process which verifies detected problem states before escalating them.
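A multi-threaded listener of this kind can be sketched with Python's standard `socketserver` module. This is one possible shape rather than the design itself: the in-memory `RawStore` stands in for the raw datastore, and all class names are hypothetical.

```python
import socketserver
import threading
import time
from typing import Dict, Optional, Tuple


class RawStore:
    """Thread-safe stand-in for the raw datastore: latest message per source."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: Dict[str, Tuple[float, bytes]] = {}

    def save(self, source: str, payload: bytes, received: float) -> None:
        with self._lock:
            self._latest[source] = (received, payload)

    def latest(self, source: str) -> Optional[Tuple[float, bytes]]:
        with self._lock:
            return self._latest.get(source)


class NsgHandler(socketserver.BaseRequestHandler):
    """Stores each datagram with its arrival time; no analysis is done here."""

    def handle(self) -> None:
        payload, _sock = self.request  # for UDP servers, request is (data, socket)
        self.server.store.save(self.client_address[0], payload, time.time())


class NscServer(socketserver.ThreadingUDPServer):
    """UDP listener that handles each incoming message in its own thread."""

    def __init__(self, address: Tuple[str, int], store: RawStore) -> None:
        super().__init__(address, NsgHandler)
        self.store = store
```

Running `NscServer(("0.0.0.0", port), RawStore()).serve_forever()` on several ports would give the multiple NSC instances mentioned above, with the port number being whatever the NSG boot file specifies. The handler deliberately does nothing beyond saving the message, in keeping with the split between raw collection and follow-on analysis.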
Network Status Collector Speed Checker (NSC-SC):
The NSC-SC processes speed checking files sent to it by NSGs. Since this process could
potentially impact bandwidth, the NSG would not perform this task as frequently as sending
status update messages to the NSC.
The NSC-SC processes files sent to it by noting the time it was received from the NSG,
determining the size of the packet or message, and then calculating the speed of the
transmission. The speed of the transmission is calculated using the following formula:
Transmission speed = size of file or packet / (time received by NSC-SC - time sent from NSG)
This information is then saved in the raw datastore. Having this information allows the
first-ever tracking of HFC data transfer rates in the all-important return path for each
and every node in a system.
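The formula above translates directly into code. In this sketch the function name and the units (bytes and seconds) are assumptions; both timestamps are taken to come from clocks synced to the same TOD server.

```python
def transmission_speed(size_bytes: int, sent: float, received: float) -> float:
    """Return bytes per second for one upload.

    `sent` is the time stamped by the NSG and `received` is the time noted
    by the NSC-SC, both in seconds on a shared clock.
    """
    elapsed = received - sent
    if elapsed <= 0:
        # Clock skew or too small a file: the measurement is unusable.
        raise ValueError("received time must be later than sent time")
    return size_bytes / elapsed


# Worked example: a 1,000,000-byte file sent at t=100.0 s and received at t=108.0 s.
speed = transmission_speed(1_000_000, sent=100.0, received=108.0)
megabits_per_second = speed * 8 / 1_000_000
```

Here the upload takes 8 seconds, giving 125,000 bytes per second, or 1 megabit per second. The guard against a non-positive elapsed time reflects why the text stresses precise time sync or a sufficiently large file.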
|