Birds-Eye.Net
All things broadband and more...
 
Non-Invasive Network Management
A system of processing device health reporting using an extension of SNMP trapd

By: Bruce Bahlmann - Contributing Author

Created: March 24, 1999

Note: For help designing/implementing your non-invasive network management program or developing tools to help you improve or implement such a program contact Birds-Eye.Net.

Management Summary:

A broadband operator’s current visibility into the health of its Hybrid Fiber-Coax (HFC) plant is extremely limited. Efforts to extend visibility have resorted to monitoring customer equipment (such as telephony RSUs, cable modems, and personal computers) with traditional network management software. However, these efforts fail to detect partial node outages reliably and are ill-equipped to scale with the growing number of subscribers without significantly impacting network performance. This paper describes a non-invasive approach to managing HFC that is intended to provide complete end-of-line monitoring. The system is designed to scale, offers location logic that enables one to quickly locate the most common types of HFC outages, uses a fraction of the bandwidth required by previous efforts, and integrates into existing top-level network management systems without additional workstations or monitors.

Introduction:

Overview:

Today’s network management and troubleshooting tools rely on several different types of data to troubleshoot devices on a dynamic network. These data sources include Subscriber Management Systems (SMS), Operational Support Systems (OSS), and on-line databases that contain information critical to the activation and operation of customer devices (cable modems and customer computers). Unfortunately, managing and troubleshooting devices in a dynamic realm involves creating relationships between these data sources and then presenting the result to the various users of the system (with varying levels of detail). Maintaining these relationships, and using them to present a total picture of the device in question, is extremely resource intensive. In fact, as the volume of customers increases, this process begins to affect the quality of the service this data supports (e.g. increased load on provisioning systems as well as increased traffic on the network).

Non-Invasive Network Management (NINM) uses a listening application, known as a Network Status Collector (NSC), and several Network Status Generators (NSG). Rather than actively polling all the known network devices and searching through various data sources on demand, an NSC paired with intelligent analysis techniques and a collection of NSGs can provide far more complete network management for HFC. Interpretation and reporting of the relationships assembled by the NSC paint a picture of the HFC operational landscape and help plant operations personnel correct potential problems or failures precisely. The use of NINM centers on the following goals:

  • Provide access to real-time network data with little or no impact on OSS and network traffic
  • Provide customizable reporting options with different exporting capability (email, pager, web, news, chat, etc.)
  • Provide a standard API along with Open Database Connectivity (ODBC) support for integration with other network management solutions
  • Provide the ability to learn "normal" operation and to detect operation that falls above or below it
  • Provide the ability to localize potential problems
  • Provide the ability to display varying levels of detail

Use of NINM thus provides real-time data analysis and the ability to break down any problem quickly and without using large chunks of network bandwidth or impacting OSS.

Background:

The reliability of any network management system depends directly on the extent to which it reaches all of its components. Customer Premises Equipment (CPE) and cable modems (CM) represent an attractive way for network operations (NetOps) organizations to reach all active components and verify HFC availability. Installed CPE and CMs are scattered throughout the network, represent no additional cost, and provide increasingly useful information to traditional network management software.

However, traditional network management software relies on actively polling devices to collect the operational status of the network where these devices reside, and current efforts to use CPEs and CMs this way fall short for that reason. Active polling suffers mainly from a scalability problem, compounded by the unpredictable nature of a customer-controlled device. Essentially, the sheer number of CPEs and CMs reaches a point where a single application can no longer poll all of them within a worthwhile timeframe. Another problem is that unless similar devices are polled together (or within a reasonable timeframe), the information gathered is useless. For example, the CMs on a network can span several HFC nodes. Unless CMs are polled by HFC node and within a reasonable timeframe, the information gathered may only indicate that something is potentially wrong with one of the HFC nodes. Because none of this information can be collated by node, the resulting data is unreliable and only marginally useful.

NetOps thus resorts to managing only a subset of CPEs and CMs, if any, so they can only look at a small controlled sample of the customers. Moreover, because their core management applications are best suited to devices with static addresses (and are not capable of managing dynamic devices), NetOps must re-map the IP addresses of these devices with every renumbering of the network.
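
The scalability limit of round-robin polling can be made concrete with a rough calculation. The sketch below is illustrative only: the function name, the per-poll time, and the device counts are assumptions, not measurements from any network.

```python
import math


def full_sweep_seconds(devices: int, seconds_per_poll: float,
                       parallel_pollers: int = 1) -> float:
    """Estimate the time for one round-robin pass over every device.

    Assumes every poll takes the same time and pollers work in lockstep.
    """
    if devices < 0 or seconds_per_poll < 0 or parallel_pollers < 1:
        raise ValueError("invalid polling parameters")
    batches = math.ceil(devices / parallel_pollers)
    return batches * seconds_per_poll


# Hypothetical figures: 50,000 modems, 0.5 s per poll, 10 concurrent pollers.
sweep = full_sweep_seconds(50_000, 0.5, 10)
print(f"one full sweep takes {sweep:.0f} s ({sweep / 60:.1f} min)")
```

Under these assumed numbers a single pass takes roughly 42 minutes, so two modems on the same node may be sampled many minutes apart, which is the collation problem described above.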

Network bandwidth is also a concern of NetOps when it comes to managing large numbers of devices. Too much management (or active polling) can impact their ability to manage other (perhaps more important) network devices like routers, backbone links, etc. When network management is impacting performance one is left with the following options:

  • Increase the network capacity
  • Reduce the polling frequency

The need to increase the number of managed devices while reducing the bandwidth required to do so has, until now, been hampered by traditional monitoring methods and software. However, the main reason for attempting to monitor multiple CPE and CM devices is to see further into the HFC network. If one could truly monitor end-of-line, one would achieve the best possible HFC network management visibility without round-robin active polling of CPEs and CMs.

Non-Invasive Network Management (NINM) Architecture:

Design Goals & Hypothesis:

The goal of the NINM is to act like a giant information collector. Information generated from the field (HFC plant) is directed to the NINM that in turn reads, processes, and stores this data for follow-on analysis. In order to perform these tasks, the NINM must be divided into two components. One component should perform the "raw" data collection work quickly and efficiently (with little or no analysis of the raw data collected). The second component performs detailed analysis on the raw data, creates necessary relationships, and stores this data for off-line reporting and further analysis.

The NINM should historically track information from devices by source. The number of historical instances by source must be adjustable.

The NINM should keep summary statistics (by source) on such things as first contact, instances, and total time between instances. Summary statistics must be organized by day, month, year, and overall. This information will enable follow-on analysis to make "rough" predictions on which devices it should hear from next and when (approximately). One could also use this information to determine whether a device is operating within, below, or above some average.
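
A minimal sketch of the per-source history and summary statistics described above. The class and field names are hypothetical, and the bucket keys (overall, year, month, day) are only one possible way to organize the counters.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Summary:
    """Counters for one source within one period (day, month, year, overall)."""
    first_contact: Optional[datetime] = None
    instances: int = 0
    gaps: int = 0
    total_gap: timedelta = timedelta(0)

    def add(self, when: datetime, previous: Optional[datetime]) -> None:
        if self.first_contact is None:
            self.first_contact = when
        self.instances += 1
        if previous is not None:
            self.total_gap += when - previous
            self.gaps += 1

    def mean_gap(self) -> Optional[timedelta]:
        return self.total_gap / self.gaps if self.gaps else None


class SourceStats:
    """History and summaries for a single source (one NSG)."""

    def __init__(self, history_size: int = 10) -> None:
        # Adjustable number of historical instances kept per source.
        self.history = deque(maxlen=history_size)
        self.buckets = defaultdict(Summary)

    def record(self, when: datetime) -> None:
        previous = self.history[-1] if self.history else None
        keys = ("overall", f"{when:%Y}", f"{when:%Y-%m}", f"{when:%Y-%m-%d}")
        for key in keys:
            self.buckets[key].add(when, previous)
        self.history.append(when)

    def expected_next(self) -> Optional[datetime]:
        """Rough prediction: last contact plus the overall mean gap."""
        gap = self.buckets["overall"].mean_gap()
        return self.history[-1] + gap if gap is not None else None


stats = SourceStats(history_size=5)
start = datetime(1999, 3, 24, 12, 0, tzinfo=timezone.utc)
for minute in (0, 5, 10):
    stats.record(start + timedelta(minutes=minute))
print(stats.expected_next())
```

Comparing a source's recent gap against its mean gap is one simple way to decide whether it is operating within, below, or above its average.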

Proposed Architecture:

The operational components of NINM may resemble those shown in Figure 1.0. In this figure, the NSG represents the element located farthest out in the HFC. The NSG is responsible for reporting information back to the collector in a timely fashion. It is typically located at the end-of-line or at the last tap. One implementation may place a single NSG at the end of each mainline, downstream from the last active component (i.e. a single NSG per HFC node). This implementation would provide end-of-line monitoring for the components most likely to fail. Another, more complete implementation would place an NSG at the end of each line* (multiple NSGs per HFC node). Since each fiber node may have several end-of-lines (branches off the main line), this would result in more complete end-of-line monitoring, though at the cost of many more NSGs.

* Note: this implementation brings to light an unfortunate fact about HFC design practices. HFC is designed around "actives" such that every reference to a physical location on HFC is relative to its closest active. A network management system capable of providing visibility down to every end-of-line requires that each branch of the HFC be labeled. This would allow the NSC analyst to pinpoint the branch where the problem is located; no other HFC network management system can claim this level of accuracy. However, to achieve this level of visibility one would need to further label HFC nodes so that all branches are identified.

The Regional HSD (high-speed data) Network represents the various connections between hybrid fiber-coax (HFC), local area networks within distribution hubs and headends, and wide area networks (WAN) between headends and distribution hubs. Essentially, the Regional HSD Network represents the entire distribution system for delivering HSD. The NSC represents one or more computers tasked with providing the "collector" functionality. The NSC may use a single computer or a cluster of computers located centrally, within reach of the Regional HSD Network, to collect its information. A more costly distributed architecture is also possible, depending on the capability of the NSC’s database. In this case, NSCs may be located within larger distribution hubs or headends to collect the information and minimize WAN traffic. The NSC-SC is a variant of the NSC that performs speed checks. This functionality allows the various HFC links to be tested periodically for link speed. The results of these speed checks would be deposited in the raw datastore.

Figure 1.0 NINM Components

The raw datastore represents the NSC’s primary storage facility. It is a high-performance database optimized to handle many different types of write transactions. Information collected by the NSC resides here, saving state for each NSG the NSC tracks. This data is time-sensitive and is provisioned with a physical HFC location so that other applications can report overall system status. NSC analysis* is done on the raw datastore to determine alarm events and to flag possible operational problems before they become outages.

* Possible Database Enhancement:

If individual database records could age, and one could set a trigger on certain age thresholds, the process of searching through individual records for devices that are past due could be avoided. This would eliminate a possible scalability problem in the NSC analysis function when dealing with extremely large numbers of NSGs. In addition, if thresholds (greater or less than some global number) could be placed on database values, these could also trigger alarm events. If this were possible, the NSC analysis could react to triggers immediately rather than finding such records on its own.
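
The text describes aging and triggers inside the database. As an illustration of the same idea outside the database, the sketch below keeps a min-heap of per-NSG deadlines so that overdue NSGs can be found without scanning every record. This is a swapped-in, in-memory technique rather than the database feature the text asks for, and every name in it is hypothetical.

```python
import heapq


class DeadlineIndex:
    """Tracks when each NSG is next expected; yields overdue ones cheaply."""

    def __init__(self):
        self._heap = []        # (deadline, nsg_id), may contain stale entries
        self._deadline = {}    # nsg_id -> current deadline

    def heard_from(self, nsg_id, now, interval, grace=0.0):
        """Record a message and push the time by which the next one is due."""
        deadline = now + interval + grace
        self._deadline[nsg_id] = deadline
        heapq.heappush(self._heap, (deadline, nsg_id))

    def pop_overdue(self, now):
        """Return NSGs whose deadline has passed, removing them from the index."""
        overdue = []
        while self._heap and self._heap[0][0] <= now:
            deadline, nsg_id = heapq.heappop(self._heap)
            # Skip entries superseded by a later message from the same NSG.
            if self._deadline.get(nsg_id) == deadline:
                del self._deadline[nsg_id]
                overdue.append(nsg_id)
        return overdue


index = DeadlineIndex()
index.heard_from("node12-branchA", now=0, interval=60)
index.heard_from("node12-branchB", now=0, interval=60)
index.heard_from("node12-branchA", now=50, interval=60)  # A reports again
print(index.pop_overdue(now=100))  # only branch B has missed its deadline
```

Each check costs time proportional to the number of overdue NSGs rather than to the total number tracked.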

The NSC analysis unit performs many functions, including verification, status processing, and data warehousing. All events (alarms and potential operational deficiencies) are sent through a verification process whereby the status of the NSG is confirmed via an active polling method. Active polling is necessary to ensure that an escalated state was properly detected. If it was not, the NSC analysis unit attempts to determine why the event was improperly escalated, either correcting the raw datastore record or alerting an admin to a possible configuration error (e.g. improper thresholds triggered an alert). Once verification is complete, the NSC analysis updates the raw datastore and copies the updated record to the reporting datastore. The NSC analysis also regularly copies non-escalated records from the raw datastore to the reporting datastore so that the reporting datastore mirrors the raw datastore as closely as possible.
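
The verification step can be sketched as follows. This is a simplification under stated assumptions: plain dictionaries stand in for the raw and reporting datastores, `poll` and `notify_admin` are hypothetical callables supplied by the caller, and the state names are invented for illustration. Where the text offers correcting the record or alerting an admin as alternatives, this sketch does both.

```python
def verify_event(nsg_id, raw_store, reporting_store, poll, notify_admin):
    """Confirm an escalated event with one active poll, then publish it.

    Returns True if the escalation was confirmed, False if it was not.
    """
    record = raw_store[nsg_id]
    reachable = poll(nsg_id)  # active polling is used only for escalated events
    if reachable:
        # The escalation was not confirmed: correct the raw record and flag
        # a possible configuration error such as an improper threshold.
        record["state"] = "ok"
        notify_admin(f"NSG {nsg_id}: escalation not confirmed, check thresholds")
        confirmed = False
    else:
        record["state"] = "outage"
        confirmed = True
    record["verified"] = True
    # Copy the updated record to the reporting datastore.
    reporting_store[nsg_id] = dict(record)
    return confirmed


raw = {"node7-eol": {"state": "suspect"}}
reporting = {}
messages = []
print(verify_event("node7-eol", raw, reporting, lambda _id: False, messages.append))
print(reporting["node7-eol"])
```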

The reporting datastore serves as the NINM system’s data warehouse. The data warehouse allows ODBC access so network operations personnel can access raw data about the HFC for reporting and trend analysis. This type of open access is preferred over offering some custom interface with a limited scope of functionality. The reporting datastore does however provide some defined functionality such as a basic API for use with standard Top-Level network management systems, an alerts interface for escalation of alarms, and a web interface for exporting various levels of information on internal web sites.

The alert function interfaces with a trouble ticketing system so that escalated events create a trouble ticket for tracking purposes. This interface may also be used to page the on-call pager with the type of problem, its physical location, and the assigned trouble ticket number.

The web server provides multi-level access to the information stored in the reporting datastore along with on-demand access to individual records stored in the raw datastore. This allows the web server to support troubleshooting by network operations personnel as well as by technicians and plant operations personnel in the field. Multi-level access allows individuals to drill down to the level they desire (see Figure 1.1) to determine the source of a problem or verify a fix. Multi-level access also allows web sites to refer users to other web sites, perhaps located at the enterprise level or at other regional offices. This allows one to traverse any regional broadband system within the internal network.

Figure 1.1 Drill-Down Hierarchy

In summary, the NINM system provides end-of-line monitoring visibility while integrating seamlessly into a top-level network management system. Ideally, the NINM system would not require dedicated network management systems in the network operations center. Rather, the NINM system would provide an element manager that could be loaded into the top-level network management system to gain access to the information contained in the reporting datastore.

Network Status Generator (NSG):

The NSG could be nothing more than a cable modem housed in a small enclosure (perhaps similar to a filter). More elaborate versions of the NSG could also be built within the same basic operational framework as technology further integrates the functionality of a cable modem onto silicon. The NSG would be plant powered (similar to telephony units) and would connect via an RF fitting to a tap without restricting other drop connections to dwellings served by the tap. The NSG would be provisioned like any other cable modem; however, it would use a different boot file to configure its operational behavior and would require that its physical plant location be provisioned along with its Media Access Control (MAC) address. The physical location information would be stored in the NSC’s raw datastore so that analysis can collate results where several NSGs are deployed on a single HFC node. This information would also allow alerts to direct plant operations personnel to the exact HFC node (or node branch) in question.

The NSG is mainly a chatterbox: all it does is send timely messages about its operational status to an IP address specified in the NSG’s boot file. These messages may contain some or all of the following information:

  • IP address of the NSG
  • Network and subnet mask of the NSG
  • Tx/Rx power levels
  • A file of specified size, sent to perform periodic speed testing
  • Time the file is sent (in UTC)

The NSG may receive the following configuration parameters via its DHCP request:

  • IP address
  • Gateway IP address
  • Subnet mask
  • TOD server IP address
  • TFTP server IP address
  • TFTP boot file name

The NSG may receive the following configuration parameters via its boot file:

  • IP address of NSC and port* (optional)
  • NSC callback interval (in seconds)
  • NSC callback type (udp/tcpip)
  • Retry interval (in seconds) for use with TCP/IP callback type only [0 = disable/none]
  • IP address of NSC-SC and port* (optional)
  • NSC-SC file upload interval (in seconds) [0 = disable/none]
  • Recycle interval (optional) [0 = disable/none]
  • Lower reporting band frequency for use with recycle interval [0 = disable/none]
  • Upper reporting band frequency for use with recycle interval [0 = disable/none]

* Defining a port may provide a way to run multiple instances of the NSC/NSC-SC and distribute the load of answering all the NSGs across those instances.
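
The boot-file parameters listed above could be modeled as a small configuration object. The source does not define a boot-file syntax, so the field names, the default values, and the choice of hertz for the band limits are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NsgBootConfig:
    """Hypothetical in-memory form of the NSG boot-file parameters."""
    nsc_host: str
    nsc_port: Optional[int] = None       # optional; allows several NSC instances
    callback_interval_s: int = 300       # assumed default, not from the source
    callback_type: str = "udp"           # "udp" or "tcpip"
    retry_interval_s: int = 0            # TCP/IP only; 0 = disabled
    nsc_sc_host: Optional[str] = None
    nsc_sc_port: Optional[int] = None
    upload_interval_s: int = 0           # 0 = no speed-test uploads
    recycle_interval_s: int = 0          # 0 = no scheduled reset
    scan_low_hz: int = 0                 # 0 = no lower reporting band
    scan_high_hz: int = 0                # 0 = no upper reporting band

    def __post_init__(self):
        if self.callback_type not in ("udp", "tcpip"):
            raise ValueError("callback_type must be 'udp' or 'tcpip'")
        if self.callback_interval_s <= 0:
            raise ValueError("callback_interval_s must be positive")

    @property
    def retry_enabled(self) -> bool:
        return self.callback_type == "tcpip" and self.retry_interval_s > 0

    @property
    def speed_test_enabled(self) -> bool:
        return self.nsc_sc_host is not None and self.upload_interval_s > 0

    @property
    def scan_enabled(self) -> bool:
        # A scan needs a recycle interval and both band limits to be non-zero.
        return (self.recycle_interval_s > 0
                and self.scan_low_hz > 0 and self.scan_high_hz > 0)


config = NsgBootConfig(nsc_host="10.0.0.5", nsc_port=16200,
                       recycle_interval_s=86_400,
                       scan_low_hz=5_000_000, scan_high_hz=42_000_000)
print(config.scan_enabled, config.speed_test_enabled, config.retry_enabled)
```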

The preferred implementation is for the NSG to send a User Datagram Protocol (UDP) packet to the NSC as a means of reporting its operational status. Using UDP means less work for the NSC, as it does not have to acknowledge receipt of the packet. However, because UDP messages are not acknowledged, messages could be lost unnoticed if the NSC were insufficiently powered or clustered. In these cases, Transmission Control Protocol/Internet Protocol (TCP/IP) could be selected instead, which would provide an NSC acknowledgement of each received NSG message.
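
A minimal sketch of the UDP reporting path from the NSG side. The wire format is an assumption: the source does not specify one, so JSON is used here purely for readability, and the field names and the dBmV unit are hypothetical.

```python
import json
import socket
import time
from typing import Optional


def encode_status(ip: str, netmask: str, tx_dbmv: float, rx_dbmv: float,
                  sent_utc: Optional[float] = None) -> bytes:
    """Build one status message as compact JSON (assumed format)."""
    message = {
        "ip": ip,
        "netmask": netmask,
        "tx_dbmv": tx_dbmv,
        "rx_dbmv": rx_dbmv,
        "sent_utc": time.time() if sent_utc is None else sent_utc,
    }
    return json.dumps(message, separators=(",", ":")).encode("utf-8")


def decode_status(payload: bytes) -> dict:
    """Inverse of encode_status, as the collector would apply it."""
    return json.loads(payload.decode("utf-8"))


def send_status(payload: bytes, nsc_host: str, nsc_port: int) -> None:
    """Fire-and-forget: UDP gives no acknowledgement that the NSC received it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (nsc_host, nsc_port))


payload = encode_status("10.1.2.3", "255.255.255.0", tx_dbmv=42.5, rx_dbmv=-3.0)
print(len(payload), "bytes:", decode_status(payload)["ip"])
```

Calling `send_status(payload, nsc_host, nsc_port)` once per callback interval would complete the NSG side; a TCP/IP variant would instead open a connection and wait for the acknowledgement.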

It is important that the NSC process an extremely high percentage of the NSG messages, because the absence of a particular NSG’s message is interpreted by the NSC analysis as a potential outage. Although the NSC analysis is equipped to correct for dropped messages (through its verification process), the more NSG messages the NSC handles (ideally all of them), the less active polling the NSC analysis must initiate to verify potentially unreachable NSGs.

The NSG’s file upload function must construct a sequence of packets forming a file large enough for the upload time to be measured reasonably. There are two ways this could be implemented. The first requires the NSG and the NSC-SC to maintain highly precise time (at least to hundredths of a second). If both the NSG and the NSC-SC synced time off the same TOD server, a smaller file (or a single packet) could serve as the basis for the test. The second implementation still requires both the NSC-SC and the NSG to be synced to a TOD server. However, they would work with a larger upload file whose transfer takes long enough to yield a reliable transmission speed.

The NSG has the capability to perform a scheduled software reset. During the reset operation, the NSG will perform a normal boot (i.e. perform a DHCP Discover, download its boot file via TFTP, and configure itself for normal operation). If the boot file defines a recycle interval (other than zero) and the upper and lower reporting bands are both non-zero, the NSG will go off-line and perform a frequency scan from the lower to the upper limit of the defined range. This information will be saved in memory and sent to the NSC upon completion of the next callback interval. The information saved will be in the form of key-value pairs, one for each frequency and power reading in the defined range. If no range is defined, only the transmit and receive frequencies are saved.
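
The key-value result of the frequency scan might look like the sketch below. The step size is an assumption (the source does not give one), `read_power` is a hypothetical stand-in for the tuner hardware, and recording a power reading for the transmit and receive frequencies in the no-range case is also an assumption.

```python
from typing import Callable, Dict


def frequency_report(low_hz: int, high_hz: int, step_hz: int,
                     read_power: Callable[[int], float],
                     tx_hz: int, rx_hz: int) -> Dict[int, float]:
    """Return {frequency_hz: power_reading} for the next callback message.

    With both band limits non-zero, scan from low to high inclusive.
    Otherwise report only the transmit and receive frequencies.
    """
    if low_hz > 0 and high_hz > 0:
        if step_hz <= 0 or low_hz > high_hz:
            raise ValueError("invalid scan range or step")
        frequencies = range(low_hz, high_hz + 1, step_hz)
    else:
        frequencies = (tx_hz, rx_hz)
    return {freq: read_power(freq) for freq in frequencies}


def fake_tuner(freq_hz: int) -> float:
    """Placeholder reading so the sketch runs without hardware."""
    return freq_hz / 1_000_000


print(frequency_report(5_000_000, 7_000_000, 1_000_000, fake_tuner,
                       tx_hz=30_000_000, rx_hz=600_000_000))
print(frequency_report(0, 0, 1_000_000, fake_tuner,
                       tx_hz=30_000_000, rx_hz=600_000_000))
```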

Network Status Collector (NSC):

The NSC represents a robust multi-threaded listener capable of handling multiple simultaneous requests. The NSC’s job is to quickly and efficiently handle incoming messages sent from NSGs. Since the NSGs are capable of sending to different instances of NSCs (running on different ports) the NSC could represent one or more (e.g. a cluster) similar applications all tasked with handling each and every NSG message.

The NSC saves information contained in every NSG message received in its raw datastore. Unprocessed or dropped NSG messages must be corrected by the follow-on NSC analysis process which verifies detected problem states before escalating them.
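
One way to sketch such a listener is with the standard library's threaded UDP server. The JSON payload, the port number, keying records by source IP address, and the dictionary standing in for the raw datastore are all assumptions of this sketch. In keeping with the design goals, the handler stores the message and does no analysis.

```python
import json
import socketserver
import threading
import time

_STORE_LOCK = threading.Lock()


def store_message(payload: bytes, source_ip: str, received_utc: float,
                  store: dict) -> bool:
    """Save one NSG message keyed by source; return False if it is malformed."""
    try:
        message = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    if not isinstance(message, dict):
        return False
    message["received_utc"] = received_utc
    with _STORE_LOCK:
        store[source_ip] = message
    return True


class NsgHandler(socketserver.BaseRequestHandler):
    """Runs in its own thread for each incoming datagram."""

    def handle(self):
        payload, _socket = self.request  # UDP servers pass (data, socket)
        store_message(payload, self.client_address[0], time.time(),
                      self.server.store)


def make_collector(store: dict, host: str = "0.0.0.0",
                   port: int = 16200) -> socketserver.ThreadingUDPServer:
    """Create (but do not start) a collector bound to a hypothetical port."""
    server = socketserver.ThreadingUDPServer((host, port), NsgHandler)
    server.store = store
    return server


raw_datastore = {}
ok = store_message(b'{"ip":"10.1.2.3","tx_dbmv":42.5}', "10.1.2.3",
                   time.time(), raw_datastore)
print(ok, sorted(raw_datastore["10.1.2.3"]))
```

Calling `make_collector(raw_datastore).serve_forever()` would start listening; several instances on different ports could share the load, as the port option in the boot file suggests.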

Network Status Collector – Speed Checker (NSC-SC):

The NSC-SC processes speed-checking files sent to it by NSGs. Since this process could potentially impact bandwidth, the NSG would not perform this task as frequently as it sends status update messages to the NSC.

The NSC-SC processes files sent to it by noting the time it was received from the NSG, determining the size of the packet or message, and then calculating the speed of the transmission. The speed of the transmission is calculated using the following formula:

Transmission speed = size of file or packet / (time received by NSC-SC - time sent from NSG)

This information is then saved in the raw datastore. Having it allows HFC transfer data rates in the all-important return path to be tracked for each and every node in a system.
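
The formula above translates directly into code. The function name and the choice of bits per second as the unit are assumptions; the source leaves the unit open. Both timestamps are taken to be UTC seconds from clocks synced to the same TOD server.

```python
def transmission_speed_bps(size_bytes: int, sent_utc: float,
                           received_utc: float) -> float:
    """Speed = size / (time received by NSC-SC - time sent from NSG)."""
    elapsed = received_utc - sent_utc
    if elapsed <= 0:
        # A non-positive interval suggests the two clocks are out of sync.
        raise ValueError("received time must be later than sent time")
    return size_bytes * 8 / elapsed


# Hypothetical upload: 1,000,000 bytes received 4 seconds after it was sent.
speed = transmission_speed_bps(1_000_000, sent_utc=100.0, received_utc=104.0)
print(f"{speed / 1_000_000:.1f} Mbit/s")
```

Because the elapsed time is the denominator, a clock error of a few hundredths of a second matters far more for a single packet than for a file that takes several seconds to upload, which is the trade-off between the two implementations described for the NSG.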

(C) Copyright Birds-Eye.Net, All rights reserved.
It is against the law to reproduce this content or any portion of it in any form without the explicit written permission of Birds-Eye Network Services, LLC. Federal copyright law (17 USC 504) makes it illegal, punishable with fines up to $100,000 per violation plus attorney's fees.