Birds-Eye.Net
All things broadband and more...
 
Broadband Network Management Requirements for High Speed Data Internet Service

By: Bruce Bahlmann - Contributing Author

Created: December 6, 1999


1.0 Introduction:

1.1 Overview:

The monitoring tools used today for high-speed data (HSD) Internet service are nearly the same as they were three years ago. Meanwhile, the deployment of the Data Over Cable Service Interface Specification (DOCSIS) has introduced new monitoring requirements and complexity. The goal of this document is to develop a list of requirements that regions feel will enable them to monitor the HSD service effectively. These requirements are not meant to specify a full-fledged network management system (e.g. for video, telephony, etc.), only the requirements needed to monitor HSD services. For details regarding specific regional comments and/or HSD monitoring systems in use, refer to the Notes from conversations with regional NOC staff document. Region-specific feedback included in these requirements is signified by a bracketed region tag (for example, [Atl]).

1.2 Background:

Monitoring is a means of gathering and examining information about something in an effort to maintain service quality/reliability. The frequency and scope of information gathered depends greatly on what is being monitored and the way it is being monitored. There are several different ways of monitoring something including:

  • Watching for change (improvement or deterioration) in something, perhaps beyond certain thresholds
  • Watching for the absence or presence of something
  • Watching for some specific event or sequence of events to take place
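
The three ways of monitoring listed above can be sketched as small checks. This is only an illustration: the function names, the transmit-power figures, and the event names are hypothetical and do not come from any particular monitoring product.

```python
from typing import Iterable, Optional, Sequence


def threshold_breach(value: float, low: float, high: float) -> Optional[str]:
    """Watch for change beyond thresholds: report which bound was crossed."""
    if value < low:
        return "below"
    if value > high:
        return "above"
    return None


def is_absent(expected: str, seen: Iterable[str]) -> bool:
    """Watch for absence or presence: True when the expected item was not seen."""
    return expected not in set(seen)


def sequence_occurred(pattern: Sequence[str], events: Sequence[str]) -> bool:
    """Watch for a sequence of events: True when the pattern appears in order
    (not necessarily back to back) within the event stream."""
    remaining = iter(events)
    # Each membership test consumes the iterator, so order is enforced.
    return all(step in remaining for step in pattern)


# Hypothetical usage: a transmit level outside an assumed 35-50 window,
# a modem missing from the last poll, and a down/up/down flap pattern.
print(threshold_breach(52.0, 35.0, 50.0))
print(is_absent("modem-17", ["modem-03", "modem-09"]))
print(sequence_occurred(["link_down", "link_up", "link_down"],
                        ["link_down", "poll_ok", "link_up", "link_down"]))
```

In practice all three checks tend to be combined, for example a threshold check whose repeated breaches form the sequence being watched for.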

Determining what to monitor requires an understanding of the "system" and of which collection of items (or network elements) provides the best visibility into the inner workings (or operations) of the system. Fault, Configuration, Accounting, Performance, and Security (FCAPS) is a network management model used industry-wide to provide good visibility into the operation of most services. The FCAPS model (whose elements are defined in Table 1.0) recommends particular network management areas of focus that enable one to maintain good service quality/reliability.

Management Type    Network Management Area of Focus
Fault              Monitoring network failures
Configuration      Monitoring the physical and logical inventory of network equipment
Accounting         Monitoring how services are billed (e.g. telephony bills built from call records)
Performance        Monitoring degraded network performance (e.g. dropped calls, blocked calls, snow on channel 2, etc.)
Security           Monitoring network access and the security of the network (e.g. audit trails on UNIX servers)

Table 1.0 FCAPS Network Management Model

As a result of monitoring areas of focus (such as those in FCAPS) in the ways discussed previously, one might hope to trigger one of the following:

  • Proactive response – Information that enables something that is deteriorating to be detected, allowing repairs to be made before it fails and thus eliminating a potential service interruption.
  • Reactive response – Information that enables something that is now broken to be detected, allowing repairs to commence independently of customer service calls.

The responses above depend on the proper configuration of all network elements within the network management visibility. If any network element is not properly configured, an erroneous response (or the lack thereof) could result. The proper configuration of individual network elements enables higher-level network management to occur. Higher-level network management forms relationships between two or more triggered events, allowing one to identify a probable cause, which may well be something that has not yet triggered. This level of analysis requires sophisticated network management software and, most importantly, extensive knowledge and experience of all network elements within the network management visibility.
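
One simple way to relate two or more triggered events to a probable cause is to look for the nearest element they all depend on. The sketch below assumes a hypothetical dependency map (modems on fiber nodes, fiber nodes on a CMTS upstream port); the element names are invented for illustration, and real correlation engines apply far richer rules.

```python
from typing import Dict, List, Optional, Set

# Hypothetical topology: each element maps to the element it depends on.
PARENT: Dict[str, Optional[str]] = {
    "modem-a": "node-12",
    "modem-b": "node-12",
    "modem-c": "node-15",
    "node-12": "cmts-1/us2",
    "node-15": "cmts-1/us2",
    "cmts-1/us2": None,
}


def path_to_root(element: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of elements the given element depends on, nearest first."""
    chain: List[str] = []
    current = parent.get(element)
    while current is not None:
        chain.append(current)
        current = parent.get(current)
    return chain


def probable_cause(alarmed: List[str], parent: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the nearest element that every alarmed element depends on, if any."""
    if len(alarmed) < 2:
        return None  # correlation needs at least two triggered events
    first, *rest = alarmed
    shared: Set[str] = set(path_to_root(first, parent))
    for element in rest:
        shared &= set(path_to_root(element, parent))
    # Walk the first element's chain nearest-first so the closest shared ancestor wins.
    for candidate in path_to_root(first, parent):
        if candidate in shared:
            return candidate
    return None


print(probable_cause(["modem-a", "modem-b"], PARENT))  # both sit on node-12
print(probable_cause(["modem-a", "modem-c"], PARENT))  # only the CMTS port is shared
```

Note that the suspected element need not have raised an alarm itself, which matches the point above that the probable cause may be something that has not yet triggered. The sketch only considers upstream dependencies of the alarmed elements, not the alarmed elements themselves.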

The importance of FCAPS as it relates to HSD is that it provides the network management areas of focus required to sufficiently monitor the service.

1.3 HSD Monitoring Architecture:

There are several components of the HSD delivery system that are critical to providing reliable service. Some of them are under our control (plant, headend) and some are not (customer’s home, servers, regional network, etc.). Before one can monitor HSD (or establish requirements for monitoring), one must evaluate the various points in the system where service interruptions can occur and prioritize the effort that must be spent to monitor them effectively. The components of the HSD system (ranked from smallest to largest in terms of the scale at which each may affect customers) are as follows:

  1. Customer premise
  2. Node segment
  3. Node wide
  4. Network
  5. Server
  6. Third party (not location specific)

Figure 1.0 describes the delivery system for HSD indicating possible locations of the service interruptions described above (for a more detailed explanation of HSD service interruptions refer to the Guide to HSD Service Interruptions document). The boxes drawn around groups of network elements represent their respective ownership.


Figure 1.0 High Speed Data Service System

If one were able to monitor in such a way that all these components were covered, the result would be comprehensive network management of the HSD service. However, no single network management system exists today that provides this comprehensive coverage. Therefore, we must evaluate the service interruptions above and continually focus our attention on progressively covering each component of the HSD service system until all sources of service interruption are addressed by (visible to) network management.

2.0 Goals:

The primary goal of this document is to derive a list of requirements needed to sufficiently monitor HSD services. Providing this visibility also benefits other services such as analog video, pay-per-view (PPV), digital video, and telephony. This benefit results from all services sharing the same medium: what affects one service on a shared medium often affects the others. Requirements for monitoring HSD services have the following goals:

  • Focus on areas that are single points of failure and could affect large numbers of customers. Certain areas have built-in redundancy and may not require immediate attention.
  • Monitoring requirements "should" be centered on converged services and interoperate with the existing network management system (we do not want a separate monitoring system for every service).
  • Try to buy off the shelf. We do not want any custom-built system; too many canned applications are available. Keep it simple. [Atl]
  • The monitoring system should meet all the requirements of the customers of this system. These customers include Tier 1, Tier 2, field operations, plant operations, NOC, management, individual subscribers, and corporate.

Note: These goals may be satisfied in a phased fashion depending on the relative priorities of the monitoring requirements and the availability of commercial software that meets a majority of these requirements. If a phased project is accepted, requirements with the highest priority will be addressed first.

3.0 Requirements:

Requirements are organized in the table below in no particular order. Each requirement is stated in terms of its associated area, relative priority, phase in the project, and any assumptions that must be noted. Priorities are noted as 0 (critical), 1 (high), 2 (medium), or 3 (low). A dash marks a cell that was left empty.

DOCSIS Monitoring Requirements | Priority | Phase | Assumptions
1.0 Application
  Highly available such that all components must be redundant | 1 | 1 | -
  Reuse existing equipment where possible | 2 | - | -
  Open architecture that permits bundling with other vendors’ databases (e.g. Oracle) | 1 | 1 | -
  Multi-platform support for Solaris & NT (X Windows) | 2 | 1 | Solaris before NT
  User interface for the system will be standard to the user’s underlying operating system or will be web based | 1 | 1 | Web based preferred
2.0 System Architecture
  Each service component must have an element manager | 2 | - | Must support SNMP & have published open MIBs and
  Ability to send information from the element manager to a top-level system (Media Vantage, OSI NetExpert) | 0 | 1 | Must support SNMP trap forwarding/agent functionality
  Must communicate via TL1 (telephony standard) or SNMP – no proprietary communications between element, element manager, and top-level system | 1 | 1 | -
  Access to database (via ODBC) | 2 | 2 | -
  We need to be able to set clears against attributes (e.g. clear packet counters for inoctets, inerrors, etc.) | - | - | Should be a normal function of SNMP?
  Needs to have a set of rules or a knowledge base so it can proactively detect security, capacity, and network problems | 2 | 2 | -
  Any HTML pages generated by the application must allow for easy integration with other systems (e.g. key/value pairs) | 1 | 1 | -
  Will provide access to the configuration files | 1 | 1 | -
  Configuration files will allow the basic functional environment and network element models to be standardized and imported | 1 | 2 | -
  Will have the ability to export individual element model configurations so that regions can share work-in-progress element models | 2 | 2 | -
  Will have the ability to create generic escalation processes that can be triggered by referencing their index | 2 | 2 | Allows sharing of escalation events among regions
  Will be able to integrate with other network elements (e.g. ChetaNet/PathTrack, which monitor headend-based or internal related problems – power, connectivity, etc.) | 1 | 1 | Required function of top-level management system. Not normally found in an element manager
  Will provide multiple interfaces to the same system that customers would use – perhaps only different views for customers as opposed to employees | 3 | 3 | -
  Would provide multiple security options – but would allow outside access (tunneling, for example) | 1 | 1 | -
3.0 Fault Detection
3.1 Prioritization
  Multiple outages must be able to be prioritized by customer count | 1 | 1 | Required function of top-level management system. Not normally found in an element manager
  Multiple outages must be able to be prioritized by customer type (e.g. residential, school, commercial, etc.) | 1 | 1 | -
  Escalated notification: the alarm is sent to the monitoring screen; depending on the rule, if the condition is still there after some period of time, email someone; if the condition is there beyond that, do something else (perhaps some action) | 2 | 2 | Required function of top-level management system. Not normally found in an element manager
3.2 Alarm Handling
  Alarm or event correlation | 1 | 1 | -
  Event and/or alarm filtering – ability to filter alarms from devices that are the result (rather than the cause) of an alarm | 1 | 1 | -
3.3 Indicators
  Will be able to poll transmit/receive power levels | 1 | 1 | -
  Will be able to poll noise floor levels | 1 | 1 | -
  Will be able to poll bit error rates | 1 | 1 | -
  Will be able to poll S/N ratios | 1 | 1 | -
  Will be able to poll stay-alive status (ping) | 1 | 1 | -
  Will be able to poll length of uptime (percentage running) | 1 | 1 | -
  Will be able to poll packet loss | 1 | 1 | -
  Visibility into the router network – to know when routers are in trouble (e.g. CMTS, hub router, etc.) | 1 | 1 | Or access to RR events related to this issue
3.4 Thresholds
  Will be able to establish multiple thresholds for each network element and each polling characteristic | 1 | 1 | -
Note(s): Standardized performance parameters: 2000 ppe and under is rolled as an outage; 2000-10,000 ppe with correlated service calls is rolled as an outage; 10,000+ won’t alarm at that threshold
3.5 Correlation
  Which node is bringing the ingress/noise into the spectrum? Fault to the node level (is a problem when combining) | 1 | 1 | -
  We need a standard naming convention to support the CMTS upstream to HFC node relationship. If I see a problem on card 3, upstream port 2 on a CMTS, we need to know which HFC nodes are impacted in order to dispatch a fix agent and notify the call centers | 1 | 1 | -
  Must be capable of providing input to other service assurance tools | 1 | 1 | -
  Fault level correlated to the common active utilizing Electrical Path Indicator | 1 | 1 | -
  Any address information involved must be in two components, house number and street address | 1 | 1 | -
  Will keep track of the number of active customers by CMTS | 2 | 2 | -
  Will keep track of the number of active customers by fiber node | 2 | 2 | Assuming a matrix of nodes to CMTS exists
3.6 Troubleshooting
  We would like the tool to have the ability to model a customer modem. If a customer is having an intermittent issue, we model his modem and his neighbors’ modems in the tool. The tool uses SNMP to monitor the modems’ health. We then have historical data that we can use to determine whether this is a single-customer or plant issue: packet loss, block/sync loss, or a fluctuating levels problem | 1 | 1 | Periodic monitoring of customer modems for troubleshooting
  Will be able to discover devices at the network level (a combined HFC segment) | 1 | 1 | -
  Will be able to discover devices at the node level (isolate individual upstream channels) | 2 | 2 | Assuming one can map fiber nodes to upstream ports
  Similar to the use of bridging tables to generate a relative customer listing for spot checks | 3 | 3 | -
  What percentage is working, what percentage is not working | 1 | 1 | May need to expand requirements
3.7 Individual Subscriber
  Will have the ability to model individual subscribers | 1 | 1 | -
  Will be able to look up the subscriber address of individually modeled subscribers | 1 | 1 | -
  Will have the ability to display an individual subscriber’s status (e.g. disconnected, active, pending install, etc.) | 3 | 3 | -
3.8 Archiving
  Will have the ability to selectively archive individual subscriber operational statistics for troubleshooting | 3 | 3 | -
  Will have the ability to archive operational network statistics for capacity planning of both the HFC and backbone networks | 2 | 2 | -
  Alarms must be archived to permit one to look at trends in the data (in terms of time of day, frequency, etc.) | 1 | 1 | -
  Any data collected can be archived by time of day and temperature | 3 | 3 | The sooner the better for leveraging historical data
3.9 Integration/Reporting
  Will be able to customize reports (on the fly) | 2 | 2 | Will require additional information to implement
  Will be able to run reports combining modeled customer statistics and billing status (e.g. commercial accounts) | 2 | 2 | -
  Will be able to selectively generate customizable reports on similar types of network elements (CMTS, CM, etc.) | 3 | 3 | If a common db is used this should be doable
  Will be able to generate customized network availability (uptime) statistics | 2 | 2 | -
  Will be able to selectively export (at customizable frequencies) customized reports to a self-refreshing web page | 2 | 2 | -
  Will accept external HTML templates for reports | 2 | 2 | -
  Will be able to summarize Boolean data in terms of percentage | 2 | 2 | -
  Will be able to summarize real-numbered data in terms of average, low, and high | 2 | 2 | -
4.0 Configuration:
  Ability to clarify what is installed and where it is | 1 | 1 | Basic configuration/inventory management. The ability to query where a customer modem is located in the network by address, Lat/Long, etc.
  What the system tells us when it detects an abnormal condition | 3 | 3 | Need to expand requirements
  Need for region-specific requirements or look and feel. Should there be an event screen (need to post to a newsgroup; when an event occurs this information shows on the screen)? | 1 | 1 | -
5.0 Accounting:
  Monitoring the box that communicates with the billing system gateway | 1 | 1 | Need to cross-check against billing to verify that all active modems are active in billing
6.0 Performance:
  Network congestion information (number of times collisions occurred and data had to be resent) – true collisions (Ethernet). If 20% of users are online at the same time there are going to be collisions or bandwidth problems | 1 | 1 | Great diagnostic feature
  Want to be able to monitor the HSD system for acceptable and unacceptable performance | 1 | 1 | -
  The ability to send data downstream and then upstream, like a loopback feature on customer cable modems | 2 | 2 | The sooner the better
  Breakdown between single-customer problem and node-to-home problem | 2 | 2 | -
  Measure, check, threshold, and alarm it | 1 | 1 | -
  How many times customer modems are rebooting (a.k.a. flap lists) | 1 | 1 | -
  Whether a particular modem has dropped block sync more than some number (#?) of times over a 24-hour period | 2 | 2 | -
  Round trip time (throughput) – if we know the round trip time is normally 6 ms, detect when it goes beyond some threshold | 3 | 3 | -
  Only alerts on thresholds | 1 | 1 | -
  Occasions where we want to tie in alerts with other systems | 2 | 2 | -
  Will have the ability to do surveillance from a performance perspective: if a customer is experiencing slow speeds, tools are available to allow customers to troubleshoot their own problem | 3 | 3 | Need to expand requirements
Note(s): Can’t totally rely on customer modems because they are not plant powered
7.0 Security:
  Will have the ability to assign access and privileges to every object and process in the system | 1 | 1 | -
  Will have the ability to audit every security event | 1 | 1 | Need to be able to track a hacker
  Will have the ability to limit user access – must be able to determine who individual users are | 1 | 1 | -
  Will have traceability – must be accountable | 1 | 1 | -
  Will have the ability to enforce password changes | 2 | 2 | -
  Will have the ability to maintain valid accounts | 1 | 1 | -
  Will have the ability to age accounts | 1 | 1 | -
  Will support permission levels | 1 | 1 | -
  Will provide customized logging of activity | 1 | 1 | -
  Will support protected views that allow view-only access to prevent changes while an item is being modified by another individual | 3 | 3 | -
  Will support access via some secure method (tunneling or dialup) | 1 | 1 | -
8.0 Support/Other:
  Must include system and training documentation | 1 | 1 | -
  Will be able to send proactive alerts to inform users of changes in availability – installers, sales, plant operations, dispatch, etc. | 1 | 1 | -
  Alarms will have the option to be audible | 1 | 1 | -
  Ancillary or other additional output can interface with Microsoft MapPoint 2000 (or equivalent) and display cable modems in their last polled state | 3 | 3 | All customer cable modems are polled continuously
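
The prioritization and escalation requirements in the fault detection section can be illustrated with a short sketch. Everything specific here is an assumption made for the example: the ranking of customer types, the choice to sort by type first and customer count second, and the 15 and 60 minute escalation periods are not taken from the requirements themselves.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical ranking: a lower number means a more urgent customer type.
TYPE_RANK = {"commercial": 0, "school": 1, "residential": 2}

# Hypothetical escalation ladder: (minutes the condition has persisted, action).
ESCALATION_STEPS = [
    (0, "show on monitoring screen"),
    (15, "email on-call staff"),
    (60, "page supervisor"),
]


@dataclass
class Outage:
    element: str
    customer_type: str
    customers_affected: int
    minutes_active: int


def prioritize(outages: List[Outage]) -> List[Outage]:
    """Order outages by customer type first, then by largest customer count."""
    return sorted(
        outages,
        key=lambda o: (TYPE_RANK.get(o.customer_type, len(TYPE_RANK)), -o.customers_affected),
    )


def escalation_actions(outage: Outage) -> List[str]:
    """Return every action whose waiting period has elapsed for this outage."""
    return [action for minutes, action in ESCALATION_STEPS if outage.minutes_active >= minutes]


outages = [
    Outage("node-12", "residential", 180, 5),
    Outage("node-15", "commercial", 12, 20),
    Outage("node-20", "residential", 240, 70),
]
for outage in prioritize(outages):
    print(outage.element, escalation_actions(outage))
```

Keeping the escalation ladder as data rather than code is one way to let regions share escalation processes by index, as the system architecture requirements suggest.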

4.0 Recommendations/Issues:

4.1 Customer premise monitoring:

The information gained from sampling customer cable modems in a specific area can tell us about the health of the plant in that area (see #1 in Figure 1.0). Unfortunately, unless this is done right it can also lead to inconclusive information. Attempts to use customer devices (i.e. customer cable modems or telephony NIUs) as a means of detecting partial node outages depend greatly on a high level of penetration of these services on fiber nodes. If we do wish to monitor the customer premise we should be aware of the following tradeoffs:

  • Customer premise monitoring allows us to utilize customer purchased devices (no additional hardware) as well as the same network management station used to poll other HSD resources in exchange for basic node outage detection.
  • Coverage for all nodes may be years away. Take a city of 30,000 with a penetration of 5% for HSD and 30% for telephony in an area where the maximum number of customers per node is 250. A total of 10,500 customers would have to be polled to determine partial node service interruptions on the city’s 120 nodes. This would mean that approximately 88 subscribers per node would provide sufficient information to enable us to detect partial node outages. While this many subscribers could reasonably be expected to give good coverage across a whole node, most regions are nowhere near these numbers for both services (HSD and telephony).
  • Correlating polling information with physical location, so that one can draw conclusions from devices on similar stretches of plant, is also a problem. Polling customer modems in an area is further complicated in the DOCSIS model by combining, because customers with the same network address or CMTS interface may all be on separate fiber nodes. Thus any polling that is done among them "may" not be usable. For example, two down subscriber devices could mean that two node segments are down or that only one is down (depending on how the plant runs through the community).
  • There is also an issue regarding the interval of polling. For example, if multiple modems are polled, the period of time between polls of modems geographically close to one another may vary. If this time varies too much, the samples taken from the modems have no relationship to one another.
  • There is also the subject of scalability in terms of polling an increasing number of NIUs and cable modems to determine partial node outages. Since this number is directly proportional to the number of customers it will require an increasing amount of resources to manage this system.
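
The coverage figures in the example above can be reproduced with a few lines of arithmetic. The sketch treats the city's 30,000 as potential customers and, like the text, simply adds the two penetration rates (that is, it assumes no customer takes both services); the function name is made up for illustration.

```python
def node_coverage(potential_customers: int, hsd_pct: int, telephony_pct: int, max_per_node: int):
    """Return (number of nodes, devices to poll, average polled devices per node)."""
    nodes = -(-potential_customers // max_per_node)  # ceiling division
    polled = potential_customers * (hsd_pct + telephony_pct) // 100
    return nodes, polled, polled / nodes


nodes, polled, per_node = node_coverage(30_000, hsd_pct=5, telephony_pct=30, max_per_node=250)
print(nodes)     # 120 nodes
print(polled)    # 10500 devices to poll
print(per_node)  # 87.5, which the text rounds to approximately 88
```

Running the same function with HSD penetration alone (telephony_pct=0) gives 12.5 devices per node, which shows how strongly the result depends on both services being widely deployed.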

4.2 Monitoring standardization, coordination, and information sharing:

Ideally, one could proactively fix everything so that customers would never experience a service interruption. Experience with equipment is one of the best ways to increase the number of "potential" problems that can be detected and then proactively fixed. This requires time and information sharing, so that as new information is learned about potential problems it can be distributed to others and all can benefit. However, lightning does strike, things fail, and unknown problems develop. In these cases one must resort to a reactive response, locating and fixing the cause of the service interruption as quickly as possible. Reviewing events leading up to reactive responses (during post-mortem meetings) allows network management staff to begin to understand "potential" causes of these service interruptions. These causes can then either be modeled in a lab for further examination or added to the list of things monitored in an attempt to detect this event in the future (hopefully prior to its "projected" outcome – a service interruption).

An MSO's network management organization could greatly benefit from standardization, coordination, and sharing of information. Regions should align themselves with a standardized set of items they monitor for each service. Each of these standard items would be well tested and understood (i.e. a reliable means of monitoring). Additional monitoring items should be coordinated via the national NOC so that all "new" items under development (or in testing stages) are tracked for degree of success. When an item has proven some degree of success it should be shared with other regions to enable more extensive testing/refinement. Perhaps after some qualification period it is added to the standardized list of items. The emphasis should be on providing each region with the same visibility of service monitoring. If a region does not have the staff to explore additional monitoring items (via coordination from the national NOC), it will at least maintain the standard (but growing) list of items maintained by the national NOC.

A decision should be made to coordinate the deployment and ongoing maintenance of a standardized HSD monitoring system. If a standardized system is not selected, the size and scope of the project will increase.

4.3 Exposure to fiber cuts:

Regardless of our ability to manage equipment failures, one area where we are completely exposed is fiber cuts (see #3 in Figure 1.0). Often, many runs of fiber from the hub to the subscriber community traverse areas that are inaccessible to plant operations personnel with standard equipment (e.g. underground wire or bridge crossings) or inaccessible for some other reason (e.g. a busy intersection). This fiber is also a single point of failure, meaning it is not redundant. Matters may be further complicated if a cut (break) occurs at night, in poor visibility (fog, rain, snow, etc.), or when there is limited staff available to help find the break (visual examination is the current means of finding breaks). During these times service interruptions can be quite lengthy, as there is no means to direct service personnel to the specific location of the break. One possible solution would be to place an Optical Time Domain Reflectometer (OTDR) on fiber runs where we are sufficiently exposed. Combining OTDR with GPS tracking of the route taken by the fiber run would allow network operations personnel to dispatch service personnel to the exact location of the break and speed repairs by as much as 50% (or by the time it takes to locate the break). This item is not currently recognized as a requirement for HSD monitoring and should be considered if any standardized monitoring strategy is adopted. If this issue is not addressed as part of seeking an HSD monitoring system, one should at least devote resources to a comprehensive study of its impact and possible worst-case exposure. The study would seek to determine the costs associated with fiber cuts in terms of loss of customer confidence, negative publicity, etc. Perhaps the results would influence the way new cable systems are designed, to minimize this type of exposure.
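
As a rough illustration of combining an OTDR reading with a GPS-tracked route, the sketch below interpolates a position from the distance to the break. It assumes a hypothetical surveyed route stored as waypoints with strictly increasing cumulative fiber distance; the coordinates and the function name are invented, and slack loops or splice storage would add error in practice.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical surveyed route: (cumulative fiber distance in meters, latitude, longitude).
Waypoint = Tuple[float, float, float]


def locate_break(route: List[Waypoint], otdr_distance_m: float) -> Tuple[float, float]:
    """Interpolate the GPS position of a break reported at otdr_distance_m along the fiber."""
    if not route:
        raise ValueError("route must contain at least one waypoint")
    distances = [d for d, _, _ in route]
    # Clamp readings that fall before the first or beyond the last waypoint.
    if otdr_distance_m <= distances[0]:
        return route[0][1], route[0][2]
    if otdr_distance_m >= distances[-1]:
        return route[-1][1], route[-1][2]
    i = bisect_right(distances, otdr_distance_m)
    d0, lat0, lon0 = route[i - 1]
    d1, lat1, lon1 = route[i]
    fraction = (otdr_distance_m - d0) / (d1 - d0)
    # Straight-line interpolation between waypoints; adequate for short segments.
    return lat0 + fraction * (lat1 - lat0), lon0 + fraction * (lon1 - lon0)


route = [(0.0, 45.0, -93.0), (1000.0, 45.01, -93.0), (3000.0, 45.01, -93.02)]
print(locate_break(route, 2000.0))  # halfway along the second segment
```

The denser the surveyed waypoints (for example at every splice case or street crossing), the closer the interpolated point is to where a crew should start looking.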

4.4 End of line monitoring:

Because customer devices will never represent 100% of the ends of line for each and every fiber node, we need to seek a more reliable measure of this important operational aspect. Knowing the status of ALL ends of line would replace other, incomplete monitoring systems, and this should be one area where we spend much effort. End of line monitoring also represents a fixed number of devices that can be modeled by location. In addition, end of line monitoring would force a rework of the way we label nodes, allowing us to reference specific portions of a node rather than relying on the existing system, which keys on various actives (i.e. amplifiers) along the node. A decision is required on whether to include end of line monitoring as part of the HSD monitoring system effort. Including it would increase the size and scope of the monitoring project, with the benefit of providing unprecedented HFC monitoring reliability. This increase takes into account additional development and hardware deployment.

4.5 Learning and Development (L&D) involvement:

Regions spend an increasing amount of resources training new and existing staff. This training may range from an orientation for new hires on network management basics and systems to updates on new techniques or processes, and it can be expensive. For example, the region may tie up its best people (or at least the ones most able to explain or teach the systems in use) and/or incur costs associated with sending employees to off-site training. These costs could be offset if L&D were a stakeholder in the handoff process of deploying a network management system. L&D could then obtain all necessary materials and documentation for regions to train NOC staff within their own learning centers. L&D could also be responsible for keeping its materials and training up to date with those supplied by the vendor. Inclusion of L&D requires a commitment from regions to work with this organization to supply their training on network management. This decision should be made early in the project to ensure that timelines will be met and that the necessary resources within the L&D organization are allocated.

4.6 Information Technology (IT) involvement:

Regional support for business-critical computer hardware/software typically comes from IT. Regional IT is staffed with sufficient expertise to build and maintain business-critical systems. HSD monitoring is one such system that demands IT involvement to ensure it is maintained properly (i.e. OS updates are installed, applications and data are backed up regularly, user accounts are administered, security is closely guarded, etc.). Regional NOC staff are ill-equipped to deal with day-to-day regional network management operations while also administering their own equipment. It is also difficult to hire and keep individuals with this kind of skill set. Thus a decision to work with regional IT is required early on. Any changes to the current set of applications already supported by regional IT need to be approved. Regional IT may also place additional requirements on the hardware or operating system platform it prefers to support. Delays in regional involvement may lengthen deployment.

5.0 References:

  1. "Guide to HSD Service Interruptions", Revision 0.20, Bruce Bahlmann, 01 December 1999.
  2. "Notes from conversations with regional NOC staff", Bruce Bahlmann, 20 December 1999.
  3. "Centralizing Network Management", a presentation by Don Williams, 12 December 1999.


(C) Copyright Birds-Eye.Net, All rights reserved.
It is against the law to reproduce this content or any portion of it in any form without the explicit written permission of Birds-Eye Network Services, LLC. Federal copyright law (17 USC 504) makes it illegal, punishable with fines up to $100,000 per violation plus attorney's fees.