Birds-Eye.Net
All things broadband and more...
 
Client Experience Monitor
Monitoring your network from the perspective of your customers' perceived experience

By: Bruce Bahlmann - Contributing Author

Created: August 24, 1999

Note: For help designing your client-experience monitoring program, or developing tools to help you implement or improve such a program, contact Birds-Eye.Net.

Overview:

The customer demand for quality Internet access is prompting a change in the way Internet information services (or high-speed data service, HSD) will be marketed in the future. As a result, traditional applications for measuring Internet service will give way to more sophisticated applications that focus on customer experience and quality. An application called a client experience monitor (CEM) has demonstrated the potential to give affiliates the information they need to quantify the level of service they receive from Internet providers and to guide future agreements for continued service. This document explains a working prototype of the CEM and presents a snapshot of the data it has collected.

Background:

Running an information service requires a high degree of technical expertise and, most importantly, consistency. As the Internet reaches an increasing number of customer homes, sustaining the load generated by new customers will require substantial attention from Internet providers. Service delivered as an essentially “always on” connection poses the greatest challenge to maintaining the performance and scalability of core Internet services.

Core Internet services for “always on” connection providers are listed in Table 1.0. Services such as DHCP, BOOTP, TFTP, and NTP provide the basis for a cable modem to function and are of the infrastructure service type. Services such as DNS, FTP, HTTP, NNTP, Ping, SMTP, and Traceroute are of the client service type. The remaining services are used by the Internet provider’s operations (Ops) staff to monitor, sustain, and troubleshoot the services above.

A relationship exists between Internet providers and their affiliates. Affiliates provide information services to customers, of which Internet information service is but one component. Internet providers supply the facilities that enable an affiliate to provide Internet services to its customers. This relationship is governed by a contract called a service level agreement (SLA), among other agreements. The SLA binds the affiliate to the Internet provider and defines the level of service the affiliate can expect in return. Several points within the SLA are of interest to this document. Notably, the “Key Performance Indicators” and “Network Services Conformance” sections state the operational parameters the Internet provider has committed to supplying. Key performance indicators focus on response to outages or escalations, whereas network services conformance is concerned with availability. The rest of this document focuses on availability.

Availability 

One of the commonly used terms in providing Internet service is “availability”. In general usage, available means capable of being obtained or accessible for use. Internet providers use the word availability to signify the level of reliability they intend to provide for the various services they supply. Availability is typically expressed as a percentage, with higher percentages equating to higher reliability.
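To show what an availability percentage means in practice, the sketch below converts between a percentage and minutes of downtime over a measurement period. The 30-day month and the function names are illustrative assumptions, not terms taken from any particular SLA.

```python
# Minimal sketch: relate an availability percentage to downtime minutes.
# The 30-day (43,200-minute) period is an illustrative assumption.

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def availability_pct(period_minutes: float, down_minutes: float) -> float:
    """Percent of the period during which the service was available."""
    if period_minutes <= 0:
        raise ValueError("period must be positive")
    return 100.0 * (period_minutes - down_minutes) / period_minutes


def allowed_downtime(target_pct: float, period_minutes: float) -> float:
    """Maximum downtime (minutes) that still meets a target availability."""
    return period_minutes * (100.0 - target_pct) / 100.0


if __name__ == "__main__":
    for target in (99.0, 99.5, 99.9):
        mins = allowed_downtime(target, MINUTES_PER_30_DAY_MONTH)
        print(f"{target}% availability allows {mins:.1f} minutes of downtime per 30 days")
```

Running it prints downtime budgets of 432.0, 216.0, and 43.2 minutes for 99%, 99.5%, and 99.9% respectively, which shows how much each extra "nine" tightens the commitment.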

The availability projections within the SLA are usually based on the Internet provider’s “best effort” to measure the accessibility of the services it provides. One of the most common tools in use today to measure availability is ping. The ping application sends messages to Internet hosts to determine their operational status. If a host is operational (or “up”), ping reports it as “alive”; if the host is not operational (or “down”), ping reports “no response” or “request timed out”. Although ping is a useful operational tool on the Internet, it is not a reliable means of measuring availability. For example, the host may be up while the application (or service) it supplies is down, in which case availability is reported incorrectly. As a result, there is a difference between application availability (measured via the application’s client) and host availability (measured via ping).
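To make the distinction concrete, the sketch below measures a basic form of application availability by opening a TCP connection to the service's port, which is the first thing a real client does. It leaves ICMP ping out because raw ICMP sockets usually require administrator privileges. The function name and 2-second default timeout are illustrative assumptions; this is not the prototype's code, which used Perl client modules such as Net::Ping and LWP::Simple (see Table 1.0).

```python
import socket
import time
from typing import Optional


def tcp_response_time(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Return seconds taken to open a TCP connection to host:port, or None on failure.

    A host can answer ping while this check fails (for example, the server
    process has crashed), which is the gap between host availability and
    application availability.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Refused connections, timeouts, and unreachable hosts all land here.
        return None
```

A successful connect only proves the service is accepting connections; a fuller check would also complete one client transaction (for example, fetching a web page or a DNS answer) and time that.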

Surprisingly, the Internet provider is often the only party monitoring availability to measure its own compliance with the SLA. The Internet provider does this because the affiliate does not always have the means to do it on its own. However, SLAs typically do not stipulate the type of monitoring (application availability or host availability) required. In the absence of a specific request for a monitoring method, host availability is likely reported by default because it is the easiest to obtain. As a result, the monitoring data reported by the Internet provider often does not reflect the availability a typical customer actually experiences.

Since the affiliate is ultimately responsible for providing the service (or is seen by the customer as responsible for sustaining reliable Internet service), it must seek ways to provide the highest quality service possible. One of the best ways to do so is to pass these requirements along to the Internet provider. The following are some ways to accomplish this:

·        Establish some means of confirming the quality and reliability of the service supplied by the Internet provider.

·        Establish motivations for the Internet provider to seek the highest availability possible.

·        Provide customers with access to current status of various applications, scheduled outage windows, etc.

·        Provide the data needed to make more informed decisions about handling customer trouble calls and coordinating upgrades requested by the Internet provider.

Providing reliable Internet service helps the affiliate in the following ways: 

·        Increased availability (higher reliability) means fewer trouble calls and potentially fewer truck rolls. Every trouble-related call answered is potentially one fewer sales call answered.

·        Increased availability means higher customer confidence in Internet service delivered over cable TV lines, which opens doors for sales in new markets.

·        Increased availability also means more satisfied customers, which translates into greater demand.

The impact that availability has on call volume, truck rolls, and sales is not known at this time. However, a tool that measures availability to the minute could be used alongside call volume records to look for trends and establish a relationship between the two. At the time of this writing, it seems reasonable to expect that call volume and availability are related. Further analysis could potentially derive a cost per customer that the affiliate absorbs as a result of lowered availability. That cost could in turn be used to set the minimum availability level an affiliate will accept. A tool that gives affiliates up-to-the-minute availability figures could thus help them understand the relationship between availability and support costs and reduce the burden that lower availability places on them.
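One simple way to look for the relationship suggested above is a Pearson correlation between daily availability and daily trouble-call counts. This is an illustrative choice of technique, not an analysis the original work performed. A strongly negative coefficient would be consistent with (though not proof of) lower availability driving more calls.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / (sx * sy)
```

In use, `xs` would hold per-day availability percentages from the CEM and `ys` the matching per-day trouble-call counts from the affiliate's call center.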

Providing motivation to Internet providers is key to establishing realistic minimum application service levels. From the history of an Internet provider’s performance, one can establish the average service availability it has provided. This average could then be used to drive the affiliate’s required service availability levels. Combined with the impact studies described above, this could lead the affiliate to offer incentives for the Internet provider to perform above the required service availability, such as a per-customer kickback premium. Likewise, service availability below the required level would result in per-customer service discounts, enabling the affiliate to recover the added support costs caused by lower availability. Such incentives would allow availability to be weighed equally with other methods of evaluating an Internet provider’s performance.
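The incentive scheme above reduces to simple arithmetic. The sketch below assumes a hypothetical contract term: a fixed dollar amount per customer for each tenth of a percentage point that measured availability lands above or below the required level. The rate, the tenth-of-a-point granularity, and the function name are invented for illustration; an actual SLA would define its own terms.

```python
def monthly_adjustment(measured_pct: float, required_pct: float,
                       customers: int, rate_per_tenth: float) -> float:
    """Positive result = premium paid to the provider; negative = discount owed to the affiliate.

    Hypothetical contract term: rate_per_tenth dollars per customer for every
    0.1 percentage point of availability above or below the required level.
    """
    tenths = round((measured_pct - required_pct) * 10)  # nearest whole tenth of a point
    return tenths * rate_per_tenth * customers
```

For example, at a hypothetical $0.05 per tenth of a point for 1,000 customers, 99.7% measured against 99.5% required yields a $100 premium, while 99.2% yields a $150 discount.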

Informative Execution 

Application availability information gives affiliates the means to make informed decisions about escalating calls to the Internet provider’s tier two services, scheduling service calls, and accepting system upgrades. In some cases, this information could even drive affiliate requests for specific application performance upgrades. Making informed decisions is key to saving costs and reducing outages caused by unnecessary upgrades. Targeting capital expenditures to areas of need (a “scratch where it itches” approach to network upgrades) gives Internet providers a means of controlling costs and increasing operational efficiency.

Client performance has not yet reached the mainstream as a driving factor for application availability levels, and “quality” features such as availability and reliability play a limited role in how customers select an Internet information service today. However, as customers’ Internet access choices become more equal in speed, capability, price, and flexibility, “quality” will be what differentiates one Internet information service from another.

As the market for Internet service shifts gears toward quality, affiliates need to be ready to quantify the service levels they want to provide. Work-at-home customers will be among the first to demand the highest possible levels of service and will likely compare options before buying. Access to up-to-the-minute service levels will enable marketing to go after these highly demanding customers. Hence the need for a tool or system that drives up service availability levels and supports affiliates’ continued growth.

| Application | Description | Protocol | Service Type | Min TO | Max TO | TO Used | Perl Module(s) | Port |
|---|---|---|---|---|---|---|---|---|
| BOOTP | Bootstrap protocol | UDP | Infrastructure | | | | | 67s, 68c |
| DHCP | Bootstrap protocol | UDP | Infrastructure | 2 sec | 32 sec | 2 sec | | 67s, 68c |
| DNS | Domain name system | UDP/TCP | Client | NA | 120 sec | 1 sec | Net::DNS, Socket | 53 |
| FTP | File transfer | TCP | Client | | | | | |
| HTTP | The Web | TCP | Client | | | 2 sec | LWP::Simple | 80 |
| NFS | Network file system | UDP/TCP | Ops | | | | | |
| NNTP | Network news | TCP | Client | | | 20 sec | News::NNTPClient, News::NNTPFetchProgress | |
| NTP | Time protocol | UDP | Infrastructure | | | 1 sec | Net::Time | |
| Ping | | ICMP | Client/Ops | 5 sec | | | Net::Ping | |
| SMTP | Electronic mail | TCP | Client | | | | | |
| POP3 | --- | | | NA | 60 sec | 30 sec | Mail::POP3Client | |
| IMAP | --- | | | | | | | |
| SNMP | Network Management | UDP | Ops | | | | | |
| Telnet | Remote login | TCP | Ops | | | | | |
| TFTP | Trivial FTP | UDP | Infrastructure | 2 sec | 8 sec | 6 sec | TFTP.pm | 69 |
| Traceroute | | ICMP/UDP | Client/Ops | | | | | |

Table 1.0 Internet Application Chart

 

| Index | Description |
|---|---|
| c | Client port (if specified) |
| NA | Information not available |
| s | Server port |
| TO | Timeout |

Client Experience Monitoring (CEM) Architecture: 

Design Goals & Hypothesis: 

The goal of the CEM is to perform “client-like” tasks at regular intervals. The CEM stores application response results alongside “traditional” availability tests (pings, which are performed in parallel). This data will enable separate CEM tools to produce periodic reports summarizing compliance with the service level agreement and to produce a client experience rating based on the responsiveness of the applications supplied by the Internet provider.
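Running the client-like test and the traditional ping in parallel can be sketched with a thread pool. Both probes are passed in as plain functions that return a response time in seconds, or None on failure. The function name and the result record are illustrative assumptions; the field names `ts`, `cpng`, and `crsp` are borrowed from the data model described later in this document.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

Probe = Callable[[], Optional[float]]


def poll_once(ping_probe: Probe, app_probe: Probe) -> dict:
    """Run the host (ping) and application probes concurrently and record both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ping_future = pool.submit(ping_probe)
        app_future = pool.submit(app_probe)
        cpng = ping_future.result()
        crsp = app_future.result()
    return {
        "ts": int(time.time()),  # seconds since 1970 (UTC-based epoch)
        "cpng": cpng,            # last ping time, or None if no reply
        "crsp": crsp,            # last application response time, or None on failure
    }
```

Running the probes concurrently keeps a slow application check from delaying the ping sample, so both measurements describe the same moment.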

It is projected that a delta exists between uptime from a client’s perspective and the application availability reported by the Internet provider. The delta will result from application performance degrading to a point that is unacceptable to the client (or that noticeably impairs its ability to use the service). During these periods of degradation the reported application availability will remain unchanged when, in actuality, the application is “effectively down” from the client’s perspective.
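The hypothesis separates a working application from one that answers too slowly to be usable. The sketch below classifies a single poll result from the client's point of view. The acceptability threshold is a hypothetical per-application parameter that an affiliate would have to choose; the state labels are likewise illustrative.

```python
from typing import Optional


def client_state(ping_time: Optional[float], response_time: Optional[float],
                 acceptable_secs: float) -> str:
    """Classify one poll from the client's point of view.

    ping_time and response_time are seconds, or None when no answer arrived.
    acceptable_secs is a hypothetical per-application threshold.
    """
    if response_time is None:
        # Application failed outright; the host may still answer ping.
        return "down (host up)" if ping_time is not None else "down"
    if response_time > acceptable_secs:
        return "effectively down"  # answered, but too slowly to be usable
    return "up"
```

A ping-only monitor would count both "effectively down" and "down (host up)" as available; those two states are where the projected delta would appear.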

It is also projected that a relationship exists between call volume and application availability. An increase in call volume resulting from a decrease in availability would be evidence of an additional metric that should be considered in the SLA, since its cost is currently absorbed by the affiliate.

Additionally, it is projected that during application outages, the reported availability of these applications will fail to depict the actual accessibility of the Internet provider’s resources because of the resulting increase in load. Instead, the application is “effectively down” much longer from the client’s perspective.

The CEM and its data will seek to give affiliates a reliable means to monitor the Internet provider’s compliance with the SLA. Client experience monitoring will strive to eliminate potential bottlenecks and single points of failure to provide the most accurate measurement possible. The CEM will also seek to establish a range of “acceptable” client experience ratings. This range is expected to raise the bar on the Internet provider’s application performance by quantifying the affiliate’s demands for higher service quality and capacity.

Prototype:

The design of the CEM prototype* (or alpha module) is very simple. The steps to build the CEM are as follows:

·        Obtain a list (Table 1.0) of all the applications required to maintain Internet access for clients

·        Create simple clients for each of these applications (most are publicly available)

·        Combine all clients into a single application capable of testing all application types

·        Create a data model that supports the CEM design goals

·        Create a user interface to enter applications into the system

·        Create a user interface to display the status of the applications

·        Isolate the CEM from the RF plant, where it could be affected by affiliate-controlled assets

*Note that from here on, the “proposed” CEM will represent the desired state of the CEM (proposed in a separate document) where the CEM would become an enterprise-wide application. 

The CEM prototype is purely a “proof of concept” and is not yet ready for production use (only minor modifications and testing are needed to deploy it). The goal of building the prototype is to demonstrate a working CEM and collect sample data for analysis and confirmation of the hypotheses. The prototype will also provide direction for follow-on work (if approved) and serve as an example for future efforts and spin-off projects.

Figure 1.0 describes the components of the CEM prototype. From right to left the components are: 

·        Service Provider Servers – These servers host the applications the Internet provider supplies to maintain client access to the Internet (the applications listed in Table 1.0). They supply the configuration and information regional clients need to access the Internet.

·        Regional HSD Network – This cloud represents the regional network segment of the Internet provider’s domain, which enables regional affiliate customers to access the Internet. It consists of several routers and high-speed links that interconnect all customers and applications to the Internet.

·        Client Experience Monitor – An application designed to test how well the applications supplied by the Internet provider respond. The results of these tests are stored in its datastore for further analysis.

·        Datastore – A repository of the performance and availability information collected by the CEM. This database also contains information stored by the User Interface, which allows the applications polled by the CEM to be added and modified. Because of where the CEM host is physically located, it is not affected by cable TV (CATV) outages and can measure the “potential” availability for all customers, whether or not their individual connections are working.

·        Web Server – A portal for information flowing in and out of the CEM datastore. The web server provides a widely accepted, platform-independent interface and offers a variety of well-established access security mechanisms.

·        User Interface – A CGI type web interface used to maintain CEM data. Through the User Interface, the CEM and what it touches can be managed. This interface permits changes to the polling list of machines and the SLA parameters.

·        Regional NOC – A secondary web interface used for reporting purposes only. Information reported includes (among other things) the status of all applications being polled. The page refreshes itself automatically.

Figure 1.0 Regional Client Experience Monitor Components 

Configuration Data:

The CEM architecture lets each region maintain the current list of applications (or servers) required to adequately serve its customers. Data entry is bottom-up in that the regions manage all the data for the applications monitored. The following data represents ONLY the minimum information needed to poll a device. Additional information could be added for identification, escalation, and/or categorization purposes; such fields can be added at any time without affecting the operation of the CEM.

| Variable | Value Type | Description |
|---|---|---|
| <Key> | String | Name of application |
| ip | String | IP address of the application |
| dns | String | Domain Name System (DNS) name associated with application |
| TBC | --- | To be continued… |
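A configuration record holding the minimum fields above, with room for the optional identification, escalation, or categorization fields the text mentions, might look like the following. The `extra` dictionary is an assumption about how such fields could be carried without changing the polling code; the addresses in the usage line are documentation placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AppConfig:
    key: str                      # name of application, e.g. "DNS"
    ip: str                       # IP address of the application host
    dns: str = ""                 # DNS name associated with the application
    extra: Dict[str, str] = field(default_factory=dict)  # optional fields (contact, category, ...)


def from_row(row: Dict[str, str]) -> AppConfig:
    """Build a record from a datastore row; unknown columns go into `extra`."""
    known = {"key", "ip", "dns"}
    return AppConfig(
        key=row["key"],
        ip=row["ip"],
        dns=row.get("dns", ""),
        extra={k: v for k, v in row.items() if k not in known},
    )
```

For example, `from_row({"key": "DNS", "ip": "192.0.2.10", "dns": "ns1.example.net", "contact": "noc"})` keeps the three polling fields and files `contact` under `extra`.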

Data Model: 

The data captured above determines which applications and their associated Internet hosts (hardware) will be polled by the CEM. The CEM simply reads from the database to determine which applications/hosts it should poll. If the application name matches one that the CEM supports* it is then processed fully. 

*Note - At this time a few applications in Table 1.0 have not yet been incorporated into the CEM. Some applications were excluded from the CEM prototype to speed its development. However, all applications would be included in the proposed product. 
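The read-then-dispatch behavior can be sketched as a lookup table from application name to check function, with unsupported names skipped rather than treated as errors. The check functions are placeholders; in the prototype each one would correspond to a client such as those listed in Table 1.0. Matching names case-insensitively against an upper-case table is an illustrative assumption.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# A check takes (ip, dns_name) and returns a response time in seconds, or None on failure.
Check = Callable[[str, str], Optional[float]]


def poll_all(records: Iterable[Dict[str, str]],
             checks: Dict[str, Check]) -> Tuple[Dict[str, Optional[float]], List[str]]:
    """Poll every configured application whose name has a registered check."""
    results: Dict[str, Optional[float]] = {}
    skipped: List[str] = []
    for rec in records:
        check = checks.get(rec["key"].upper())  # check table keys are upper case
        if check is None:
            skipped.append(rec["key"])  # not yet supported by this CEM build
            continue
        results[rec["key"]] = check(rec["ip"], rec.get("dns", ""))
    return results, skipped
```

Returning the skipped names makes it easy to report which configured applications the current build does not yet cover.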

The following represent the data elements collected by the CEM. These elements are grouped in categories that allow them to be explained in more detail. 

CEM Internal Data:

| Variable | Value Type | Description |
|---|---|---|
| day | String | Current day |
| pday | String | Previous day |
| month | String | Current month |
| pmonth | String | Previous month |
| date | Integer | Current date (numeric day of month) |
| pdate | Integer | Previous date (same as above) |
| time | String | Current time of day (hh:mm:ss, 24-hour) |
| ptime | String | Previous time of day (same as above) |
| year | Integer | Current year |
| pyear | Integer | Previous year |
| ts | Integer | Current integer date in date time group (dtg) format |

The CEM uses several internal data components to track changes in day, month, year, and time. These data elements are stored in a record that only the CEM accesses and that the configuration interface ignores. For design purposes, the date time group (dtg) format mentioned above is an integer representing the seconds since 1970 measured according to the local time zone. All internal elements are time zone specific except the current integer date (ts), which is an integer representing the seconds since 1970 in Greenwich Mean Time (GMT). Storing time information in both GMT and local time allows the CEM to be used in both enterprise-wide and region-wide applications.
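The internal fields appear to mirror the pieces of a ctime-style timestamp such as `Thu Jan  1 00:00:00 1970` (day, month, date, time, year), as printed by Perl's scalar `localtime`; that is an inference from the field list, not something the text states. The sketch below derives those fields from a GMT epoch value (`ts`), computes the local "dtg" integer by adding the zone's UTC offset, and detects a day rollover against the previous run's values. Function names are illustrative.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def time_fields(ts: int, tz: tzinfo) -> dict:
    """Split a GMT epoch timestamp into CEM-style internal fields for zone tz."""
    local = datetime.fromtimestamp(ts, tz)
    offset = local.utcoffset() or timedelta(0)
    return {
        "day": local.strftime("%a"),              # e.g. "Thu"
        "month": local.strftime("%b"),            # e.g. "Jan"
        "date": local.day,                        # day of month, 1-31
        "time": local.strftime("%H:%M:%S"),       # 24-hour clock
        "year": local.year,
        "ts": ts,                                 # seconds since 1970, GMT
        "dtg": ts + int(offset.total_seconds()),  # seconds since 1970 on the local clock
    }


def day_rolled_over(prev: dict, cur: dict) -> bool:
    """True when this run falls on a different local calendar day than the previous run."""
    return (prev["year"], prev["month"], prev["date"]) != (cur["year"], cur["month"], cur["date"])
```

The same instant can fall on different calendar days in different zones, which is why keeping `ts` in GMT alongside the local fields lets regional and enterprise-wide reports agree.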

Application Core Data:

| Variable | Value Type | Description |
|---|---|---|
| bdpng | Integer | Begin down time – ping |
| bdrsp | Integer | Begin down time – response |
| cpng | Real | Current (last) ping time |
| crsp | Real | Current (last) response time |
| edate | Integer | Entry date time group of polled info (dtg) |
| edpng | Integer | End down time – ping (dtg) |
| edrsp | Integer | End down time – response (dtg) |
| dp | Boolean | Down flag – ping |
| dr | Boolean | Down flag – response |

The data elements within the application core manage the operation and outage handling of each application in the database. The begin down times (bdpng, bdrsp) store the start of outage events, and the end down times (edpng, edrsp) mark when an outage recovers. These markers allow the CEM to calculate the duration of each outage.
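The begin/end markers and the down flag work together as a small state machine. The sketch below updates the response-side fields (`dr`, `bdrsp`, `edrsp`) for one poll and returns the outage duration when an application recovers; the ping-side fields would be handled identically. The function signature, and keeping a running total in `tdrsp` (named after the summary field later in this document), are illustrative assumptions.

```python
from typing import Optional


def update_outage(rec: dict, ok: bool, now: int) -> Optional[int]:
    """Update response-outage state in rec for a poll at epoch second `now`.

    Returns the outage length in seconds when an outage ends, else None.
    """
    if not ok and not rec.get("dr", False):
        rec["dr"] = True          # outage begins
        rec["bdrsp"] = now
    elif ok and rec.get("dr", False):
        rec["dr"] = False         # outage ends
        rec["edrsp"] = now
        duration = now - rec["bdrsp"]
        rec["tdrsp"] = rec.get("tdrsp", 0) + duration  # running total of down time
        return duration
    return None
```

Because the prototype polls once a minute, a duration computed this way is accurate only to roughly the polling interval.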

The other key elements are the current response times (cpng, crsp) of each application and the entry date (edate). The current response time data is used by all of the remaining data categories to track each application’s history. The entry date records when the application first entered the CEM-managed list of applications, so each application can be tracked back to its introduction into the CEM. Since the number of applications will multiply as load increases, this allows new applications to be entered yet tracked back to their own entry dates. All calculations for each application go back only as far as this date.
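Limiting calculations to the entry date means an application added last week is judged only on last week. The sketch below computes availability over the window from `edate` to now, given the accumulated outage time; the function name and the choice of seconds as the unit are assumptions for illustration.

```python
def availability_since_entry(edate: int, now: int, down_secs: float) -> float:
    """Percent availability from the entry date (edate) up to `now`.

    Both times are epoch seconds; down_secs is the outage time accumulated for
    this application since edate. Earlier history is ignored, matching the rule
    that calculations go back only as far as the entry date.
    """
    window = now - edate
    if window <= 0:
        raise ValueError("now must be later than the entry date")
    if not 0 <= down_secs <= window:
        raise ValueError("down time must fall within the window")
    return 100.0 * (window - down_secs) / window
```

For example, an application that entered the CEM one day ago and has been down for 864 seconds since then scores 99.0%, regardless of how long other applications have been monitored.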

The down flags (ping and response) enable the CEM prototype to remember state between successive runs. This is necessary because the prototype was not constructed as a daemon* but as a simple application that, once executed, performs its function quickly and then exits. The CEM run frequency is set by a cron job configured by the administrator (or root) of the CEM host; the prototype currently runs every minute to poll the devices in its database. Without saved state, any outage detected during a run would be forgotten when that run completes. By storing the down flags, however, the CEM prototype regains each application’s previous status once it reads the record back from the database.
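Because each cron-launched run starts with empty memory, the down flags must be read from and written back to the datastore. The sketch below uses a JSON file as a stand-in for the datastore, which is an assumption; the text does not say how the prototype stored its records. A crontab entry of the form `* * * * * /path/to/cem` (path hypothetical) would launch the program every minute.

```python
import json
import os
from typing import Dict


def load_state(path: str) -> Dict[str, dict]:
    """Read per-application state saved by the previous run; empty on the first run."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def save_state(path: str, state: Dict[str, dict]) -> None:
    """Write state via a temporary file so a run killed mid-write cannot leave it half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)  # atomic rename on POSIX systems
```

Each run would call `load_state`, update every application's flags and markers while polling, and finish with `save_state`.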

*Note: a daemon is an application that runs continuously and can maintain state in its running memory. The CEM prototype is NOT a daemon. The proposed CEM could be constructed as a daemon while retaining functionality similar to the prototype’s. If the CEM were a daemon, the polling interval could be shorter than one minute (if that were desirable).

Summary Data:

| Variable | Value Type | Description |
|---|---|---|
| tpolled | Integer | Total number of times polled |
| tpng | Real | Total ping time |
| trsp | Real | Total response time |
| tdrsp | Real | Total down time – response |