Client Experience Monitor
Monitoring your network from the perspective of your customers'
perceived experience
By: Bruce Bahlmann - Contributing Author
Created: August 24, 1999
Overview:
Customer demand for quality Internet access is prompting a change
in the way Internet information services (or high-speed Internet service, HSD) will
be marketed in the future. As a result, traditional applications for measuring Internet
service will give way to more sophisticated applications that focus on customer
experience and quality. An application called a client experience monitor (CEM) has proven
potential to provide affiliates with the information they need to quantify the level of
service they receive from Internet providers and to guide future agreements for continued
service. This document describes a working prototype of the CEM and presents a snapshot of
the data it has collected.
Background:
Running an information service requires a high degree of technical
expertise and, most importantly, consistency. As the Internet rushes into an
increasing number of customer homes, sustaining the load generated by new
customers will require substantial attention from Internet providers. The case where the
Internet service supplied to customers is essentially an "always on" connection
presents the greatest challenge in maintaining the performance and scalability of core
Internet services.
Core Internet services for "always on" connection providers
are listed in Table 1.0. Internet services such as DHCP, BOOTP, TFTP, and NTP provide
the basis for a cable modem to function and are of the infrastructure service type. Other
services such as DNS, FTP, HTTP, NNTP, Ping, SMTP, and Traceroute are all of the client
service type. The remaining services are used by the Internet provider's operations (Ops)
staff to monitor, sustain, and troubleshoot the previous services.
A relationship exists between Internet providers and their affiliates.
Affiliates provide information services to customers, of which Internet information service
is but one component. Internet providers supply the facilities that enable an affiliate to
provide Internet services to its customers. This relationship is governed by a contract
called a service level agreement (SLA), among other agreements. The SLA
binds the affiliate to the Internet provider and defines the level of service the affiliate
can expect in return. Within the SLA are several points of interest to this document.
Notably, the "Key Performance Indicators" and the "Network Services
Conformance" sections provide the operational parameters that the Internet provider
has committed to supplying. Key performance indicators focus on response to outages
or escalations, whereas network services conformance is concerned with availability. The rest
of this document focuses on availability.
Availability
One of the commonly used terms with regard to providing Internet
service is "availability." Availability is defined as "capable of being obtained
and/or accessible for use." Internet providers use the word to signify the
degree of reliability they intend to provide for the various services they supply.
Availability is typically expressed as a percentage (%), with higher percentages equating to
higher reliability.
The availability projections within the SLA are usually based on the
Internet provider's "best effort" to measure the accessibility of the
services it provides. One of the most common tools in use today to measure availability is
ping. The ping application communicates with Internet hosts to determine their operational
status. For example, if a host is operational (or "up"), the ping application reports it as
"alive." If the host is not operational (or "down"), the ping application reports
"no response" or "request timed out." Although ping is a useful operational tool on the
Internet, it is not a very reliable means of measuring availability. For example, the host
may be up while the application (or service) supplied by the host is down. In this
case the availability is reported incorrectly. As a result, there is a difference between
application availability (measured via the application's client) and host
availability (measured via ping).
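The difference between the two measurements can be illustrated with a small probe. The following is a minimal Python sketch, not the prototype's code (the prototype used Perl modules). The names `probe_tcp` and `classify` are hypothetical, and a bare TCP connect is only a stand-in for a real client transaction such as a web fetch or a DNS lookup.

```python
import socket
import time


def probe_tcp(host, port, timeout=2.0):
    """Attempt a TCP connection and return (is_up, elapsed_seconds).

    Assumption: a successful connect within `timeout` counts as "up".
    A fuller client would also exchange application data.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False, time.monotonic() - start


def classify(host_up, app_up):
    """Combine a host check (e.g. ping) with an application check."""
    if host_up and app_up:
        return "up"
    if host_up:
        # The case ping alone misreports as available.
        return "host up, application down"
    return "down"
```

A host that answers ping while `probe_tcp` fails falls into the middle case, which is exactly the outage that host availability figures hide.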
Surprisingly, the Internet provider is often the only party that monitors
availability levels to measure its compliance with the SLA. The Internet
provider supplies this monitoring because the affiliate does not always have the means to
do it on its own. However, SLAs typically do not stipulate the type of monitoring
(application availability or host availability) they require. In the absence of any specific
request for a monitoring method, host availability is likely reported by default because
it is the easiest to obtain. As a result, the monitoring data reported by the Internet
provider often does not reflect the actual availability seen from a typical
customer's perspective.
Since the affiliate is ultimately responsible for providing the service
(or is seen in the eyes of the customer as responsible for sustaining reliable Internet
service), it must seek ways to provide the highest quality service possible. One of the
best ways to provide reliable service is to pass these requirements along to the
Internet provider. The following are some ways to accomplish this:
· Establish some means of confirming the quality and reliability of the
service supplied by the Internet provider.
· Establish motivations for the Internet provider to seek the highest
availability possible.
· Provide customers with access to the current status of various applications,
scheduled outage windows, etc.
· Provide the data needed to make better-informed decisions regarding the
handling of customer trouble calls and the coordination of upgrades requested by the
Internet provider.
Providing reliable Internet service helps the affiliate in the
following ways:
· Increased availability (higher reliability) means fewer trouble calls and
potentially fewer truck rolls. Every trouble-related call answered is potentially
one less sales call answered.
· Increased availability means higher customer confidence in Internet service
delivered via cable TV lines, and thus opens doors for sales in new markets.
· Increased availability also means more satisfied customers, which translates
into greater demand.
The impact that availability has on things like call volume, truck
rolls, and sales is not known at this time. However, a tool that measures
availability to the minute could be compared against call volume to look for trends
and establish relationships between the two. At the time of this writing, it seems
reasonable to expect that there is a relationship between call volume and availability.
Further analysis could potentially derive a cost factor per customer
that is absorbed by the affiliate as a result of lowered availability. That
cost could in turn be used to establish the minimum availability levels
an affiliate will accept. Thus a tool that provides affiliates with up-to-the-minute
calculations of availability could help them better understand the relationship
between availability and support costs, and reduce the burden that lower availability
places on affiliates.
Providing motivations to Internet providers is a key to establishing
realistic minimum application service levels. From the history of an Internet
provider's performance, one can establish the average service availability level
provided. This average availability level could then be used to drive the affiliate's
required service availability levels. Combining this with the impact studies above could
result in the affiliate providing incentives for the Internet provider to perform above the
required service availability, such as a kickback premium per customer. Likewise,
service availability levels below the required levels would result in service discounts
per customer (to enable the affiliate to recover the added support costs that were the
result of lower availability levels). Providing these kinds of incentives would allow
availability to be treated equally with other methods of evaluating an Internet
provider's performance.
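As an illustration of how such an incentive might be computed, the sketch below applies a flat per-customer rate to each percentage point of availability above or below the required level. The function name, the linear formula, and every figure in the example are assumptions made for illustration; the source does not specify how premiums or discounts would be priced.

```python
def sla_adjustment(measured_pct, required_pct, customers, rate_per_point):
    """Return the settlement implied by measured versus required availability.

    Positive result: premium paid to the Internet provider.
    Negative result: discount returned to the affiliate.
    Assumption: the adjustment is linear in percentage points.
    """
    return (measured_pct - required_pct) * customers * rate_per_point


# Hypothetical figures: 1,000 customers, $0.10 per customer per point.
premium = sla_adjustment(99.5, 99.0, 1000, 0.10)    # above the required level
discount = sla_adjustment(98.0, 99.0, 1000, 0.10)   # below the required level
```

A real agreement would more likely use tiers or caps, but the linear form is enough to show that the same availability measurement can drive both the premium and the discount.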
Informative Execution
Application availability information provides affiliates
with the means to make informed decisions regarding escalation of calls to the Internet
provider's tier-two services, scheduling of service calls, and acceptance of system
upgrades. In fact, this information could actually drive affiliate requests for specific
application performance upgrades in some cases. Making informed decisions is a key to cost
savings and to reducing outages caused by unnecessary upgrades. Targeting capital
expenditures to areas of need (a "scratch where it itches" approach to network
upgrades) provides Internet providers with a means of controlling costs and increasing
operational efficiency.
Consideration of client performance as a driving factor for application
availability levels has not yet reached the mainstream, and quality features
such as availability and reliability play a limited role in today's customers'
selection of an Internet information service. However, as customers' choices of
Internet access become more equal in terms of speed, capability, price, and flexibility,
quality will be what differentiates one Internet information service from
another.
As the market for Internet service shifts gears to begin focusing on
quality, affiliates need to be ready to quantify the service levels they want to provide.
Work-at-home customers will be among the first to demand the highest possible levels of
service and will likely compare various options before buying. Having access to
up-to-the-minute service levels will enable marketing to go after these highly demanding
customers. Hence the need for a tool or system that drives up service availability levels
and empowers affiliates' continued growth in the future.
Application: | Protocol: | Transport: | Service Type: | Min TO: | Max TO: | TO Used: | Perl Module: | Port: |
BOOTP | Bootstrap protocol | UDP | | | | | | 67s, 68c |
DHCP | Bootstrap protocol | UDP | | | 32 sec | 2 sec | | 67s, 68c |
DNS | Domain name system | UDP/TCP | | | 120 sec | 1 sec | Net::DNS, Socket | 53 |
FTP | File transfer | TCP | | | | | | |
HTTP | The Web | TCP | | | | 2 sec | LWP::Simple | 80 |
NFS | Network file system | UDP/TCP | | | | | | |
NNTP | Network news | TCP | | | | 20 sec | News::NNTPClient, News::NNTPFetchProgress | |
NTP | Time protocol | UDP | | | | 1 sec | Net::Time | |
Ping | | ICMP | | | | | Net::Ping | |
SMTP | Electronic mail | TCP | | | | | | |
| POP3 | --- | | | 60 sec | 30 sec | Mail::POP3Client | |
| IMAP | --- | | | | | | |
SNMP | Network Management | UDP | | | | | | |
Telnet | Remote login | TCP | | | | | | |
TFTP | Trivial FTP | UDP | | | 8 sec | 6 sec | TFTP.pm | 69 |
Traceroute | | ICMP/UDP | | | | | | |
Table 1.0 Internet Application Chart
Index: | Description: |
c | Client port (if specified) |
NA | Information not available |
s | Server port |
The goal of the CEM is to regularly perform "client-like"
tasks. The CEM is responsible for storing application response results alongside
traditional availability tests (pings, which are performed in parallel). This
data will enable separate CEM tools to produce periodic reports that summarize compliance
with the service level agreement, and to produce a client experience rating based on the
responsiveness of the applications supplied by the Internet provider.
It is projected that a delta exists between up time (from a
client's perspective) and application availability reported by the Internet provider.
The delta will be the result of degradation in application performance to a point where it
is unacceptable to the client (or noticeably impacts the client's ability to use the service).
During these periods of degradation the reported application availability will remain
unchanged when, in actuality, the application is effectively down from a client's
perspective.
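The projected delta can be sketched by scoring the same set of poll results two ways: once counting any response as "up," and once counting only responses that arrive within an acceptable time. This is a hedged Python illustration; the function name, the sample values, and the 2-second threshold are assumptions rather than figures taken from the prototype.

```python
def availability(samples, threshold=None):
    """Return percent availability over a list of poll results.

    samples:   response times in seconds, or None when nothing came back.
    threshold: if given, responses slower than this count as down
               (the client-perceived view); if None, any response
               counts as up (the conventionally reported view).
    """
    if not samples:
        raise ValueError("no samples to evaluate")
    ok = 0
    for response_time in samples:
        if response_time is None:
            continue  # no response at all
        if threshold is not None and response_time > threshold:
            continue  # responded, but too slowly to be usable
        ok += 1
    return 100.0 * ok / len(samples)


# Hypothetical polls: one failure and one badly degraded response.
polls = [0.2, 0.3, 5.0, None, 0.4]
reported = availability(polls)                  # any response counts
perceived = availability(polls, threshold=2.0)  # slow responses count as down
delta = reported - perceived
```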
It is also projected that a relationship between call volume and
application availability exists. An increase in call volume as a result of a decrease in
availability would provide evidence of an additional metric that must be considered with
respect to the SLA, as its cost is currently absorbed by the affiliate.
Additionally, it is projected that during application outages the
reported availability of these applications will fail to depict the actual accessibility of
resources provided by the Internet provider, due to the resulting increase in load.
Instead, the application is effectively down much longer from the
client's perspective.
The CEM and its data will seek to provide affiliates with a reliable
means to monitor the Internet provider's compliance with the SLA. Monitoring of
client experience will strive to eliminate potential bottlenecks or single points of
failure to provide the most accurate measurement possible. The CEM will also seek to
establish a range of acceptable client experience ratings. This range is
expected to raise the bar on the Internet provider's application performance to
account for quantifiable demands by the affiliate for higher service quality and capacity.
The CEM prototype* (or alpha module) design is very simple. The steps
to building the CEM are the following:
· Obtain a list (Table 1.0) of all the applications required to maintain
Internet access for clients
· Create simple clients for each of these applications (most are publicly
available)
· Combine all clients into a single application capable of testing all
application types
· Create a data model that supports the CEM design goals
· Create a user interface to enter applications into the system
· Create a user interface to display the status of the applications
· Isolate the CEM from the RF plant where it could be affected by
affiliate-controlled assets
*Note: From here on, the "proposed CEM" refers to the desired state of the CEM
(proposed in a separate document), in which the CEM would become an enterprise-wide
application.
The design of the CEM prototype is purely a "proof of concept"
and is not completely ready for production use (although only minor modifications and
testing are necessary to deploy it). The goal of building the prototype is to demonstrate a
working CEM and collect sample data for analysis and hypothesis confirmation. The
prototype will also provide direction for follow-on work (if approved) and serve as an
example for future efforts and/or spin-off projects.
Figure 1.0 describes the components of the CEM prototype. From right to
left the components are:
· Service Provider Servers: These servers host the applications supplied by the
Internet provider to maintain client access to the Internet. These servers (which
represent those listed in Table 1.0) supply the necessary configuration and information to
regional clients, enabling them to access the Internet.
· Regional HSD Network: This cloud represents the regional network segment of the
Internet provider's domain, which enables regional affiliate customers to access the
Internet. This cloud consists of several routers and high-speed links that interconnect
all customers and applications to the Internet.
· Client Experience Monitor: An application designed to test how well the
applications supplied by the Internet provider respond. The results of these tests are
stored in its datastore for further analysis.
· Datastore: A repository of the performance and availability information collected
by the CEM. This database also contains information stored by the User Interface that
allows additions and modifications to the applications being polled by the CEM. The
physical location of the CEM host ensures that it is not affected by cable TV (CATV)
outages and can measure the potential availability for all customers, whether their own
service is working or not.
· Web Server: A portal for information flowing in and out of the CEM datastore. The
web server provides a universally accepted interface that is platform independent and
offers a variety of well-established access security mechanisms.
· User Interface: A CGI-type web interface used to maintain CEM data. Through the
User Interface, the CEM and what it touches can be managed. This interface permits changes
to the polling list of machines and to the SLA parameters.
· Regional NOC: A secondary web interface used for reporting purposes only.
Information reported includes (among other things) the status of all applications being
polled. This web page is updated automatically and is set up to refresh itself.

Figure 1.0 Regional Client Experience Monitor Components
The architecture of the CEM is such that regions can maintain the
current list of applications (or servers) required to adequately service their customers.
This architecture for data entry is "bottom-up" in that the regions manage all the data
for the applications monitored. The following data represents ONLY the minimum information
needed to poll the device. Additional information could be added for identification,
escalation, and/or categorization purposes. These fields can be added at any time without
impact to the operation of the CEM.
Variable: | Value Type: | Description: |
<Key> | String | Name of application |
ip | String | IP address of the application |
dns | String | Domain Name System (DNS) name associated with application |
TBC | --- | To be continued |
The data captured above determines which applications and their
associated Internet hosts (hardware) will be polled by the CEM. The CEM simply reads from
the database to determine which applications/hosts it should poll. If the application name
matches one that the CEM supports*, it is then processed fully.
*Note - At this time a few
applications in Table 1.0 have not yet been incorporated into the CEM. Some applications
were excluded from the CEM prototype to speed its development. However, all applications
would be included in the proposed product.
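The name-matching step can be pictured as a lookup table from application name to client routine. The sketch below is a Python illustration of that idea, not the prototype's Perl code; `poll_all`, the `probes` mapping, and the case-insensitive match are all assumptions.

```python
def poll_all(entries, probes):
    """Poll every database entry whose application name is supported.

    entries: {application_name: {"ip": ..., "dns": ...}}, mirroring the
             <Key>, ip, and dns fields described above.
    probes:  {lowercase_name: callable(ip, dns) -> result}; a hypothetical
             registry holding one client routine per supported application.
    """
    results = {}
    for name, entry in entries.items():
        probe = probes.get(name.lower())
        if probe is None:
            continue  # application not yet incorporated: skip it
        results[name] = probe(entry["ip"], entry["dns"])
    return results
```

Adding support for another application then amounts to registering one more routine, without touching the polling loop.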
The following represent the data elements collected by the CEM. These
elements are grouped in categories that allow them to be explained in more detail.
CEM Internal Data:
Variable: | Value Type: | Description: |
day | String | Current day |
pday | String | Previous day |
month | String | Current month |
pmonth | String | Previous month |
date | Integer | Current date (numeric day of month) |
pdate | Integer | Previous date (same as above) |
time | String | Current time of day (xx:xx:xx), 24 hour |
ptime | String | Previous time of day (same as above) |
year | Integer | Current year |
pyear | Integer | Previous year |
ts | Integer | Current integer date in date time group (dtg) format |
The CEM uses several internal data components to track changes in
day, month, year, and time. These data elements are stored in a record that only the CEM
accesses and that the configuration interface ignores. For design purposes, the date time
group (dtg) format mentioned above is an integer representing the seconds since 1970
measured according to the local time zone. All internal elements are time zone specific
except the current integer date (ts), which is an integer representing the seconds
since 1970 in Greenwich Mean Time (GMT). Storing time information in both GMT and local time
allows the CEM to be used in both enterprise-wide and region-wide applications.
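The two time bases can be derived from a single clock reading. The following Python sketch shows one way to do so, under the assumption that the local dtg is the local wall-clock time counted as if it were GMT; the function names are hypothetical and the prototype's actual calculation is not shown in the source.

```python
import calendar
import time


def gmt_ts(now=None):
    """Seconds since 1970 in GMT: the ordinary Unix epoch value (ts)."""
    return int(time.time() if now is None else now)


def local_dtg(now=None):
    """Seconds since 1970 measured on the local wall clock.

    Assumption: the local calendar time is converted back to seconds
    as though it were GMT, so the result differs from gmt_ts() by the
    local UTC offset in effect at that moment.
    """
    now = time.time() if now is None else now
    return calendar.timegm(time.localtime(now))
```

The difference between the two values is the local offset from GMT, which is what lets region-local day boundaries and enterprise-wide comparisons coexist in one record.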
Application Core Data:
Variable: | Value Type: | Description: |
bdpng | Integer | Begin down time - ping |
bdrsp | Integer | Begin down time - response |
cpng | Real | Current (last) ping time |
crsp | Real | Current (last) response time |
edate | Integer | Entry date time group of polled info (dtg) |
edpng | Integer | End down time - ping (dtg) |
edrsp | Integer | End down time - response (dtg) |
dp | Boolean | Down flag - ping |
dr | Boolean | Down flag - response |
The data elements within the application core manage the operation and
outage handling of each application in the database. Begin time (bdpng, bdrsp) provides a
storage location for the beginning of outage events, and end time (edpng, edrsp) marks the
recovery time of the outage. These markers allow the CEM to calculate the duration of
specified outages.
The other key elements are the current response time (cpng, crsp) of
each application and the entry date (edate). The current response time data
is used by all of the remaining data categories to track each application's history.
The entry date records when the application first
entered the CEM-managed list of applications. Having the entry date allows each application
to be tracked back to its introduction into the CEM. Since applications will multiply with
increased load, this allows new applications to be entered yet tracked back to their
unique entry date. All calculations for each application only go as far back as this date.
The down flags (ping & response) enable the CEM prototype to
remember state through successive operations. This functionality must be part of the CEM
prototype because it was not constructed as a daemon* but rather as a simple application
that, once executed, performs its function quickly and then exits. The CEM run frequency is
set by a cron event configured by the administrator (or root) of the CEM host. The
prototype currently runs every minute to poll the devices in its database. Any outage
detected by the CEM prototype would otherwise be forgotten at the completion of its run.
However, by using the down flags, the CEM prototype is able to regain each application's
previous status once it is read from the database.
*Note: A daemon is an application
that runs continuously and can maintain state by saving it in its running memory. However,
the CEM prototype is NOT a daemon. The proposed CEM could be constructed as a daemon while
retaining functionality similar to that of the prototype. If the CEM were a daemon, the
polling interval could be set to less than one minute (if that was desirable).
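The way a stored down flag carries an outage across separate runs can be sketched as a small state transition applied to each application's record. This Python illustration uses the response-side fields from the table above (dr, bdrsp, edrsp, tdrsp); the function name and the choice to add the outage to the running total at recovery time are assumptions, since the source does not show the prototype's bookkeeping.

```python
def record_poll(rec, responded, now):
    """Update one application's record after a single poll.

    rec:       dict holding dr, bdrsp, edrsp, and tdrsp, as loaded from
               the datastore at the start of this run.
    responded: True if the application answered this poll.
    now:       poll time as an integer dtg (seconds).
    """
    if not responded and not rec["dr"]:
        # Transition up -> down: remember when the outage began.
        rec["dr"] = True
        rec["bdrsp"] = now
    elif responded and rec["dr"]:
        # Transition down -> up: close the outage and tally its length.
        rec["dr"] = False
        rec["edrsp"] = now
        rec["tdrsp"] += rec["edrsp"] - rec["bdrsp"]
    # No transition: the stored flag already reflects the current state.
    return rec
```

Because the flag and begin time live in the record rather than in memory, each one-minute run can pick up where the previous run left off. The ping-side fields (dp, bdpng, edpng, tdpng) would follow the same pattern.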
Summary Data:
Variable: | Value Type: | Description: |
tpolled | Integer | Total number of times polled |
tpng | Real | Total ping time |
trsp | Real | Total response time |
tdrsp | Real | Total down time - response |
tdpng | Real | Total down time - ping |
Summary data mainly consists of simple tallies of operational data
collected (cpng, crsp) or successful completions of the CEM polling operation for each
specific application (tpolled).
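A short sketch of how such tallies could be maintained and turned into an overall average is given below. It is an illustration in Python with hypothetical function names; it assumes an average is simply the running total divided by the number of polls.

```python
def update_summary(rec, ping_time, response_time):
    """Fold one successful poll into the running summary tallies."""
    rec["cpng"] = ping_time       # current (last) ping time
    rec["crsp"] = response_time   # current (last) response time
    rec["tpng"] += ping_time
    rec["trsp"] += response_time
    rec["tpolled"] += 1
    return rec


def overall_average(total, polled):
    """Average time per poll, or None before the first poll."""
    return total / polled if polled else None
```

Keeping only totals and counts means the record stays the same size no matter how long the application has been monitored.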
Yearly Data:
Variable: | Value Type: | Description: |
byear | Integer | Beginning of year for device (dtg) |
ydpng | Real | Yearly down time - ping |
ydtpng | Integer | Yearly number of down times - ping |
ydrsp | Real | Yearly down time - response |
ydtrsp | Integer | Yearly number of down times - response |
pydpng | Real | Previous year's down time - ping |
pydrsp | Real | Previous year's down time - response |
ytpng | Real | Yearly total ping time |
yhpng | Real | Yearly high ping time |
yhdpng | Integer | Yearly high ping dtg |
ytrsp | Real | Yearly total response time |
yhrsp | Real | Yearly high response time |
yhdrsp | Integer | Yearly high response dtg |
ypolled | Integer | Yearly times polled |
Yearly data consists of various metrics used to record historical
application ping and response times for up to a whole year (depending on when the
application was introduced to the CEM's database). The byear element marks the beginning of
the year and is used in conjunction with other yearly data to calculate this historical
information. Several tallies are maintained for the year, including down time (ydpng,
ydrsp), total time (ytpng, ytrsp), previous down times (pydpng, pydrsp), highs (yhpng,
yhrsp), high dates (yhdpng, yhdrsp), down incidents (ydtpng, ydtrsp), and times polled for
the year (ypolled). Most of these tallies are self-explanatory except the down incidents.
These two tallies are used to calculate the average duration of up time for the
application (or stated another way, this information provides a means of calculating the
average time between outages).
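One plausible form of that calculation is sketched below in Python: the up time accumulated since the start of the period is divided by the number of down incidents. The function name and the exact formula are assumptions, as the source does not spell out the arithmetic.

```python
def mean_uptime(period_start, now, down_seconds, incidents):
    """Average up time between outages over one reporting period.

    period_start: dtg marking the start of the period (e.g. byear).
    now:          current dtg.
    down_seconds: accumulated down time in the period (e.g. ydrsp).
    incidents:    number of down incidents in the period (e.g. ydtrsp).
    Returns None when there were no incidents to average over.
    """
    if incidents == 0:
        return None
    up_seconds = (now - period_start) - down_seconds
    return up_seconds / incidents


# Hypothetical day: one hour of down time spread over three incidents.
average_seconds_up = mean_uptime(0, 86400, 3600, 3)
```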
Yearly data elements mimic those of monthly and daily, so the monthly and daily
elements will not be explained in detail. The only difference between the yearly, monthly,
and daily elements is the first letter and the rate at which each is cleared and reported
by the CEM. The CEM ensures that with each change of day, month, and year the
appropriate data elements are reset to provide their respective historical data for the
current day, month, and year.
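Because the three groups differ only in their first letter, a single routine can handle every rollover. The Python sketch below illustrates the idea with a hypothetical `roll_period` function; the set of fields it resets and the carrying of down time into the "previous" fields are inferred from the field names, not confirmed by the source.

```python
PERIOD_NAMES = {"d": "day", "m": "month", "y": "year"}

# Field suffixes shared by the daily, monthly, and yearly groups.
RESET_SUFFIXES = ("dpng", "dtpng", "drsp", "dtrsp", "tpng", "hpng",
                  "hdpng", "trsp", "hrsp", "hdrsp", "polled")


def roll_period(rec, prefix, now):
    """Start a new day ('d'), month ('m'), or year ('y') for one record."""
    # Preserve the finished period's down time (e.g. ydpng -> pydpng).
    rec["p" + prefix + "dpng"] = rec.get(prefix + "dpng", 0)
    rec["p" + prefix + "drsp"] = rec.get(prefix + "drsp", 0)
    # Clear the tallies for the new period.
    for suffix in RESET_SUFFIXES:
        rec[prefix + suffix] = 0
    # Mark when the new period began (bday, bmonth, or byear).
    rec["b" + PERIOD_NAMES[prefix]] = now
    return rec
```

A run that detects a change of year would call the routine for the year, month, and day in turn, since all three periods begin at that moment.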
Monthly Data:
Variable: | Value Type: | Description: |
bmonth | Integer | Beginning of month for device (dtg) |
mdpng | Real | Monthly down time - ping |
mdtpng | Integer | Monthly number of down times - ping |
mdrsp | Real | Monthly down time - response |
mdtrsp | Integer | Monthly number of down times - response |
pmdpng | Real | Previous month's down time - ping |
pmdrsp | Real | Previous month's down time - response |
mtpng | Real | Monthly total ping time |
mhpng | Real | Monthly high ping time |
mhdpng | Integer | Monthly high ping dtg |
mtrsp | Real | Monthly total response time |
mhrsp | Real | Monthly high response time |
mhdrsp | Integer | Monthly high response dtg |
mpolled | Integer | Monthly times polled |
Daily Data:
Variable: | Value Type: | Description: |
bday | Integer | Beginning of day for device (dtg) |
ddpng | Real | Daily down time - ping |
ddtpng | Integer | Daily number of down times - ping |
ddrsp | Real | Daily down time - response |
ddtrsp | Integer | Daily number of down times - response |
pddpng | Real | Previous day's down time - ping |
pddrsp | Real | Previous day's down time - response |
dtpng | Integer | Daily total ping time |
dhpng | Real | Daily high ping time |
dhdpng | Integer | Daily high ping dtg |
dtrsp | Real | Daily total response time |
dhrsp | Real | Daily high response time |
dhdrsp | Integer | Daily high response dtg |
dpolled | Integer | Daily times polled |
The CEM was designed on a Sun Solaris platform but with a platform-independent
programming language (Perl). While it is mostly portable to other hardware platforms that
support Perl (nearly all), it is highly recommended that it remain on a Sun Solaris
platform for stability.
The client experience monitor has been in operation since September 2nd
and has collected the following data*. Figure 2.0 represents the average response
performance detected by the client experience monitor. This information shows the following:
Heading: | Description: |
Service: | The application being polled |
IP: | The IP address of the application host |
Status: | Status of the host and/or application (UP/DOWN) |
Day: | Average ping response time for the current day |
Month: | Average ping response time for the current month |
Year: | Average ping response time for the current year |
Overall: | Average ping response time overall |
Day: | Average application response time for the current day |
Month: | Average application response time for the current month |
Year: | Average application response time for the current year |
Overall: | Average application response time overall |
*Note: The web site that was built
to display the client experience monitor data only displays some of the actual data
contained in the database. Since this is a prototype, little effort was expended to
display all the possible information in the database, but rather just enough to show the
kinds of information being collected. Follow-on efforts could rework these web pages to
display a variety of information at varying levels of detail.
Increases/decreases in the average response times could be used to
diagnose increasing/decreasing network load.
Figure 2.0 Performance snapshot
Figure 2.1 represents the availability detected by the client
experience monitor. This information shows the following:
Heading: | Description: |
Service: | The application being polled |
Day: | Host availability for the current day |
Month: | Host availability for the current month |
Year: | Host availability for the current year |
Overall: | Host availability overall |
Day: | Application availability for the current day |
Month: | Application availability for the current month |
Year: | Application availability for the current year |
Overall: | Application availability overall |
This information could be used to establish required application
availability metrics for the SLA.
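For reference, an availability percentage of this kind can be derived from the stored begin-of-period marker and the accumulated down time. The Python sketch below shows one such derivation; the function name and the formula (elapsed time minus down time, over elapsed time) are assumptions, since the source does not give the exact calculation used for the snapshot.

```python
def availability_pct(period_start, now, down_seconds):
    """Percent of the elapsed period during which the service was up.

    period_start: dtg for the start of the period (bday, bmonth, byear,
                  or the entry date for the overall figure).
    now:          current dtg.
    down_seconds: accumulated down time for the same period.
    """
    elapsed = now - period_start
    if elapsed <= 0:
        raise ValueError("period has no elapsed time")
    return 100.0 * (elapsed - down_seconds) / elapsed


# Hypothetical day with 864 seconds (about 14 minutes) of down time.
daily = availability_pct(0, 86400, 864)
```

Running the same function once with ping down time and once with response down time yields the host and application columns, and the gap between them shows how much host availability overstates what a customer sees.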

Figure 2.1 Availability snapshot
Figure 2.2 represents some details of interest collected by the client
experience monitor. This information shows the following:
Heading: | Description: |
Service: | The application being polled |
Day: | Ping response time high for the current day |
Mon: | Ping response time high for the current month |
Yr: | Ping response time high for the current year |
Day: | Application response time high for the current day |
Mon: | Application response time high for the current month |
Yr: | Application response time high for the current year |
Day: | Ping down time for the current day |
Mon: | Ping down time for the current month |
Yr: | Ping down time for the current year |
Tot: | Ping down time total for service |
Day: | Application down time for the current day |
Mon: | Application down time for the current month |
Yr: | Application down time for the current year |
Tot: | Application down time total for service |
These details are helpful in understanding what is behind availability
numbers in terms of response highs and actual down time recorded. When availability is
reported merely as a percentage, all this information gets rolled up into a single number
that is not very meaningful unless one understands the details behind it.

Figure 2.2 Details section of Client Experience Monitor data
Service Level Agreement Example, 10 August 1999.