This site has moved…

Please update your links; this blog site has moved to: http://shawnennis.com

Concepts, Trade Shows

The Value of Realtime Digital Services


TMF Live 2016 wrapped up today, and as I take my Uber back to NCE airport, oddly enough I am experiencing the value of real-time digital services. Like most people, I use the Uber ride-share service. My reasons are the same as everyone else’s: convenience, price, quality, etc. In my experience coming to Nice for TMF Live, taxis are typically twice the cost.

As we were driving, the driver’s phone beeped. It told him that there was an accident up ahead and we needed to divert. Interestingly, my phone beeped as he said this and I got the same message showing a red line up ahead. The driver, a long-time taxi cab driver, said this was one of the reasons he switched to Uber. Because other Uber drivers are constantly, autonomously reporting traffic (way more than cab drivers do), he spends more time driving and less time in traffic. He drives more customers and makes considerably more money. The customers are happier, and online bill pay means less hassle: he drives, and that is all he worries about. The cost of Uber? For him nothing; the passengers pay that. He drives and gets paid. And he is nice: he offered me a paper (quaint) and a free bottle of water before boarding.

So to review… Uber is based in California, 6,000 miles and a 9-hour time difference away. Using AWS hosting, it allows real-time automatic cross matching of traffic to make lives a little easier in Nice. The mini to the macro at work here. This 60+ year old driver, driving all his life, reaps the benefit. I pay an extra 2€, with a 40% reduction in rates, a smoother ride in a new car, and a nicer driver: that is value for the customer. What makes this miracle possible? Realtime digital services. Uber and others like them are winning the battle by pushing realtime digital services over LTE, competing against taxi cabs with CB radios. As the newspaper industry has already realized, the taxi cab industry will soon become… quaint…

My question to you: What is your realtime service? What does it mean to your business? How do you assure that it continues to be realtime?

Contact Monolith Software today about AssureNow’s Unified Service Assurance.

PS. Thanks, T-Mobile, for the included international roaming. Uber would not have been possible without you…

Concepts, Trade Shows

Service Assurance Challenges of IoT


A common question of the day is, “What are we going to do in the IoT world?” My typical response to service providers is, “Well, that was last week…” All kidding aside, we live in the connected generation. Network access is the new oxygen. The price to be paid is complexity and scale. A good reference for the IoT use cases that exist is this bemyapp article about Ten B2B use cases for IoT.

But what needs to be discussed is how to group these use cases and what their common threads are. It’s best to categorize them into three buckets. The first is environmental monitoring, such as smart meters, to reduce human interaction requirements. Tracking logistics through RFID is another common trend in IoT communities. The most common is client monitoring: in mobility, handset tracking and trending is common in CEM; in an access network, it’s monitoring the cable modems of millions of customers. Whichever category your use case falls into, the challenges will be similar. How do you deal with the fact that your network becomes tens of millions of small devices instead of thousands of regular-sized devices? How do you handle the fact that billions of pieces of data need to be processed, but only a fraction will be immediately useful? How can you break down the network into humanly understandable segments?

The solution is simple: Unified Service Assurance. With a single source of truth, you can see the forest for the trees. While the “things” in IoT are important, how they relay information and perform their work is equally important. Monitoring holistically allows a better understanding of the IoT environment; single point solutions will not address IoT. Normalizing data enables higher scale while maintaining high reliability.

Now that the network has been unified into a single source of truth, operations can start simplifying their workload. First step: become service oriented. Performance, fault, and topology are too much data; it’s the services you must rely upon. How are they doing, what are the problems, how do you fix them, and where do you need to augment your network? Next up, correlate everything; you need to look at the 1% of the 1% of the 1% to be successful. KQIs are necessary because the trees in the forest are anecdotal information: the EFFECT. Seeing the forest (as the KQI) allows you to become proactive, move quicker, and be more decisive because you understand the trends and what is normal. It’s time to stop letting the network manage you and start managing your network.

After unifying your view and simplifying your approach, it’s time to automate. The whole point of IoT is massive scale and automation, but if your SA solution cannot integrate openly with the orchestration solution, how will you ever automate resolution & maintenance? We all must realize that human-based lifecycle management is not possible at IoT scale. It’s time to match the value of your network with the value of managing it.

Learn more about Monolith Software’s AssureNow by scheduling a call today.

Concepts, Trade Shows

Smart City Service Assurance


Here at TMF Live 2016, I was fortunate to get more educated on the newest industry initiatives. At the Smart City Dublin forum, the subject was how municipalities can save money and better enable their citizenry. These opportunities are not being driven by cities themselves, but by innovative service providers offering exciting new services. Cities have assets, like rights-of-way. Cities have advancing needs, like free wifi to empower tourism. Cities have challenges, like shrinking budgets and stodgy policies. While some service providers may shy away from engagement, others see these challenges as opportunities for new products and revenues.

The concept is simple. Leverage the rights-of-way (lamp posts), engage your NEPs, and install a wifi network enabled by advertising. Smart cities can share in the ad-based profits and provide new tourist-engaging services to grow the local community. Through this service, provide a portal to citizens and tourists alike showing off the local digital economy. Provide multi-tenant access showing other services, from garbage collection to power outage notifications, to enrich the people’s knowledge and grow the city’s automation potential. Reduce the cost and redundancy of all the services (basic to advanced) that the city is chartered to provide.

Where does service assurance come in? The digital world is a unifying force. Providing a single pane of glass is common sense, but unfortunately not commonplace. Once deployed, the quality of a city’s services defines its brand. The analog and digital services will need to be assured. Proactive engagement is no longer a nice-to-have; it’s expected. Leveraging a service assurance solution with a proactive portal across all services will enable the smart city revolution.

Service providers, government, and equipment manufacturers can be brought together to provide revenue-positive new services (e.g., Dublin). The question is how service providers will assure the quality and engage the populace. The answer: Unified Service Assurance. Learn more about Unified Service Assurance with Monolith’s AssureNow.

Best Practices, Concepts

Quickest Route to Operational Success and Efficiency is to Empower your Operators


‘Doing more with less’ is the current operational motto. Pushing efficiency to what is possible is the goal. Leverage the most valuable asset operations has: an experienced operations staff. This blog series (Introduction) starts with the value proposition to empowering operational staff.

To empower operators you must have the necessary tools. Legacy correlation engines do have value, though new concepts can make the job so much easier (RID). Correlated, meaningful events can be remediated in an automated fashion. These tools provide a force-multiplying effect that drives operational success (KOALA). Improving efficiency also means finding what falls through the cracks. The key is leveraging historical data to find chronic issues. Staff fix issues as soon as possible so that problems do not compound over time (HELP). Operational managers and service assurance administrators need reporting capabilities. They must know which tools are helping and by how much. They can review the data and re-invest in productivity gains (Reporting).

Empowering operators is not an easy concept to adopt. Understanding the variety of service assurance functionality will challenge the most experienced people. Key to building a successful strategy is learning how to leverage knowledge. Improving efficiency through a “Sum of All” approach is best practice. Understanding the necessary KPIs/KQIs empowers administrators to empower operators.

Choosing vendors for fault management is a time-consuming chore. “Black Box” vendors say one thing. “Open Standards” vendors say another. You must make the best choice for your company to move forward.

If you are interested in learning more about the “Empowering Operators” concept, please schedule a meeting with Monolith using this link.

Concepts, Monolith Software

Avoiding the ‘Integration Tax’ with AssureNow


Today businesses are more interested in solving immediate problems. For service assurance, whether it be a new product rollout or chronic issues with a legacy solution, Band-Aid type solutions are becoming far too common. The result is increased complexity and long-term integration requirements. Integration is becoming challenging to the extent that service organizations are selling service assurance software just for the services. IBM has recently stated in the TM Forum publication that services are 80% of the cost in a new solution. Other vendors have an innovative approach, providing device and vendor integration as part of maintenance, but this causes months of delay for delivery and kills agile business capability. The only real option is the Unified Service Assurance method that AssureNow embodies. This brief illustrates the functionality included with AssureNow that requires no integration services, only administration.


The AssureNow platform is completely controlled through a browser-agnostic, plugin-free web interface. All command-and-control capabilities, such as starting and stopping components, reading log files, and setting up users, are handled through this interface. Installation and updates are done directly through the internet, keeping the administrative tasks of maintaining AssureNow to a minimum. All discovery and configuration can be completely automated, and individual devices can also be added manually. The platform was created to minimize maintenance and allow administrators to focus on growing AssureNow’s value, not just maintaining professional onsite services.


AssureNow Web-based Graphical User Interface



Online Integration Library (300+ Vendors)



The fault management functionality in AssureNow supports a complete end-to-end solution, monitoring faults from any vendor or device type. Rules are packaged by vendor in our online library, which houses 300+ vendors and is growing. New vendor packages can be created by the customer (simple Perl syntax), Monolith (as part of a support contract), or a 3rd party vendor or partner. Importing MIBs is made simple by exporting trap rules. Rules writing is not required. AssureNow’s correlation capabilities are extensive and include everything from simple point-and-click policy creation to RCA. AssureNow includes knowledge management, micro-correlation, runbook automation, and cluster-based correlation capabilities, again without the ‘integration tax’ requirements of competitors. Fault management is the nexus of your incident management process and AssureNow provides countless capabilities without integration requirements.


Fault Management Event List with Tools



Fault Management with Mediawiki Knowledgebase




Performance management is important for understanding the quality and capability of a company’s resources under management. Proactive collection of key performance indicators (KPIs) and key quality indicators (KQIs) allows real-time analytics to generate alerts to fault management. Baseline, burst, or capacity management analytic-driven alerts provide immediate assistance in keeping your operations performing proactively. Collection can be done by point-and-click polling policies leveraging SNMP MIB information, either manually compiled or taken from the 300+ vendor integration library available online 24×7 via the support site. Custom dashboards can be created and exported as reports to provide holistic monitoring of your resources. Complicated rules and formulas are not required to collect, alert, and report on the holistic environment.


Performance Management Performance Report



Performance Management Availability Report



Topology management allows the discovery and collection of information about a company’s environment. Performing basic interface and device configuration inventory enables change management and sends alerts to operations as faults. With discovery, the system will automatically update the polled devices and interfaces while the service assurance solution keeps polling the network holistically. Leveraging discovered layer 2/3 topology data, dynamic point-and-click maps allow a company to visualize the network and perform fault root cause analysis (RCA). From inventory and configuration to topology and RCA, everything is available out-of-the-box without the need to write new rules.


Topology Management Topology Device Map


Topology Management Root Cause Analysis & Downstream Suppression




Viewing the company network from a customer and service perspective can be a challenge for most legacy providers, but not for AssureNow. A simple point-and-click service builder allows the administrator to create service quality management (SQM) metrics as well as business impact analysis correlation to alert operators to higher-priority issues. Service-oriented visualizations are simplified to give a bird’s-eye view of who is impacted, where, and why. Generating powerful business views and reports is possible without the need for rules, developers, and professional services, just the internal subject matter expertise.


Service Management Service Builder


Service Management Service Viewer



Monolith’s AssureNow is built for maximum customization, while its out-of-the-box value allows for powerful capabilities without additional integration and customization. No rules required, an administrator can point-and-click the way to success. Having a completely unified and scalable service assurance solution is challenging, but AssureNow simplifies things by offering a solution with a single code base so the ‘integration tax’ is no longer a challenge.


Best Practices

Friends Don’t Let Friends Buy Black Boxes!

The best solutions to every problem exist, service assurance is no exception. However, the popularity of black box solutions does not mean they are the BEST solutions.


For me, the seminal event in fault management was the rise of Micromuse’s OMNIbus. This was the first solution I had heard of that allowed operations the freedom to choose their own policies. Event management, as introduced by Micromuse, was a blank page, and as head of tools at a service provider, I was able to embed my processes and procedures into the tool. Micromuse was not very proactive, however. Another open tool, called NerveCenter, addressed this lack of proactivity with state-based correlation (even custom micro-correlation). Both tools are still around, but time has not been friendly. They are both costly in services, and the rules and customizations are not portable. Their modular design does not bode well for the Cloud designs and infrastructure popular today, but they certainly disrupted the marketplace with their process flexibility in the 1990s.


Service Assurance vendors today are more interested in providing closed-code, fixed-value, black box solutions that, while simpler to use (appliances, no rules, etc.), have lost their flexibility. Micromuse’s rules were complex, proprietary, and stagnant, but that is no reason why the “rules” and “open” concepts should be abandoned. Open platforms CAN be more complex, but that complexity is manageable if they also embrace standards and have portable designs. Rules done right can be an enabler, not an anchor.


The goal of service assurance vendors should be keeping the easy things easy and difficult ones possible (as said by Larry Wall creator of Perl). The service assurance industry needs to remember who the customer is. They are sophisticated, and they can drive their own value by making the decisions on when customization is required and valuable. We need to empower operations to work better, faster, and cheaper — not replace them with black boxes.

Monolith Software believes in these concepts and empowers the customer with its AssureNow solution. It’s our goal to keep the easy stuff easy and the difficult possible and let them choose their own path — supporting them all the way through it. Customers should demand open and transparent vendor technology, and value should be seen early, often, and from the top to the bottom. A tool only becomes a solution when it meets a business need.

Look for more blogs on this “Empowering Operators” series (using the tag on the left of the blog).

If you are interested in learning more about this best practice discussion, please schedule a meeting with Monolith using this link.

If you are interested in learning how AssureNow helps customers with a unified service assurance solution, check out our case study with Oracle here.


Best Practices

Understanding and Promoting Operational Efficiency KPIs & KQIs


Ask any service provider whether operations could and should be doing a better job, and nobody knows for sure. That’s the challenge. Disparate and legacy service assurance tools muddy the water to the point that nothing can be proved, and confusion is the result. To get serious about improving operations you must first define “better”, which is why key performance and quality indicators exist. This blog focuses on two specific sets of indicators that are designed to help operations track their success. Process success can be determined from mean-time-to-respond and mean-time-to-repair metrics. Determining resource efficiency and allocation through measurement units, then leveraging those baseline metrics, will allow you to perform capacity planning and predict future growth.


Your incident management process should be designed to minimize response to incidents, reduce impact to customers, and shorten maintenance repair times — eliminating business dispatch cost. Nirvana is possible, but it requires discipline and an iterative approach to improvement based upon tracking the data your service assurance platform is generating. Below are key indicators that your service assurance platform should provide to you:

  • Average Incident Response Time (KPI): time it takes from fault occurrence to an operator acknowledging and taking ownership of the incident. Only customer-impacting problems should be tracked, averaged across 30 minutes to 4 hours maximum. This KPI is critical to knowing how attentive your staff is to trouble. A scatter graph of the raw data would provide an understanding of how often problems fall through the cracks.
  • Average Incident Clear Time (KPI): time it takes from fault occurrence to fault clear. Averaging across 30 minutes to 4 hours to maintain as much granularity as possible. This KPI is critical to understanding how long your incident lifecycle is. A scatter graph of the raw data would provide an understanding of the distribution regarding “quick to solve” versus “long to solve” problems. Splitting this KPI by how much time is consumed by Tier (1, 2, or 3) would provide better granularity and business intelligence value.
  • Repeatability (KPI): the percentage of event type and unique incident occurrence rates by day, week, month, quarter, and year. This KPI provides the business a better understanding of what faults repeat and how effective correlation and automated remediation can be.
  • Proactivity (KQI): the percentage of problems (non-service affecting) to outages (service affecting) by day, week, month, quarter, and year. This KQI points out whether your operations team is achieving its goal of preventing customers from being impacted by capacity or technology failures.
  • Average Customer Notification Time (KPI): the time it takes from service impact to customer proactive notification. This is typically calculated across a month and part of most SLO/SLAs set with the customer from legal. This KPI is important for SLA compliance, as well as customer expectations to be informed during service interruption.
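
The two time-based KPIs above can be sketched in a few lines of Python. This is only a minimal illustration, not how AssureNow calculates them: the `Incident` record and its field names are hypothetical, and the 30-minute to 4-hour averaging window is left to the caller, who would pass in only the incidents that fall inside the window of interest.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record; map these fields to whatever your platform exports.
    occurred: datetime                # fault occurrence
    acknowledged: Optional[datetime]  # operator took ownership (None if never)
    cleared: Optional[datetime]       # fault clear (None if still open)
    service_affecting: bool           # True when customers are impacted


def avg_minutes(deltas):
    """Average an iterable of timedeltas in minutes; None when it is empty."""
    seconds = [d.total_seconds() for d in deltas]
    return mean(seconds) / 60.0 if seconds else None


def average_response_minutes(incidents):
    """Average Incident Response Time over customer-impacting incidents only."""
    return avg_minutes(
        i.acknowledged - i.occurred
        for i in incidents
        if i.service_affecting and i.acknowledged is not None
    )


def average_clear_minutes(incidents):
    """Average Incident Clear Time over every incident that has cleared."""
    return avg_minutes(
        i.cleared - i.occurred for i in incidents if i.cleared is not None
    )
```

Feeding the same raw deltas into a scatter graph, instead of averaging them, gives the distribution view the bullets describe.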


Knowing the efficiency of your human resources is vital to improving operational performance and productivity. Not all operators are alike; there are different tiers, skill sets, and technology verticals, but the numbers can be useful when averaged over time. Proper knowledge management and correlation can increase performance. However, investments are required and need to be tracked, and their results are critical to understanding their success. Below are key indicators that your service assurance platform should provide to you:

  • Knowledge Growth (KPI): this value is the amount of intellectual property being embedded into your service assurance platform from knowledgebase articles and remediation policies divided by man hours required to implement them. This KPI will let you know how much investment is being made by the team, tier, and/or operator. The graph will indicate how much smarter and automated you are becoming over time.
  • Knowledge Hit Rate (KPI): this value (percentage) is how often intellectual property is being used in conjunction with resolving an incident. Knowledge is useless if never leveraged. Understanding how often knowledge is successfully used provides a better understanding of how disparate the knowledge is or how much knowledge investment is being wasted.
  • Knowledge Efficiency (KQI): this value (percentage) is how effective knowledge is in resolving problems. This KQI provides immediate feedback on the effectiveness of your knowledge management strategy.
  • Incident Work Distribution (KPI): these values show who is performing work. Each action (acknowledgement, clear, suppress) should be counted across type (automation, manual), tier (1, 2, 3), and down to the operator. These statistics provide key information in understanding where problems are being solved and how. The goal baseline for tiers should be 25% Tier 1, 12.5% Tier 2, 6.25% Tier 3 (Engineering), and everything else should be automated (56.25%). A time distribution can be done to understand when most service-affecting faults occur so that “rush hour” traffic can be staffed accordingly.
  • Operator Efficiency (KQI): these values are calculated using the work distribution KPIs so you can determine by tier how much work an average operator can perform in a given day. A simple mapping of salary to workload, divided by work expected, provides a key understanding of automation ROI and driving cost. Providing HR incentives for operators who are more productive is recommended.
  • Automation Efficiency (KQI): these values are similar to operator efficiency and key to showing automation ROI over knowledge investment rates.
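
As a rough illustration of the Incident Work Distribution KPI, the sketch below counts actions per owner and compares the result with the goal baseline quoted above (25% + 12.5% + 6.25% manual, leaving 56.25% automated). The owner labels and function names are hypothetical placeholders, not anything exported by a real platform.

```python
from collections import Counter

# Goal baseline from the text, in percent of all incident actions.
BASELINE = {"automation": 56.25, "tier1": 25.0, "tier2": 12.5, "tier3": 6.25}


def work_distribution(actions):
    """Return the percentage of actions handled by each owner.

    `actions` holds one owner label per acknowledge/clear/suppress action.
    """
    counts = Counter(actions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {owner: 100.0 * n / total for owner, n in counts.items()}


def gap_to_baseline(distribution):
    """Percentage points above (+) or below (-) the goal for each owner."""
    return {
        owner: distribution.get(owner, 0.0) - goal
        for owner, goal in BASELINE.items()
    }
```

A negative gap for "automation" paired with a positive gap for "tier1" would suggest that Tier 1 is still doing work that remediation policies could absorb.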


The value of tracking operational KPIs and KQIs is simple: knowledge is power. It is the power to shorten response and repair time, improving customer satisfaction. It is the ability to optimize resource allocation, enabling you to do more with less. It empowers management to separate the wheat from the chaff at budget time. Reporting operational efficiency fuels better operations, and AssureNow enables you to be more successful with unified service assurance.


If you are interested in learning more about this concept, please schedule a meeting with Monolith using this link.




Best Practices

ABCs of Harnessing Operational Knowledge


Harnessing internal operational intellectual property should be a prime objective for every management team. A key tenet of incident management is the post-mortem process your operational team can use to prevent chronic problems from re-occurring. If you are capturing your operational knowledge, your sins of the past can be converted into the wisdom of the future – increasing operational performance. What you must do is set the team up for success. With the correct strategy in place everyone can participate and this priority can be accomplished with a positive result.


Social engineering needs to be at the core of your strategy. Incentivizing the capture and use of operational knowledge should be the first focus. Attending lengthy post-mortem and review meetings does not accomplish long-term operational results. Leveraging a web 2.0 or ‘crowd wisdom’ approach allows real-time interaction between team members. The only team review should be when awards are given out for who adds the most high-quality knowledge or leverages the most knowledge to accomplish better operational performance. Keeping team members socializing knowledge and communicating effectively will allow for lower learning curves and better team building. The more incentives operators have to use the knowledge management solution, the more likely they will participate.


Another focus should be on ease. Any knowledge management process that is going to be added needs to dovetail fully with common workflows. Operators are busy and getting them on the same page is difficult, unless it’s easy to participate. Instant messaging and text messaging are key examples of overcoming communication hurdles. Leveraging that same concept, capturing the knowledge needs to be embedded into your incident management process. Remember that any embedding should provide immediate benefit (as discussed above) to the operator. The ease of knowledge management will minimize the cost to the user and increase participation.


The last key pillar of harnessing knowledge management is centered on investing in content creation during product development and engineering. Operators need a base of value to start with; nobody wants to start with a blank slate. If a knowledge base is created in the early phases of product deployment then it will provide the proper example of operations to build from into the future. Such knowledge can also be leveraged during the pilot phase or for testing the new product as well. With AssureNow’s export features, that knowledge can be migrated to production with a controlled package version that’s easy to install. For operations to use knowledge it must exist to demonstrate the value; and doing so in development sets the right example to the business’s commitment to harnessing that knowledge.

Empowering operators to harness knowledge management is critical to increasing operational performance. Operators can only harness knowledge if they participate. They will only participate if properly incentivized, interaction is easy, and the knowledge base is not empty. AssureNow enables operations to harness that knowledge using the RID/HELP and KOALA concepts to drive operational performance.

Look for more blogs on this “Empowering Operators” series (using the tag on the left of the blog).  If you are interested in learning more about this solution and are looking for a demonstration, please schedule a meeting with Monolith using this link.

If you are interested in learning how capabilities like advanced correlation help customers in a unified service assurance solution, check out this case study with Oracle here.

Best Practices

Correlation Should Be a ‘Sum of All Parts’ Approach


The big secret to correlation that no one likes to admit is that correlation is best when applied contextually. ‘Best of Breed’ is a popular concept that leverages this idea — buy a tool for each vertical and value proposition. While this creates value quickly, it causes enormous challenges for operations in the future. “Best of Breed” has one major failing — it causes silos. And, silos are the bane of operations.

Operators need end-to-end visibility and if the information is kept in disparate buckets, then operations limit their capabilities to compensate. Silos can only be fought by integration and professional services. This is the “integration tax” that many talk about avoiding.

AssureNow approaches this situation differently. It has all the correlation models but uses one unified engine to eliminate the “integration tax”. This blog details some of the correlation models available so you can mix-and-match them to maximize your value for incident management.

The first type of correlation AssureNow provides is what we call “base” correlation functionality. This correlation is generic, but helps reduce the event stream through common sense.

  • Deduplication – Simply put, it stacks repeating faults into a single fault with an iterative count. Many fault sources repeat errors so deduplication can reduce your event stream by up to 75%.
  • Generic Clear – Faults typically occur in two types: problems and resolutions. Having a fault system correlate and automatically clear the problem when its resolution pair comes into alarm helps operations be 100% aware of the current situation.
  • Expire – When faults do not automatically clear they tend to be informational or automatically expire. Using an ‘expire’ or ‘reaper’ policy to automatically clear/delete informational faults over time limits the event stream and prevents confusion.
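
To make deduplication and generic clear concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not AssureNow’s engine: the `EventList` class is hypothetical, and keying faults on the (node, summary) pair is simply one plausible choice of identity.

```python
from dataclasses import dataclass


@dataclass
class Fault:
    node: str
    summary: str
    count: int = 1  # iterative count of repeats


class EventList:
    """Toy event list keyed on (node, summary); the key choice is an assumption."""

    def __init__(self):
        self.active = {}

    def problem(self, node, summary):
        """Deduplication: stack a repeating fault onto the existing entry."""
        key = (node, summary)
        if key in self.active:
            self.active[key].count += 1
        else:
            self.active[key] = Fault(node, summary)
        return self.active[key]

    def resolution(self, node, summary):
        """Generic clear: remove the matching problem; True if one was cleared."""
        return self.active.pop((node, summary), None) is not None
```

An expire policy would fit the same structure by storing a last-seen timestamp on each `Fault` and periodically deleting entries older than a chosen age.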


Correlation can be localized to solve simple situations.   AssureNow’s “simple” correlation functionality is detailed below:

  • Heartbeat – Sometimes you simply need to track the existence or absence of an event for a period of time. Environmental alarms (temp, door, power) are typically “wait and see” procedural faults for which heartbeat correlation can simplify operational responses.
  • Time of Day – Faults that occur in time of day windows can be simplified by clearing or escalating the event accordingly. Example: business hour applications must be escalated during 8×5, but are handled differently after hours.
  • XinY Threshold – Some faults only matter if they happen several times over a period of time (X times in Y minutes). Faults like interface errors are typical events that are only actionable if they happen chronically, otherwise they are informational.
  • Stacking/Disparate – Some faults are only important if you see them in concert with another fault. Faults like multiple failures in a cluster are a good example of stacking correlation.
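
The XinY threshold lends itself to a short sketch. The Python below is a hypothetical illustration using a sliding window per fault key; it assumes timestamps are plain seconds and arrive in order, and the class name is invented for the example.

```python
from collections import defaultdict, deque


class XinYThreshold:
    """Flag a fault as actionable once it occurs X times within Y seconds."""

    def __init__(self, x, y_seconds):
        self.x = x
        self.y = y_seconds
        self.seen = defaultdict(deque)  # fault key -> recent timestamps

    def observe(self, key, timestamp):
        """Record one occurrence; return True when the threshold is met."""
        window = self.seen[key]
        window.append(timestamp)
        # Drop occurrences more than Y seconds older than the newest one.
        while timestamp - window[0] > self.y:
            window.popleft()
        return len(window) >= self.x
```

With `XinYThreshold(3, 300)`, a single interface error stays informational, while a third error inside five minutes is escalated.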



Runbook automation (or micro-correlation) is the concept of receiving informational events that need to be researched to determine whether something is a problem or background noise and automating this determination. Individual policies are created to automate individual event types as runbooks. This correlation focuses on two areas:

  • Enrichment – Worst case the runbook policy should enrich the event with what it did and the results. This information can be leveraged by operators to solve the problem without having to research it again.
  • Remediation – Some events are easily solved through manual interaction and some events are false positives that can be safely cleared after being researched.



Tree or topology correlation is leveraging hierarchical relationships to determine a deeper understanding of the events received. Using this information, AssureNow can support a parent-child relationship between the causes or impact and its symptomatic events to reduce the noise and complexity operations have to deal with when fixing problems. Below are some examples of tree-based correlation methods:

  • Root Cause Analysis (Downstream Suppression) – Knowing where the fault fits into the device it came from and how that device fits into the network allows root cause events to be promoted and downstream sympathetic events to be suppressed automatically. This method reduces the number of events in the event stream and allows operators to focus on fixing the problem instead of researching what to fix.
  • Business Impact Analysis (Customer-Circuit Correlation) – Knowing what customers and services the fault impacts allows operations to inform customers promptly and prioritize responses accordingly.
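
Downstream suppression can be sketched with nothing more than a parent map. The Python below is a simplified illustration that assumes a strict tree (each node has at most one upstream parent and there are no loops); real layer 2/3 topologies are richer, and the function name is hypothetical.

```python
def classify_faults(parent_of, down_nodes):
    """Split down nodes into root causes and suppressed symptoms.

    `parent_of` maps each node to its upstream parent (None at the top of
    the tree). A down node is a symptom if any of its ancestors is also down.
    """
    down = set(down_nodes)
    roots, suppressed = set(), set()
    for node in down:
        ancestor = parent_of.get(node)
        # Walk upstream until we hit a down ancestor or the top of the tree.
        while ancestor is not None and ancestor not in down:
            ancestor = parent_of.get(ancestor)
        if ancestor is None:
            roots.add(node)
        else:
            suppressed.add(node)
    return roots, suppressed
```

Business impact analysis follows the same walk in the other direction: from the root cause down to the customers and circuits that hang off it.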


Cluster correlation leverages associative arrays of known faults or fault types and, based upon a percentage of certainty, generates an event or action. These clusters can be built manually by operators or engineers, discovered as part of heuristics, or leveraged in an adaptive rules engine. Cluster correlation allows root cause analysis to be achieved without inventory or topology.
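
A minimal Python sketch of the cluster idea follows. It treats “percentage of certainty” as the fraction of a cluster’s known faults that are currently active, which is only one possible interpretation; the cluster names, fault names, and the 75% default are all invented for the example.

```python
def match_clusters(clusters, active_faults, certainty=0.75):
    """Return {cluster name: fraction matched} for clusters at or above `certainty`.

    `clusters` maps a cluster name to the collection of fault types that
    define it.
    """
    active = set(active_faults)
    hits = {}
    for name, members in clusters.items():
        members = set(members)
        if not members:
            continue  # an empty cluster can never be matched
        score = len(members & active) / len(members)
        if score >= certainty:
            hits[name] = score
    return hits
```

Note that no inventory or topology is consulted: the match depends only on which faults are present together.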

The “Best of Breed” approach for fault management is now possible without the “integration tax” of multiple silos. AssureNow allows the “sum of all parts” approach to fault correlation so operations can maximize performance through minimizing their event stream to actionable events.


If you are interested in learning how capabilities like advanced correlation help customers in a unified service assurance solution, check out this case study with Tele2 here.