Open Issues in Network-Centric Performance Management – Are We There Yet?

 

The second of our Roundtables was an extremely well-attended gathering, with over a dozen interested experts participating from Cisco, ADVA, OPNET, InfoVista, TELUS, TATA, and more.

The deliberations this time centered on the experts’ responses to six questions that we had posed to stimulate a more detailed exploration of the topic, and to arrive at a consensus on the issues. The questions we framed were:

  1. What kinds of measurements can be made at the network-level to give insights into the service level? How may these insights be derived? (Conversely, how can service-level measurements help to detect issues at the network level?)
  2. How do you use performance monitoring data? What are the tradeoffs between active and passive measurements? How much data do you keep? How and when do you aggregate this data? How is it used for triage when needed?
  3. Today, metrics are typically aggregated and averaged to make the risk of violation more acceptable. An SLA, for instance, may in many cases just get you a person at the other end, willing to work with you on your problem. How is the eco-system working to change this mode of operation? (What are the various players doing/contributing?)
  4. How does fault isolation across boundaries work? How is fault notification done? How does one determine and assign “responsibility” to the liable network segment/operator today? (Thoughts for how this “should” be done?)
  5. What is the role of standardization in enabling the industry to converge on performance management capabilities?
  6. Are some of the ensuing difficulties in network-centric performance management merely organizational (e.g. siloed operation, need to prove performance between different groups in the same operator)? What are the technical factors that induce the modus operandi we have in operators today?

Of these, it appeared that Questions 2, 3, and 6 were more open-ended, and there were a number of open questions here to which there were no easy answers. Questions 1, 4 and 5, on the other hand, had interesting perspectives, which will be shared at the NANOG56 Panel.

In the course of compiling the issues, however, we came up with a way to classify the questions by category: general, operational, organizational, technical, and technological/future trends. The table below collects the open questions in this space.

Our objective in making these available is to demonstrate the various aspects of this area that still need answers, and need consensus and development within our industry. While not all of them may have immediate answers, they serve as a basis to seed further detailed discussions.

Open Questions in Network-Centric Performance Management

1 General

i) What are key advances in performance monitoring of fundamental network-level parameters/activities that enable the proper and efficient running of a network?

ii) How can performance problems be fixed efficiently and speedily?

iii) What bottlenecks prevent operators from sharing performance management data? What data do they want shared?

iv) What are the problems, and what role are the different players taking in solving them? (Every player should be able to articulate where and how it is playing a role, what more it envisages doing, whether it could play a better role, and whose cooperation it needs.) How does e2e happen, with better interaction between the eco-system players? Answering this also provides part of the solution, namely what is needed to implement it: to choose a vendor with some “X” capabilities, vendors need to be talking to one another.

 

2 Operational

i) Where are we on the ownership and sharing of performance data? How could operators in different segments (or groups within an operator) share this data without revealing internal details, or details that bog down a clear focus on the performance problem? What is making this a necessity in today’s environment? What changes are occurring (in operator practices, vendor offerings, and software solutions) to facilitate that? (This could also be under Technical.) How much of this is really needed to give e2e performance?

ii) How does fault isolation across boundaries work? How is fault notification done? How does one determine and assign “responsibility” to the liable network segment/operator?

iii) How are metrics shared across operator or organizational boundaries, if at all? If not, what, if anything, is shared?

iv) Which metrics are critical for network performance & why?

v)  Today, metrics are aggregated and averaged to make the risk of violation more acceptable. An SLA just gets you a person at the other end, willing to work with you on your problem. Is the eco-system working to change this mode of operation?
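To make the concern about averaging concrete, here is a minimal sketch of how a period-wide mean can look healthy while individual samples breach a bound. The latency samples, the 50 ms threshold, and the function names are all assumptions made up for illustration; they do not come from any real SLA or operator data.

```python
import math
import statistics

# Hypothetical one-way latency samples (ms) for one reporting period:
# 95 "good" samples and 5 "bad" ones. All values are invented.
samples_ms = [20.0] * 95 + [200.0] * 5

SLA_THRESHOLD_MS = 50.0  # assumed bound, purely illustrative


def mean_latency(samples):
    """Average over the whole period, as an aggregated report might show."""
    return statistics.fmean(samples)


def violation_ratio(samples, threshold):
    """Fraction of individual samples that exceed the threshold."""
    violations = sum(1 for s in samples if s > threshold)
    return violations / len(samples)


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    print("mean (ms):", mean_latency(samples_ms))
    print("share of samples over threshold:",
          violation_ratio(samples_ms, SLA_THRESHOLD_MS))
    print("99th percentile (ms):", percentile(samples_ms, 99))
```

In this made-up data set the mean stays below the assumed threshold even though one sample in twenty exceeds it, which is the effect the question alludes to. A percentile or a violation count exposes what the average hides.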

 

3 Organizational/Business

i) Are some of the ensuing difficulties in network-centric performance management merely organizational (e.g. siloed operation, need to prove performance between different groups in the same operator)? What are the technical factors that induce the modus operandi we have in operators today?

ii) Who owns the measurement data? How is it exchanged between parts of an organization? What bottlenecks – operational or business – prevent operators from sharing the data?

iii) Even internally today there are siloed tools across departments in the same operator. Real-time sharing of information, instead of troubleshooting based on trouble tickets, would be faster and more efficient. Where are we on that?

iv) If operators do not expose their SLAs and results, they will be undercut on managed services, as enterprises will use simple broadband connections to carry much of their data with reasonable quality. So, how do they prevent this?

 

4 Technical

i) What is the role of standardization in enabling the industry to converge on performance management capabilities? Standards efforts like IPPM (IP Performance Metrics, in the IETF) did a good job of defining metrics, while PMOL did not garner much interest, and folded. What else is happening today? NIST, for example, is working on cloud-level metrics and SLAs, perhaps because of the government’s interest in using the cloud extensively.

ii) How can we get consistent metrics, and consistency in measurements? Is jitter measured the same way by different operators? How can this be standardized? What role, if any, are standards playing?

iii) What are the advances in real-time collection and processing of data, and how do they aid performance management in today’s complex IP/Ethernet networks?

iv) Are there limits from software and systems? If the latter, what is the eco-system doing to remove those?

v)  What is the contribution of the network (or a network segment) to application performance? How does advanced system design enable an operator to better determine/track that?

vi) What can be done with network-element level and segment-level measurements to aid end-to-end performance management? How is that done/achieved?

How are segment-level measured parameters made useful for an operator? (Is there no meaningful network-centric value of a parameter end-to-end? The path is stitched together to support the application, and the application performance is e2e. Looking at network element-level and segment-level measurements, how does one derive something that is meaningful end-to-end?)

vii)  Operators (or groups within operators) monitor their segment, and provide limited reports to partners, so how do you debug based on reports?

viii) By and large, there is no sharing of information between ISPs today. Doing that reliably and seamlessly needs consistent SLA definitions, and even consistent metrics.

(If SLA definitions are not consistent, how are SLAs offered today?  Apparently they are very coarse, and largely enforced when the customer comes and informs the operator. Very little, if any, proactive performance measurement is done.)

ix) What kinds of measurements can be made at the network-level to give insights into the service level?

x) Conversely, what measurements can be made at the service level that could help to detect underlying issues at the network level? How do you translate issues at the service level into actionable information about the network?

xi) Even if the network says it needs a delay of X ms, how is that coordinated behind the scenes, and what are the challenges in doing so?

xii) When something goes wrong in the path or service, who is responsible for figuring out where the problem is, especially if everyone is making measurements in their own segment of the network?

xiii) Where does complexity come from? Kinds of complexities? Difficulties they impose?

xiv) What are the problems, and what role are the different players taking in solving them? Every player should be able to articulate where and how it is playing a role.
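On the question above of whether jitter is measured the same way by different operators, the following minimal sketch shows how three ways of summarising delay variation can give different numbers for the same packets. The delay samples and function names are hypothetical. The smoothed estimator follows the interarrival jitter formula of RFC 3550, while the other two (mean absolute consecutive difference and peak-to-peak range) are simply plausible alternatives chosen for illustration, not a claim about what any particular operator uses.

```python
# Hypothetical one-way delays (ms) of five consecutive packets; invented values.
delays_ms = [10.0, 12.0, 10.0, 30.0, 10.0]


def mean_abs_delay_variation(delays):
    """Mean absolute difference between consecutive one-way delays."""
    diffs = [abs(cur - prev) for prev, cur in zip(delays, delays[1:])]
    return sum(diffs) / len(diffs)


def rfc3550_style_jitter(delays):
    """Smoothed estimator in the style of RFC 3550: J += (|D| - J) / 16,
    where D is the delay difference between consecutive packets."""
    jitter = 0.0
    for prev, cur in zip(delays, delays[1:]):
        d = abs(cur - prev)
        jitter += (d - jitter) / 16.0
    return jitter


def peak_to_peak(delays):
    """Range between the largest and smallest observed delay."""
    return max(delays) - min(delays)


if __name__ == "__main__":
    print("mean abs variation (ms):", mean_abs_delay_variation(delays_ms))
    print("RFC 3550 style jitter (ms):", rfc3550_style_jitter(delays_ms))
    print("peak-to-peak (ms):", peak_to_peak(delays_ms))
```

All three values could reasonably be reported as "jitter", yet they differ substantially on this sample, which is why a shared definition matters before metrics are compared across operators.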

 

5 Technological Advancements

i) Which advances are designed to enable proactive performance management, as opposed to reactive performance management? (What is the benefit of PM tools?)

ii) With network complexity and scale, automated network enforcement actions could be valuable. However, they are perceived as complicated and risky. What is the eco-system doing to facilitate these? Why or why not?

iii) Active vs passive measurements? How much of each? Why? When?

iv) How much data do you keep? How and when do you aggregate it? And how do you then use it for triage when a problem occurs?

v)  How to control the clutter of redundant notifications (from multiple layers)? How to correlate alarms?

 

What are your thoughts? Do you think we’ve covered the key issues? Are there aspects you come across as a hardware/software vendor, operator or software-services provider that we should cover?

Do share your thoughts below and/or after registering. Register at  http://bit.ly/QFrnaQ to be part of this ongoing dialog, and to submit questions/thoughts that we can tackle in the on-line panel and beyond!

Note that by submitting your questions, you enable our NANOG56 panel to go international (!) – as we will be able to address questions from any colleague across the globe!

So, do take advantage of this unique opportunity to participate in this important debate. We look forward to your thoughts!

If you or your company would like to get involved with this initiative, and contribute to it, please contact Dr. Vishal Sharma at vsharma AT metanoia-inc DOT com or call +1 650-641-0082.

The companies cooperating in this initiative are:

————————————————————————————————————————————————–
Metanoia, Inc. has consistently been a leader in bringing the eco-system’s focus to carrier-centric issues. If you would like to contribute to, participate in, or have a suggestion about our recent initiatives, write us at initiatives@metanoia-inc.com or comment on this blog. To be involved in the current ongoing discourse, write to Dr. Vishal Sharma at vsharma@metanoia-inc.com or call +1 650-641-0082.
 
Our industry-leading Provider Network Health Assessment Service is an amalgamation of a decade+ of experience working in the carrier-ecosystem, and is uniquely designed to deliver strategic and technical expertise to operators via a series of flexible Service Packages. For details, and a representative case study, visit http://www.metanoia-inc.com#NHAS. To reserve a Strategy Session to brainstorm your needs, reach us at experts@metanoia-inc.com or +1 650-641-0082.