Eco-System Player Views on Performance Management: Nuggets from Our International Panel

Readers of this blog know that performance management and monitoring is a focus of ours, and that we have spearheaded an industry initiative on the subject through roundtables, discussions, and panels. As part of that effort, we recently concluded a very interesting and well-attended online international panel titled “Smart Monitoring and Performance Management for Operational Efficiency!” on July 17, 2012.

In this post, I’m excited to share some notable points made by the panelists, taking you through them one by one.

Let’s start with Chris Woodfield of Twitter, who explained what performance management means to an over-the-top (OTT) operator like Twitter. Chris began by focusing on which of the standard network metrics matter most for which applications. For instance, financial trading applications put a premium on latency and availability, while interactive video requires jitter and packet loss to stay below defined thresholds. So, the network’s challenge is: how does it ensure that the metrics for the diverse applications running over it are met and, moreover, that it can detect impaired application performance and act to correct it?
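To make the idea of per-application metrics concrete, here is a minimal sketch (not something shown on the panel) of checking one set of measurements against different limits for different applications. The application names, metric names, and threshold values are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    latency_ms: float
    jitter_ms: float
    loss_pct: float
    availability_pct: float


# Hypothetical limits: each application only lists the metrics it cares about.
THRESHOLDS = {
    "financial_trading": {"latency_ms": 5.0, "availability_pct": 99.99},
    "interactive_video": {"jitter_ms": 30.0, "loss_pct": 1.0},
}

# Metrics where a larger value is better; all others must stay below the limit.
HIGHER_IS_BETTER = {"availability_pct"}


def violations(app, m):
    """Return the metrics for which measurement m breaks the limits of app."""
    broken = []
    for metric, limit in THRESHOLDS[app].items():
        value = getattr(m, metric)
        if metric in HIGHER_IS_BETTER:
            bad = value < limit
        else:
            bad = value > limit
        if bad:
            broken.append(metric)
    return broken


sample = Measurement(latency_ms=12.0, jitter_ms=40.0, loss_pct=0.2,
                     availability_pct=99.999)
print(violations("financial_trading", sample))
print(violations("interactive_video", sample))
```

The same measurement fails for different reasons depending on the application, which is the point Chris was making: a single set of network numbers has to be read against each application’s own requirements.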

Chris’s first observation was that, unlike traditional operators, for whom the network is the product (and so network metrics are the metrics), the Twitter network (a combination of owned and leased infrastructure) exists to serve the Twitter application. Thus, the product is the application, and the relevant metrics are application-layer performance parameters, like application response time, page load speeds, L4 connection rates, L4 connection handling speeds, and so on. Twitter deals with very high L4 connection rates, which stress load balancers and flow-based monitoring tools. (Interestingly, a lot of traffic on Twitter comes from mobile users/feature phones.) So, Twitter typically starts with higher-layer solutions to performance problems, and turns to network-level solutions later.

By contrast, Geoffrey Holan of TELUS pointed out that for infrastructure operators, end-to-end performance management is critical because they carry a large number of diverse applications, and so must be able to ensure quality uniformly for all of them.

In the case of TELUS, they began a transformation to a converged IP/MPLS network roughly a decade ago, starting with their core network. Over the years, this process moved outwards, and today their converged IP/MPLS network extends all the way to the metro and access segments. It is interesting to note that this network convergence was reflected in a deliberate combining of the operations staff on the wireless and wireline sides. This resulted in an expansion of the skills of the staff and much tighter integration of operational processes. The result was a leaner, but more experienced, organization that could handle performance problems quickly and completely, providing faster diagnosis and resolution.

It was also fascinating to learn how social media has altered customer response times (!), and how operators have had to integrate it into their customer service. For instance, the operations staff now not only has to worry about network performance, but must also keep track of what their customers are saying on social media outlets, such as Twitter. An excessive delay in the text message service today shows up as dissatisfied tweets from affected customers. In this age of transparency,

Meanwhile, Ning So of TATA Communications shared some valuable thoughts on the handful of big problems facing operators today, and on what is needed from the eco-system to solve them. Ning observed that one of the biggest challenges in seamless end-to-end performance management is the mobility of the client/user as well as the mobility of the data (which is now often in a multi-tenant, multi-protocol data center; to learn more about the role of the data center, see here http://bit.ly/O3SAqe). With this complexity in the middle-mile network, the existing performance management tools often break down across network boundaries.

The second challenge is to coordinate performance across different operator domains. For example, a user expects his/her applications to work seamlessly on the road, in the office, and at home. Making that happen, however, is much easier said than done! This is because, in Ning’s view, support for broadband data roaming, and performance management on top of it, is practically non-existent today. (Note to eco-system players: if you disagree or have a product that proves otherwise, let’s hear your views in the Comments!)

Then there were lessons coming from the eco-system players. In particular, Vikas Trehan of InfoVista and Cengiz Alaettinoglu of Packet Design talked about the interaction of the application with the network, especially in an age of increasing network complexity, and the role of network routing in improving application performance.

Vikas pointed out that with greater infrastructure complexity today (multiple vendors, leased + owned capacity, a slew of technologies, multiple transport media, and a mix of topologies) operators must worry about the data center (application servers, load balancing), IP transport (network engineering), the mobile packet core (multi-vendor management), IP backhaul (service quality, latency) and the radio network (cell performance & engineering)! Whew! So, an operator could be managing one or more of these network segments and must coordinate activity with upstream and downstream providers. What this requires is OSS co-operation, visibility into all these segments, and the ability to piece together information from them to provide suitable analytics to ops and engineering.

Cengiz made the valuable observation that when considering e2e performance management, it is often assumed that network routing is not a factor, and is working perfectly! Application performance, however, can be (and is) seriously impacted by poor or incorrect routing. For instance, voice call quality can suffer if voice calls are inadvertently routed over sub-optimal paths, a fact that the operator may not realize immediately unless proactive monitoring and analysis is done. Such monitoring can save the operator precious opex simply by giving insights into how to improve routing! Cengiz shared a case study that Packet Design did working with a large financial customer. They analyzed the routing of calls and call quality post-facto, and discovered (when pairing that information with the corresponding routes for the calls) that many of the calls were being routed over highly sub-optimal paths (they were switched there during an outage, or a maintenance window, but never switched back!). This discovery helped the customer to correct routing, and thereby fix poor application-level performance.
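As a rough illustration of the kind of pairing Cengiz described, the sketch below compares the cost of the path each call actually took with the best known cost between the same endpoints, and flags calls that are well above it. This is not Packet Design’s method; the record fields (`id`, `src`, `dst`, `path_cost`), the `best_cost` table, and the 20% tolerance are all assumptions made for the example.

```python
def flag_suboptimal_calls(calls, best_cost, tolerance=1.2):
    """Return ids of calls whose path cost exceeds tolerance * best known cost.

    calls: list of dicts with hypothetical keys id, src, dst, path_cost.
    best_cost: dict mapping (src, dst) to the best known path cost.
    """
    flagged = []
    for call in calls:
        best = best_cost.get((call["src"], call["dst"]))
        if best is None:
            # No reference route for this pair, so nothing to compare against.
            continue
        if call["path_cost"] > tolerance * best:
            flagged.append(call["id"])
    return flagged


# Hypothetical call records paired with the routes they took.
calls = [
    {"id": "c1", "src": "A", "dst": "B", "path_cost": 10},
    {"id": "c2", "src": "A", "dst": "B", "path_cost": 30},
    {"id": "c3", "src": "A", "dst": "C", "path_cost": 5},
]
best_cost = {("A", "B"): 10}

print(flag_suboptimal_calls(calls, best_cost))
```

A call that stayed on a backup path after an outage or maintenance window would stand out in a check like this, because its path cost remains well above the best available one.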

Finally, Aamer Akhter of Cisco spoke about identifying that there is a problem, and where the problem is. He discussed the philosophy behind MediaNet, an application-level service (that, nonetheless, talks to the underlying systems, pulling stats from them) that allows problems in the end-to-end path to be discovered. This is contingent, however, on the underlying systems being willing to share that information, which could be an issue when crossing different domains. So, while the capability exists, this is not a completely solved problem yet.

To get detailed insights into what these experts shared, please register on the page here http://bit.ly/LknqIs, and participate in the dialog! You will receive a sequence of emails that takes you through the evolution of the thought process in the Roundtables leading up to the panel, and provides you with the panelists’ slides as well, thus giving you a complete picture and equipping you to make contributions of your own, which we truly welcome!

To listen to the podcasts and read the posts that built up to this Panel and beyond, please go here.

If you or your company would like to get involved with this initiative, and contribute to it, please contact Dr. Vishal Sharma at vsharma AT metanoia-inc DOT com or call +1 650-641-0082.

The companies cooperating in this initiative are:

————————————————————————————————————————————————–
Metanoia, Inc. has consistently been a leader in bringing the eco-system’s focus to carrier-centric issues. If you would like to contribute to, participate in, or have a suggestion about our recent initiatives, write us at initiatives@metanoia-inc.com or comment on this blog. To be involved in the current ongoing discourse, write to Dr. Vishal Sharma at vsharma@metanoia-inc.com or call +1 650-641-0082.
 
Our industry-leading Provider Network Health Assessment Service is an amalgamation of a decade+ of experience working in the carrier eco-system, and is uniquely designed to deliver strategic and technical expertise to operators via a series of flexible Service Packages. For details, and a representative case study, visit http://www.metanoia-inc.com#NHAS. To reserve a Strategy Session to brainstorm your needs, reach us at experts@metanoia-inc.com or +1 650-641-0082.