AIOps - hype aside, ML-infused analytics tackles problems in heterogeneous systems and applications

Kurt Marko, November 20, 2018
Summary:
AIOps is a thing - apparently - but don't let the marketing hype machine turn you off completely. There's plenty of business value to be generated from the judicious combination of current technologies used for optimizing IT operations.

Sophisticated statistical and machine learning analysis of the data inundating enterprises from all directions is typically used to improve business insights about new products, marketing programs or long-term strategies. For business leaders, the software and infrastructure required to optimize and monetize their data is a valuable investment, but what if the same type of data analytics could be used to optimize and operate the infrastructure itself? Such is the promise of a growing class of IT tools variously called IT Operations Analytics (ITOA) and Artificial Intelligence Operations (AIOps).

While the monikers drip with marketing hype (editor's note: what AIOps means about spilled beer in Yorkshire - check it out), the concepts behind them, particularly the application of data aggregation, data lakes, statistical techniques and machine learning to infrastructure management, enable powerful new capabilities for IT operations and applications teams, namely:

  • End-to-end measurement of user experience and application performance.
  • Event correlation across multiple platforms to accelerate infrastructure troubleshooting and incident forensics.
  • Predictive trending of performance and capacity to inform proactive decisions about capacity expansion and configuration optimization.
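
To make the third capability concrete, here is a minimal Python sketch of predictive capacity trending. It is not taken from any product mentioned in this column. It assumes evenly spaced (say, daily) utilization samples and a simple straight-line trend, and the function names `fit_trend` and `days_until` are hypothetical. Commercial tools presumably use richer models that account for seasonality and bursts.

```python
from statistics import mean


def fit_trend(values):
    """Least-squares slope and intercept for evenly spaced samples."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)  # sample index stands in for time (assumption)
    x_bar = mean(xs)
    y_bar = mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def days_until(values, threshold):
    """Estimate samples remaining until the trend line reaches threshold.

    Returns None when utilization is flat or falling.
    """
    slope, intercept = fit_trend(values)
    if slope <= 0:
        return None
    last_index = len(values) - 1
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - last_index)


# Hypothetical daily disk utilization in percent
disk_pct = [50, 52, 54, 56, 58]
remaining = days_until(disk_pct, threshold=80)
```

With the sample data, the trend grows two points per day, so the estimate gives an admin roughly a week and a half of warning before the 80 percent mark.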

Every major IT software and equipment vendor has hopped on the AI bandwagon, often with hyperbolic claims of insightful forecasting prowess that far exceeds the abilities of mere human admins. Indeed, Cisco has wrapped its entire portfolio in a blanket of AI and ML goodness. Unfortunately, the financial ambitions of horizontally integrated broad-line suppliers often lead to a myopic view of enterprise infrastructure in which achieving the full slideware vision requires top-to-bottom adoption of the vendor’s products.

In reality, most organizations use a mix of vendors in various locations and for different categories of IT infrastructure. The hodgepodge of today's IT, as opposed to the idealistic simplicity depicted on marketing slides, requires management products that span both the infrastructure layers, from network edge to core, and the devices, whether mobile clients, wireless access points or data center systems, with features that aren’t tied to a particular product vendor.

A growing number of companies are developing intriguing IT operations products that apply data analytics to network, application and cloud/container infrastructure. The following is a (non-exhaustive) capsule summary of several I've encountered this year.

Network analytics

Nyansa is a startup that has been mostly invisible since its founding in 2013, although it did attract the attention of Intel Capital. The company's Voyance product suite (see image below), with modules for LANs, WANs and clients, analyzes network performance by collecting data from every device and network layer. When used together, the Voyance modules can measure real-time and historical client and application performance end to end by aggregating and analyzing device data from the entire network stack.

While simple in theory, the concept underlying Nyansa’s SaaS products is complicated to implement since its agentless design requires crawling all manner of devices and product implementations, whether they be Cisco wireless LAN controllers and access points, VMware virtual application servers or Juniper WAN routers. The list of data sources Nyansa claims to support is extensive, including:

  • Wi-Fi systems from multiple vendors
  • LAN infrastructure including routers, switches and RADIUS, DNS and DHCP servers
  • WAN circuits
  • Client and IoT devices
  • Applications using exposed APIs

The collected data is correlated to identify related transactions or application streams and analyzed to create performance baselines and detect anomalies. The results can be used for network troubleshooting and design optimization, application support and problem analysis, and IT service benchmarking and trend spotting.
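
As a rough illustration of baselining and anomaly detection in general (not a description of how Voyance works internally), the Python sketch below compares each new sample with the mean and standard deviation of the preceding window and flags large deviations. The function name `find_anomalies`, the window size and the z-score threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, z_threshold=3.0, min_sigma=1e-9):
    """Return indexes of samples that deviate sharply from a rolling baseline.

    The baseline for sample i is the previous `window` samples. A sample is
    flagged when it sits more than `z_threshold` standard deviations from
    the baseline mean.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        # Floor sigma so a perfectly flat baseline does not divide by zero
        sigma = max(stdev(baseline), min_sigma)
        if abs(samples[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical Wi-Fi client latency samples in milliseconds
latency_ms = [20, 21, 19, 20, 22, 21, 95, 20, 21]
flagged = find_anomalies(latency_ms)
```

Here only the 95 ms spike is flagged. One simplification worth noting: the spike then becomes part of later baselines and inflates them, which is one reason real tools tend to use more robust statistics and longer histories.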

Nyansa currently has about a hundred customers spanning retail, technology manufacturing and services, higher education and health care that collectively analyze almost 11 million devices generating trillions of events. One of those is Lululemon, the Canadian company famous for its athletic wear. According to Amit Pindoria, Senior Network Engineer at Lululemon, Nyansa has proven invaluable at measuring performance from the user’s perspective and helping to quickly find the source of problems.

Pindoria says that once Voyance had gathered enough data, it revealed performance degradation on a wireless network that the IT team wasn’t aware of, because staffers were still using wired connections and wireless users didn’t know any better. An added benefit is that Nyansa gives Pindoria and his team the data to show management that changes have worked and to justify future improvements. “Using Nyansa helps me immensely,” says Pindoria, adding that he hopes to expand deployment from the initial few Lululemon offices to the rest of the company.

[Image: Voyance AIOps]

Nyansa has a growing list of competitors applying the same sort of aggregated data analysis to infrastructure management, including:

  • LiveAction, which focuses on network equipment by collecting data using standard protocols like SNMP, NetFlow, IPFIX and device-specific APIs. Its LiveInsight product uses ML to baseline normal behavior and patterns for network devices, applications and users (clients) to enable predictive maintenance and capacity expansion, detect anomalous performance and behaviors and prevent or mitigate application or service disruptions.
  • ExtraHop, which uses advanced data analysis including ML to monitor network performance, security and application performance, and is particularly useful in uncovering performance bottlenecks and other disruptions on SaaS applications where the data path spans internal networks, ISP WAN circuits and cloud provider infrastructure. Initially designed for service providers needing low-level packet analysis, ExtraHop has extended up the network stack to provide a modicum of APM features that can be derived from network data.

More mature management and performance monitoring products from Cisco, ManageEngine, NetScout and SolarWinds have augmented their supported data sources over the years and are beginning to employ data analytics. Indeed, Cisco has added impressive technology from the Tetration acquisition that is focused on security, but extensible to network analysis. In contrast, Splunk has come at intelligent operations from its foundation in event log collection and filtering by adding advanced analysis techniques, including an ML toolkit for creating custom benchmarking, forecasting and anomaly detection tools.

APM and cloud

Data-driven analytics is also being applied to two other critical IT management categories: application performance monitoring and cloud cost management. Many APM vendors have applied machine learning to improve detection of performance anomalies, automatically diagnose their source and predict future performance, and hence the need for added resources, based on historical usage patterns. A sampling of products applying machine learning to APM includes:

  • AppDynamics, which applies data analytics to find the root cause of performance problems, measure baseline performance and predict deviations.
  • Datadog, which uses ML both for network analysis of distributed applications via its Watchdog product and for detecting application performance anomalies.
  • Lightstep, which I profiled last year as an innovator in APM for distributed, microservice-based applications.
  • Micro Focus, which uses machine learning in part to provide predictions and recommendations that can help application teams prevent problems and increase efficiency. Since its merger with HPE's software business, the company is also integrating APM with IT operations management to improve end-to-end monitoring and troubleshooting.
  • Microsoft Application Insights, which uses machine learning in its Smart Detection feature that proactively warns of performance degradation or abnormal behavior patterns. With its Kusto query technology, now offered as Azure Data Explorer, the company also applies ML to Splunk-style log analysis.
  • Nastel, which uses ML and other analytics techniques in its AutoPilot product. AutoPilot correlates events and data from multiple sources to provide end-to-end transaction and performance monitoring across hybrid cloud, legacy and mobile environments, and is used to isolate the source of performance problems and identify trends.
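
Several of the products above describe correlating events from multiple sources. As a deliberately simple sketch of that idea (not any vendor's algorithm), the Python below sorts events by timestamp, chains together those that occur within a short gap of one another, and keeps only the groups that involve more than one source. The event fields `ts`, `source` and `msg`, the function name `correlate` and the 30-second gap are assumptions for illustration.

```python
def correlate(events, max_gap_s=30):
    """Group events that occur close together in time across sources.

    Events are chained into a cluster while each one follows the previous
    event by at most `max_gap_s` seconds. Clusters that involve only a
    single source are dropped, since they show no cross-system link.
    """
    clusters = []
    for event in sorted(events, key=lambda e: e["ts"]):
        if clusters and event["ts"] - clusters[-1][-1]["ts"] <= max_gap_s:
            clusters[-1].append(event)
        else:
            clusters.append([event])
    return [c for c in clusters if len({e["source"] for e in c}) > 1]


# Hypothetical events; timestamps are seconds since some reference point
events = [
    {"ts": 110, "source": "app", "msg": "checkout latency above baseline"},
    {"ts": 100, "source": "dhcp", "msg": "lease pool nearly exhausted"},
    {"ts": 500, "source": "wifi", "msg": "access point rebooted"},
]
related = correlate(events)
```

In this example the DHCP warning and the application slowdown land in the same group, while the unrelated access point reboot is left out. Time proximity alone is a weak signal, so commercial tools presumably also use topology and transaction data to decide which events belong together.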

As I detail in this column, machine learning is also being applied to multi-cloud cost management and optimization to identify inefficiencies and changes that can significantly reduce cloud usage and bills.

My take

The glut of data finding its way into every corner of the enterprise is only useful when it is analyzed, summarized, intelligently extrapolated and, ultimately, acted upon. Although the impetus for incorporating data analytics and various ML and AI techniques into organizations has rightly been to improve business results, we’re entering the second phase of usage in which the same approaches are being used to enhance internal operations. As one of the primary sources of enterprise data, IT shouldn’t let it go to waste.

The rapidly maturing category of ITOA and AIOps products can tremendously improve all aspects of IT operations and service delivery from the lowest layer of network infrastructure to the user experience with individual applications. IT leaders should put evaluating and testing data-driven management and troubleshooting tools on their list of 2019 priorities.
