Data analytics (dare we say AIOps?) and automation rescue operations teams in times of crisis

Kurt Marko, April 15, 2020
New product releases in the AIOps market come under scrutiny.

(via Pixabay)

The evolution of IT from physical to virtual resources, such as VMs, containers, virtual networks and cloud services, has created an explosion of operational data since each is a source of increasingly granular telemetry. The data deluge has been a mixed blessing for IT admins, DevOps teams and SREs (site reliability engineers), helpfully providing valuable information about the state of systems and applications, but turning problem solving into an endless quest to find smaller needles in an ever-growing haystack. Indeed, the volume of data and diversity of sources have made it virtually impossible to identify systemic anomalies and correlate related events. 

Evangelists of AIOps, a cringe-worthy moniker that has nonetheless become part of the lexicon, have long promised products that would automatically apply advanced statistical tools, machine learning and heuristics to the tasks of anomaly detection, data filtering, problem correlation and trend spotting. Such ‘intelligent systems’ would allow overworked operational staff to spend more time on problem prevention and mitigation and less on data wrangling and firefighting. Recent announcements and product updates from established companies like New Relic and VMware, as well as from smaller AIOps specialists, indicate that the concept has emerged from the shadows of marketing hype into the spotlight of generally released software.

New products illustrate the various ways companies are employing machine learning, data analysis and innovative data structures to analyze operational event streams and metrics. They also reinforce a point I made almost a year ago: that AIOps is a product feature, not a category. As I put it:

AIOps is a natural evolution of IT infrastructure and application management software that incorporates machine and deep learning, a class of algorithms with demonstrated excellence at:

  • Digesting massive quantities of data to find and tag patterns
  • Correlating seemingly unrelated events and features
  • Flagging outliers
  • Setting baselines for normal operations and
  • Ascertaining the probabilistically optimal set of steps to fix problems

As such, AIOps is not a platform, but a feature of many products related to IT operations and DevOps.
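To make two of those capabilities concrete (setting a baseline and flagging outliers), here is a minimal sketch of one common statistical approach: a rolling z-score. It is an illustration only, not how any vendor discussed here implements anomaly detection; the function name, window size, threshold and sample latency values are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def flag_outliers(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate sharply from a rolling baseline.

    The baseline for each sample is the mean of the preceding `window`
    samples; a sample is flagged when it lies more than `threshold`
    standard deviations away from that mean. All names and defaults here
    are hypothetical.
    """
    flagged = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        # A perfectly flat baseline has no spread, so skip it rather than
        # divide by zero.
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical response times in milliseconds with one obvious spike.
latency_ms = [102, 98, 101, 99, 100, 103, 97, 250, 101, 99]
print(flag_outliers(latency_ms))  # prints [7]
```

Production systems typically go well beyond this, for example by accounting for daily and weekly seasonality, but the principle of learning what normal looks like and measuring deviation from it is the same.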

From APM to AIOps - New Relic expands portfolio

Many people still associate New Relic with its popular application performance management (APM) software; however, the company has spent the past decade strategically evolving into a holistic IT monitoring platform. According to New Relic’s Founder and CEO, Lew Cirne, speaking at the company’s Q3 2020 earnings call:

New Relic exists to help our customers create more perfect software, digital customer experiences and businesses. As digital becomes the primary channel for how business is done, companies simply cannot afford that downtime or poor software experiences. … We serve the teams responsible for the performance of digital systems. When something goes wrong, they don't want to have to open up a collection of disparate tools to pinpoint the issue. During the system disruption they want a single platform to see all of the data in context with a unified and simplified user interface.

(Source: New Relic investor presentation, February 4, 2020)

The culmination of the company’s long-term transformation is the New Relic ONE platform, which Cirne described this way:

We want a platform that people can bet their digital business on to make sure that they deliver more perfect software. And it's a combination of the openness of getting all the data into the New Relic platform, not just the data that comes from our agents; the connectedness of showing the relationship between the application, the infrastructure and the logs, and the end user experience. And finally programmability to say, there is no use case that you can't pursue in New Relic as it relates to observability and delivering the visibility present at the right way to help you deliver more perfect software.

(Source: New Relic investor day presentation, December 12, 2019)

New Relic estimates that providing a comprehensive operations monitoring and analysis platform expands its total addressable market (TAM) sevenfold over its historical niche in APM.

(Source: New Relic investor presentation, February 4, 2020)

The final piece of the New Relic ONE puzzle is the just-released New Relic AI, a suite of AIOps features based on technology acquired last year from SignifAI. According to Guy Fighel, GVP, AI product GM and SignifAI co-founder and CTO, the AIOps capability addresses three problems facing today’s site operations and DevOps practitioners:

  • A vanishingly small signal-to-noise ratio in alert streams and a concomitant need to prioritize the most significant alerts.
  • The difficulty of grouping and correlating issues, particularly with modern applications composed of microservices and cloud services. Operations teams, however, don’t want an AI black box; they adopt the principle of ‘trust, but verify’ and require algorithmic transparency to see how a system makes correlations and to understand the linkages.
  • The need to proactively identify and mitigate problems before they turn into wholesale outages, and to reduce the mean time to repair (MTTR).
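The first two problems, alert noise and transparent correlation, can be illustrated with a deliberately simple sketch. It groups alerts that hit the same service within a short time window into one incident and records why each alert was attached, so an operator can inspect the reasoning. This is not New Relic's algorithm; the `Alert` and `Incident` types, the `correlate` function, the five-minute window and the sample alerts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    service: str
    message: str


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)
    reasons: list = field(default_factory=list)  # human-readable audit trail


def correlate(alerts, window_s=300.0):
    """Group alerts on the same service that arrive within `window_s` seconds."""
    incidents = []
    latest_by_service = {}  # most recent incident seen for each service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        incident = latest_by_service.get(alert.service)
        if incident is not None:
            gap = alert.timestamp - incident.alerts[-1].timestamp
            if gap <= window_s:
                # Record the reason so the grouping is not a black box.
                incident.reasons.append(
                    f"same service, {gap:.0f}s after previous alert"
                )
                incident.alerts.append(alert)
                continue
        incident = Incident(service=alert.service, alerts=[alert])
        latest_by_service[alert.service] = incident
        incidents.append(incident)
    return incidents


# Hypothetical alert stream: four alerts collapse into three incidents.
stream = [
    Alert(0.0, "checkout", "CPU above 90%"),
    Alert(60.0, "checkout", "p99 latency above 2s"),
    Alert(90.0, "search", "error rate above 5%"),
    Alert(1000.0, "checkout", "CPU above 90%"),
]
for incident in correlate(stream):
    print(incident.service, len(incident.alerts), incident.reasons)
```

Real products correlate on far richer signals, such as topology, text similarity and learned patterns, but the design choice shown here, keeping an explicit reason alongside every grouping decision, is what lets teams ‘trust, but verify.’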

New Relic’s feature set will be familiar to anyone evaluating other AIOps products like those I first wrote about back in 2018. However, two areas where New Relic distinguishes itself are the database used to store aggregated data and the ease of incorporating existing data sources and external IT operations workflows.

  • New Relic AI uses its proprietary NRDB to store various data types, which Fighel says allows fast access to raw and summarized data, easy data extraction and speedy generation of reports and visualizations. (For details on NRDB, see here).
  • NR AI works with many popular IT operations products and can ingest data from PagerDuty, New Relic Alerts, Splunk, Prometheus, Grafana and Amazon CloudWatch. Users can add other sources via a REST API. The product can then forward analyzed results, including incident context and ML-generated guidance, to incident management software like PagerDuty, ServiceNow, OpsGenie and VictorOps. NR AI can also feed notification and collaboration tools popular with operations teams, such as Slack channels.

Fighel says that NR AI ships with generic pre-built ML models that allow users to see benefits immediately; however, it also improves the models over time based on the operational data in a particular environment. For example, some beta testers reported an 80 percent reduction in spurious, irrelevant alerts after deployment.

VMware promises self-driving operations

Like New Relic, VMware has spent the last few years building out a platform of infrastructure and application management software that has grown to include AIOps features. Also like New Relic, VMware acquired critical technology from a smaller startup, in this case Wavefront, which it bought in 2017. Over time, VMware has infused Wavefront technology into several products, the most recent being Tanzu Observability and vRealize Operations Cloud.

Many of their respective features undoubtedly trace back to Wavefront AI Genie, which used ML to detect performance and event anomalies, reduce the number of false positive alerts (what New Relic calls “alert noise”) and predict capacity needs based on performance and resource usage trends. Like New Relic AI, Tanzu Observability includes an enormous variety of readymade integrations that incorporate data from most popular infrastructure and cloud services.
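Predicting capacity needs from usage trends can be as simple as fitting a line and extrapolating. The sketch below uses an ordinary least-squares fit on daily utilization readings to estimate how many days remain until a resource fills up. It is a generic illustration, not the method used by Wavefront or VMware; the function name, the 100 percent capacity default and the sample readings are assumptions.

```python
def days_until_full(usage_pct, capacity_pct=100.0):
    """Estimate days until a linear usage trend reaches `capacity_pct`.

    `usage_pct` holds one reading per day, oldest first. Returns None when
    there are too few readings or usage is flat or falling, since no
    exhaustion date can be extrapolated in those cases.
    """
    n = len(usage_pct)
    if n < 2:
        return None
    days = range(n)
    x_mean = sum(days) / n
    y_mean = sum(usage_pct) / n
    # Ordinary least-squares slope and intercept.
    slope = sum(
        (x - x_mean) * (y - y_mean) for x, y in zip(days, usage_pct)
    ) / sum((x - x_mean) ** 2 for x in days)
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    # Day index where the trend line crosses capacity, measured from the
    # most recent reading.
    return (capacity_pct - intercept) / slope - (n - 1)


# Hypothetical disk utilization growing two points per day.
disk_usage = [50.0, 52.0, 54.0, 56.0, 58.0]
print(days_until_full(disk_usage))  # prints 21.0
```

A straight line is the crudest possible model; commercial tools generally layer on seasonality and confidence intervals, which is where machine learning earns its keep.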

The just-released vRealize Operations Cloud promises what VMware calls “self-driving operations” by using AI models to provide:

  • Continuous infrastructure optimization by automatically placing and rebalancing workloads based on demand and business requirements.
  • Cost optimization via workload sizing and placement decisions on VMware cloud platforms, including VMware Cloud on AWS.
  • Predictive analytics to identify and analyze potential problems before they cause an outage or degrade performance.
  • Event log analysis via integration with vRealize Log Insight.
  • Automated enforcement of regulatory and security policies across on-premises and AWS VMware deployments.

The Operations Cloud product complements VMware’s previously released vRealize Operations 8.1 software, but adds the convenience and minimal overhead of SaaS.

My take

New Relic and VMware are among the latest and largest companies to reveal their AIOps wares; however, as regular diginomica readers know, they are joined by a host of smaller companies, including two recently profiled by Jerry Bowles.

New Relic and VMware also exemplify contrasting approaches to the AIOps market.

  • New Relic uses AIOps as part of a comprehensive, infrastructure- and application-agnostic operations management platform that can simultaneously replace monitoring and alerting point products and connect with other ITops tools for incident/ticket management, personnel notification and team collaboration.
  • VMware adds AIOps to an existing infrastructure management portfolio as a point of competitive differentiation and a way to reduce tool sprawl, i.e. point products, in VMware shops.

Each company reinforces my earlier contention that AIOps is a feature, not a distinct product category, and that including AI, ML and statistical analytics in infrastructure and application management software has become table stakes. Vendors that don’t ante up will soon be out of the game.
